Configure Kubernetes Plugins#
This guide provides an overview of setting up the Kubernetes operator backend plugins in your Flyte deployment.
Spin up a cluster#
Enable the PyTorch plugin on the demo cluster by adding the following block to ~/.flyte/sandbox/config.yaml:
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      pytorch: pytorch
    enabled-plugins:
      - container
      - k8s-array
      - sidecar
      - pytorch
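Once the demo cluster is running with this config, tasks of the pytorch type are handed to the Kubeflow training-operator installed below. As a rough user-side sketch (assuming the flytekitplugins-kfpytorch package is installed in your workflow image; the exact config fields can vary by plugin version):
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch  # assumed plugin package (flytekitplugins-kfpytorch)

# Submitted to the training-operator as a PyTorchJob with two workers.
@task(task_config=PyTorch(num_workers=2))
def train_model() -> None:
    ...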
Enable the TensorFlow plugin on the demo cluster by adding the following block to ~/.flyte/sandbox/config.yaml:
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      tensorflow: tensorflow
    enabled-plugins:
      - container
      - k8s-array
      - sidecar
      - tensorflow
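With this config in place, tensorflow tasks are also handled by the training-operator. A minimal sketch, assuming the flytekitplugins-kftensorflow package (field names vary by plugin version):
from flytekit import task
from flytekitplugins.kftensorflow import TfJob  # assumed plugin package (flytekitplugins-kftensorflow)

# Submitted to the training-operator as a TFJob with chief, PS, and worker replicas.
@task(task_config=TfJob(num_workers=2, num_ps_replicas=1, num_chief_replicas=1))
def train_model() -> None:
    ...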
Enable the MPI plugin on the demo cluster by adding the following block to ~/.flyte/sandbox/config.yaml:
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      mpi: mpi
    enabled-plugins:
      - container
      - k8s-array
      - sidecar
      - mpi
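Similarly, mpi tasks are dispatched to the training-operator. A hedged sketch, assuming the flytekitplugins-kfmpi package (field names vary by plugin version):
from flytekit import task
from flytekitplugins.kfmpi import MPIJob  # assumed plugin package (flytekitplugins-kfmpi)

# Submitted to the training-operator as an MPIJob with one launcher and two workers.
@task(task_config=MPIJob(num_workers=2, num_launcher_replicas=1, slots=1))
def train_model() -> None:
    ...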
Enable the Ray plugin on the demo cluster by adding the following block to ~/.flyte/sandbox/config.yaml:
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      ray: ray
    enabled-plugins:
      - container
      - k8s-array
      - sidecar
      - ray
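Once enabled, ray tasks are executed by the KubeRay operator installed below. A minimal sketch, assuming the flytekitplugins-ray package is available:
from flytekit import task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig  # assumed plugin package (flytekitplugins-ray)

# The backend spins up an ephemeral Ray cluster (one head node, two workers) for this task.
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)

@task(task_config=ray_config)
def ray_task() -> None:
    ...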
Enable the Spark plugin on the demo cluster by adding the following config to ~/.flyte/sandbox/config.yaml:
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      spark: spark
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - spark
plugins:
  spark:
    spark-config-default:
      - spark.driver.cores: "1"
      - spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
      - spark.hadoop.fs.s3a.endpoint: "http://minio.flyte:9000"
      - spark.hadoop.fs.s3a.access.key: "minio"
      - spark.hadoop.fs.s3a.secret.key: "miniostorage"
      - spark.hadoop.fs.s3a.path.style.access: "true"
      - spark.kubernetes.allocation.batch.size: "50"
      - spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
      - spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      - spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
      - spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      - spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
      - spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
      - spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
cluster_resources:
  refreshInterval: 5m
  customData:
    - production:
        - projectQuotaCpu:
            value: "5"
        - projectQuotaMemory:
            value: "4000Mi"
    - staging:
        - projectQuotaCpu:
            value: "2"
        - projectQuotaMemory:
            value: "3000Mi"
    - development:
        - projectQuotaCpu:
            value: "4"
        - projectQuotaMemory:
            value: "5000Mi"
  refresh: 5m
Also add the following cluster resource templates to the ~/.flyte/sandbox/cluster-resource-templates directory:
serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: "{{ namespace }}"
  annotations:
    eks.amazonaws.com/role-arn: "{{ defaultIamRole }}"
spark_role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: "{{ namespace }}"
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - services
      - configmaps
    verbs:
      - "*"
spark_service_account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: "{{ namespace }}"
  annotations:
    eks.amazonaws.com/role-arn: "{{ defaultIamRole }}"
spark_role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: "{{ namespace }}"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-role
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: "{{ namespace }}"
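With the plugin enabled and these templates in place, spark tasks declare their Spark settings through the task config and run as ephemeral Spark applications managed by the Spark operator. A minimal sketch, assuming the flytekitplugins-spark package is installed:
from flytekit import current_context, task
from flytekitplugins.spark import Spark  # assumed plugin package (flytekitplugins-spark)

# Each execution gets an ephemeral Spark cluster driven by the Spark operator.
@task(task_config=Spark(spark_conf={"spark.executor.instances": "2"}))
def spark_task() -> int:
    sess = current_context().spark_session  # Spark session injected by the plugin
    return sess.sparkContext.parallelize(range(100)).count()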
Enable the Dask plugin on the demo cluster by adding the following block to ~/.flyte/sandbox/config.yaml:
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      dask: dask
    enabled-plugins:
      - container
      - k8s-array
      - sidecar
      - dask
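Once enabled, dask tasks are served by the Dask operator installed below. A minimal sketch, assuming the flytekitplugins-dask package is available:
from flytekit import task
from flytekitplugins.dask import Dask, WorkerGroup  # assumed plugin package (flytekitplugins-dask)

# The backend launches a Dask scheduler plus a two-worker group for this task.
@task(task_config=Dask(workers=WorkerGroup(number_of_workers=2)))
def dask_task() -> None:
    ...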
Start the demo cluster by running the following command:
flytectl demo start
Alternatively, install Flyte using the flyte-binary helm chart.
If you have installed Flyte using the flyte-core helm chart, please ensure:
You have the correct kubeconfig and have selected the correct Kubernetes context.
You have configured the correct flytectl settings in ~/.flyte/config.yaml.
Note
Add the Flyte chart repo to Helm if you’re installing via the Helm charts.
helm repo add flyteorg https://flyteorg.github.io/flyte
Install the Kubernetes operator#
First, install kustomize.
Build and apply the training-operator.
export KUBECONFIG=$KUBECONFIG:~/.kube/config:~/.flyte/k3s/k3s.yaml
kustomize build "https://github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.5.0" | kubectl apply -f -
Optional: Using a gang scheduler
To address potential issues with worker pods of distributed training jobs being scheduled at different times due to resource constraints, you can opt for a gang scheduler. This ensures that all worker pods are scheduled simultaneously, reducing the likelihood of job failures caused by timeout errors.
To enable gang scheduling for the Kubeflow training-operator, you can install the Kubernetes scheduler plugins or the Apache YuniKorn scheduler.
Install the scheduler plugin or Apache YuniKorn as a second scheduler.
Configure the Kubeflow training-operator to use the new scheduler:
Create a manifest called kustomization.yaml with the following content:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/kubeflow/training-operator/manifests/overlays/standalone
patchesStrategicMerge:
  - patch.yaml
Create a patch file called patch.yaml with the following content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
        - name: training-operator
          command:
            - /manager
            - --gang-scheduler-name=<scheduler-plugins/yunikorn>
Install the patched kustomization with the following command:
kustomize build path/to/overlay/directory | kubectl apply -f -
(Only for Apache YuniKorn) To configure gang scheduling with Apache YuniKorn, make sure to set the following annotations in Flyte pod templates:
template.metadata.annotations.yunikorn.apache.org/task-group-name
template.metadata.annotations.yunikorn.apache.org/task-groups
template.metadata.annotations.yunikorn.apache.org/schedulingPolicyParameters
For more configuration details, refer to the Apache YuniKorn Gang-Scheduling documentation.
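As an illustration, these annotations can be supplied through a flytekit pod template. The sketch below assumes flytekit's PodTemplate accepts annotations and uses placeholder task-group values; consult the YuniKorn documentation for the exact value format:
from flytekit import PodTemplate, task

# Illustrative task-group values only; adjust names and resources to your workload.
yunikorn_template = PodTemplate(
    annotations={
        "yunikorn.apache.org/task-group-name": "training-workers",
        "yunikorn.apache.org/task-groups": '[{"name": "training-workers", "minMember": 3, "minResource": {"cpu": "1", "memory": "1Gi"}}]',
        "yunikorn.apache.org/schedulingPolicyParameters": "gangSchedulingStyle=Hard",
    },
)

@task(pod_template=yunikorn_template)
def train_model() -> None:
    ...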
Use a Flyte pod template with template.spec.schedulerName: scheduler-plugins-scheduler to use the new gang scheduler for your tasks. For more information on pod templates in Flyte, refer to the Using K8s PodTemplates section. You can set the scheduler name in the pod template passed to the @task decorator. However, to avoid resource competition between the two different schedulers, it is recommended to set the scheduler name in the pod template in the flyte namespace, which is applied to all tasks. This allows non-distributed training tasks to be scheduled by the gang scheduler as well.
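As an illustration of the per-task route, the sketch below sets the scheduler name through a pod template passed to the @task decorator; it assumes flytekit's PodTemplate and the kubernetes Python client are available in your environment:
from flytekit import PodTemplate, task
from kubernetes.client import V1Container, V1PodSpec

# Route this task's pods to the gang scheduler installed above.
@task(
    pod_template=PodTemplate(
        pod_spec=V1PodSpec(
            containers=[V1Container(name="primary")],
            scheduler_name="scheduler-plugins-scheduler",
        ),
    )
)
def train_model() -> None:
    ...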
To install the Ray Operator, run the following commands:
export KUBERAY_VERSION=v0.5.2
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}&timeout=90s"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}&timeout=90s"
To add the Spark repository, run the following command:
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
To install the Spark operator, run the following command:
helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace
To add the Dask repository, run the following command:
helm repo add dask https://helm.dask.org
To install the Dask operator, run the following command:
helm install dask-operator dask/dask-kubernetes-operator --namespace dask-operator --create-namespace
Specify plugin configuration#
If you are using the flyte-binary Helm chart, edit the relevant YAML file to specify the plugin:
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - pytorch
    default-for-task-types:
      - container: container
      - container_array: k8s-array
      - pytorch: pytorch
If you are using the flyte-core Helm chart, create a file named values-override.yaml and add the following config to it:
configmap:
  enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          pytorch: pytorch
If you are using the flyte-binary Helm chart, edit the relevant YAML file to specify the plugin:
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - tensorflow
    default-for-task-types:
      - container: container
      - container_array: k8s-array
      - tensorflow: tensorflow
If you are using the flyte-core Helm chart, create a file named values-override.yaml and add the following config to it:
configmap:
  enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - tensorflow
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          tensorflow: tensorflow
If you are using the flyte-binary Helm chart, edit the relevant YAML file to specify the plugin:
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - mpi
    default-for-task-types:
      - container: container
      - container_array: k8s-array
      - mpi: mpi
If you are using the flyte-core Helm chart, create a file named values-override.yaml and add the following config to it:
configmap:
  enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - mpi
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          mpi: mpi
If you are using the flyte-binary Helm chart, edit the relevant YAML file to specify the plugin:
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - ray
    default-for-task-types:
      - container: container
      - container_array: k8s-array
      - ray: ray
If you are using the flyte-core Helm chart, create a file named values-override.yaml and add the following config to it:
configmap:
  enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - ray
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          ray: ray
To specify the Spark plugin when using the flyte-core Helm chart, create a file named values-override.yaml and add the following config to it:
cluster_resource_manager:
enabled: true
config:
cluster_resources:
refreshInterval: 5m
templatePath: "/etc/flyte/clusterresource/templates"
customData:
- production:
- projectQuotaCpu:
value: "5"
- projectQuotaMemory:
value: "4000Mi"
- staging:
- projectQuotaCpu:
value: "2"
- projectQuotaMemory:
value: "3000Mi"
- development:
- projectQuotaCpu:
value: "4"
- projectQuotaMemory:
value: "3000Mi"
refresh: 5m
# -- Resource templates that should be applied
templates:
# -- Template for namespaces resources
- key: aa_namespace
value: |
apiVersion: v1
kind: Namespace
metadata:
name: {{ namespace }}
spec:
finalizers:
- kubernetes
- key: ab_project_resource_quota
value: |
apiVersion: v1
kind: ResourceQuota
metadata:
name: project-quota
namespace: {{ namespace }}
spec:
hard:
limits.cpu: {{ projectQuotaCpu }}
limits.memory: {{ projectQuotaMemory }}
- key: ac_spark_role
value: |
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
name: spark-role
namespace: {{ namespace }}
rules:
- apiGroups: ["*"]
resources:
- pods
verbs:
- '*'
- apiGroups: ["*"]
resources:
- services
verbs:
- '*'
- apiGroups: ["*"]
resources:
- configmaps
verbs:
- '*'
- key: ad_spark_service_account
value: |
apiVersion: v1
kind: ServiceAccount
metadata:
name: spark
namespace: {{ namespace }}
- key: ae_spark_role_binding
value: |
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
name: spark-role-binding
namespace: {{ namespace }}
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: spark-role
subjects:
- kind: ServiceAccount
name: spark
namespace: {{ namespace }}
sparkoperator:
enabled: true
plugin_config:
plugins:
spark:
# Edit the Spark configuration as you see fit
spark-config-default:
- spark.driver.cores: "1"
- spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
- spark.kubernetes.allocation.batch.size: "50"
- spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
- spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
- spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
- spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
- spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
- spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
- spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
- spark.network.timeout: 600s
- spark.executorEnv.KUBERNETES_REQUEST_TIMEOUT: 100000
- spark.executor.heartbeatInterval: 60s
configmap:
enabled_plugins:
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- k8s-array
- spark
default-for-task-types:
container: container
sidecar: sidecar
container_array: k8s-array
spark: spark
If you are using the flyte-binary Helm chart, edit the relevant YAML file to specify the plugin:
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - dask
    default-for-task-types:
      - container: container
      - container_array: k8s-array
      - dask: dask
If you are using the flyte-core Helm chart, create a file named values-override.yaml and add the following config to it:
configmap:
  enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - dask
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          dask: dask
Upgrade the deployment#
If you are installing Flyte via the Helm chart, run the following command:
Note
There is no need to run helm upgrade for Spark.
helm upgrade <RELEASE_NAME> flyteorg/flyte-binary -n <YOUR_NAMESPACE> --values <YOUR_YAML_FILE>
Replace <RELEASE_NAME> with the name of your release (e.g., flyte-backend), <YOUR_NAMESPACE> with the name of your namespace (e.g., flyte), and <YOUR_YAML_FILE> with the name of your YAML file.
helm upgrade <RELEASE_NAME> flyteorg/flyte-core -n <YOUR_NAMESPACE> --values values-override.yaml
Replace <RELEASE_NAME> with the name of your release (e.g., flyte) and <YOUR_NAMESPACE> with the name of your namespace (e.g., flyte).
Wait for the upgrade to complete. You can check the status of the deployment pods by running the following command:
kubectl get pods --all-namespaces