Configure Kubernetes Plugins#

Tags: Kubernetes, Integration, Spark, AWS, GCP, Advanced

This guide provides an overview of setting up the Kubernetes operator backend plugins in your Flyte deployment.

Spin up a cluster#

Enable the PyTorch plugin on the demo cluster by adding the following block to ~/.flyte/sandbox/config.yaml:

tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      pytorch: pytorch
    enabled-plugins:
    - container
    - k8s-array
    - sidecar
    - pytorch

Start the demo cluster by running the following command:

flytectl demo start
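
The demo cluster runs Flyte in the flyte namespace, so once it is up you can verify that the Flyte pods are healthy (assuming the default sandbox setup):

kubectl get pods -n flyte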

Note

Add the Flyte chart repo to Helm if you’re installing via the Helm charts.

helm repo add flyteorg https://flyteorg.github.io/flyte
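
After adding the repo, refresh the local chart index so Helm picks up the latest chart versions (standard Helm usage, nothing Flyte-specific):

helm repo update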

If you have installed Flyte using the flyte-sandbox Helm chart, please ensure:

  • You have the correct kubeconfig and have selected the correct Kubernetes context.

  • You have configured the correct flytectl settings in ~/.flyte/config.yaml.

    To set up the service account and RBAC resources needed by the Spark plugin (the annotations below assume AWS EKS IAM roles for service accounts), create the following four files and apply each using kubectl apply -f <filename>:

    1. serviceaccount.yaml

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: default
      namespace: "{{ namespace }}"
      annotations:
        eks.amazonaws.com/role-arn: "{{ defaultIamRole }}"
    
    2. spark_role.yaml

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: spark-role
      namespace: "{{ namespace }}"
    rules:
      - apiGroups:
          - ""
        resources:
          - pods
          - services
          - configmaps
        verbs:
          - "*"
    
    3. spark_service_account.yaml

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark
      namespace: "{{ namespace }}"
      annotations:
        eks.amazonaws.com/role-arn: "{{ defaultIamRole }}"
    
    4. spark_role_binding.yaml

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-role-binding
      namespace: "{{ namespace }}"
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: spark-role
    subjects:
      - kind: ServiceAccount
        name: spark
        namespace: "{{ namespace }}"
    

Install the Kubernetes operator#

First, install kustomize.

Build and apply the training-operator.

export KUBECONFIG=$KUBECONFIG:~/.kube/config:~/.flyte/k3s/k3s.yaml
kustomize build "https://github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.5.0" | kubectl apply -f -
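
To confirm the operator came up, list its pods; the standalone overlay installs into the kubeflow namespace by default, so adjust the namespace if your setup differs:

kubectl get pods -n kubeflow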

Optional: Using a gang scheduler

To address potential issues with worker pods of distributed training jobs being scheduled at different times due to resource constraints, you can opt for a gang scheduler. This ensures that all worker pods are scheduled simultaneously, reducing the likelihood of job failures caused by timeout errors.

To enable gang scheduling for the Kubeflow training-operator, you can install the Kubernetes scheduler plugins or the Apache YuniKorn scheduler.

  1. Install the Kubernetes scheduler plugins or Apache YuniKorn as a second scheduler.

  2. Configure the Kubeflow training-operator to use the new scheduler:

    Create a manifest called kustomization.yaml with the following content:

    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    
    resources:
    - github.com/kubeflow/training-operator/manifests/overlays/standalone
    
    patchesStrategicMerge:
    - patch.yaml
    

    Create a patch file called patch.yaml with the following content:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: training-operator
    spec:
      template:
        spec:
          containers:
          - name: training-operator
            command:
            - /manager
            - --gang-scheduler-name=<scheduler-plugins/yunikorn>
    

    Install the patched kustomization with the following command:

    kustomize build path/to/overlay/directory | kubectl apply -f -
    

    (Only for Apache YuniKorn) To configure gang scheduling with Apache YuniKorn, make sure to set the following annotations in Flyte pod templates:

    • template.metadata.annotations.yunikorn.apache.org/task-group-name

    • template.metadata.annotations.yunikorn.apache.org/task-groups

    • template.metadata.annotations.yunikorn.apache.org/schedulingPolicyParameters

    For more configuration details, refer to the Apache YuniKorn Gang-Scheduling documentation.

  3. Use a Flyte pod template with template.spec.schedulerName: scheduler-plugins-scheduler to use the new gang scheduler for your tasks.

    For more information on pod templates in Flyte, see Configuring task pods with K8s PodTemplates. You can set the scheduler name in the pod template passed to the @task decorator. However, to avoid resource competition between the two different schedulers, it is recommended to set the scheduler name in the pod template in the flyte namespace, which is applied to all tasks. This allows non-distributed training tasks to be scheduled by the gang scheduler as well. A sketch of such a pod template is shown below.
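
    As a rough sketch (not verbatim from the Flyte documentation), a namespace-wide K8s PodTemplate that sets the gang scheduler might look like the following. The name and image are placeholders, the namespace follows the recommendation above, and the YuniKorn annotations are only relevant if you installed Apache YuniKorn rather than the scheduler plugins:

    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: flyte-scheduler-template            # placeholder name
      namespace: flyte                          # namespace recommended above
    template:
      metadata:
        annotations: {}                         # with Apache YuniKorn, add the task-group annotations listed above here
      spec:
        schedulerName: scheduler-plugins-scheduler   # or the scheduler name of Apache YuniKorn if you installed that instead
        containers:
          - name: default                       # base container that Flyte merges into task pods
            image: rwgrim/docker-noop           # placeholder image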

Specify plugin configuration#

To enable the plugin when using the Helm chart, add the following block to the values YAML file you pass to Helm:

tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - pytorch
    default-for-task-types:
      container: container
      container_array: k8s-array
      pytorch: pytorch
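
With the flyte-binary Helm chart, this block typically sits under configuration.inline in the values file you pass to helm upgrade, though the exact key layout can vary between chart versions; a sketch of such a values override file (the filename values-override.yaml is just an example) is:

configuration:
  inline:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
        default-for-task-types:
          container: container
          container_array: k8s-array
          pytorch: pytorch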

Upgrade the deployment#

If you are installing Flyte via the Helm chart, run the following command:

Note

There is no need to run helm upgrade for Spark.

helm upgrade <RELEASE_NAME> flyteorg/flyte-binary -n <YOUR_NAMESPACE> --values <YOUR_YAML_FILE>

Replace <RELEASE_NAME> with the name of your release (e.g., flyte-backend), <YOUR_NAMESPACE> with the name of your namespace (e.g., flyte), and <YOUR_YAML_FILE> with the name of your YAML file.
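
For example, with the names suggested above and the values-override.yaml file sketched earlier, the command becomes:

helm upgrade flyte-backend flyteorg/flyte-binary -n flyte --values values-override.yaml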

Wait for the upgrade to complete. You can check the status of the deployment pods by running the following command:

kubectl get pods -n flyte