Databricks Plugin#

This guide provides an overview of how to set up Databricks in your Flyte deployment.

Spin up a cluster#

You can spin up a demo cluster using the following command:

flytectl demo start

Alternatively, install Flyte using the flyte-binary Helm chart.


Add the Flyte chart repo to Helm if you’re installing via the Helm charts.

helm repo add flyteorg https://flyteorg.github.io/flyte

Databricks workspace#

To set up your Databricks account, follow these steps:

  1. Create a Databricks account.

A screenshot of Databricks workspace creation.
  2. Ensure that you have a Databricks workspace up and running.

A screenshot of Databricks workspace.
  3. Generate a personal access token to be used in the Flyte configuration. You can create the token in the workspace under User settings -> Developer -> Access tokens.

A screenshot of access token.
  4. Enable custom containers on your Databricks cluster before you trigger the workflow.

curl -X PATCH -n -H "Authorization: Bearer <your-personal-access-token>" \
https://<databricks-instance>/api/2.0/workspace-conf \
-d '{"enableDcs": "true"}'

For more detail, check custom containers.
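
The same request can also be prepared from Python. The following is a minimal sketch using only the standard library; the helper name `build_enable_dcs_request` and the example host and token are illustrative assumptions, not part of Flyte or Databricks tooling. It only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_enable_dcs_request(databricks_instance: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that enables custom containers.

    `build_enable_dcs_request` is a hypothetical helper written for this guide.
    """
    url = f"https://{databricks_instance}/api/2.0/workspace-conf"
    body = json.dumps({"enableDcs": "true"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder host and token; substitute your own values.
request = build_enable_dcs_request("example.cloud.databricks.com", "my-token")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

Keeping the request construction separate from sending makes it easy to inspect the URL and payload before touching the workspace.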

5. Create an instance profile for the Spark cluster. This profile enables the Spark job to access your data in the S3 bucket.

Create an instance profile using the AWS console (For AWS Users)#

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click Create role.

    1. Under Trusted entity type, select AWS service.

    2. Under Use case, select EC2.

    3. Click Next.

    4. At the bottom of the page, click Next.

    5. In the Role name field, type a role name.

    6. Click Create role.

  4. In the role list, click the AmazonS3FullAccess role.

  5. Click the Create role button.

In the role summary, copy the Role ARN.

A screenshot of s3 arn.

Locate the IAM role that created the Databricks deployment#

If you don’t know which IAM role created the Databricks deployment, do the following:

  1. As an account admin, log in to the account console.

  2. Go to Workspaces and click your workspace name.

  3. In the Credentials box, note the role name at the end of the Role ARN.

For example, in the Role ARN arn:aws:iam::123456789123:role/finance-prod, the role name is finance-prod.
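
If you need to extract the role name in a script, the rule above (take the text after the last `/`) can be sketched as follows. The function name `role_name_from_arn` is a hypothetical helper, and the validation is deliberately loose.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Return the role name, i.e. the text after the last '/' in an IAM role ARN."""
    prefix, sep, name = role_arn.rpartition("/")
    # Loose sanity check: the part before the name should contain ":role".
    if not sep or ":role" not in prefix or not name:
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    return name


print(role_name_from_arn("arn:aws:iam::123456789123:role/finance-prod"))  # finance-prod
```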

Edit the IAM role that created the Databricks deployment#

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click the role that created the Databricks deployment.

  4. On the Permissions tab, click the policy.

  5. Click Edit Policy.

  6. Append the following block to the end of the Statement array. Ensure that you don’t overwrite any part of the existing policy. Replace <iam-role-for-s3-access> with the role you created in Configure S3 access with instance profiles.

  {
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
  }

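If you edit the policy document programmatically rather than in the console, the append step can be sketched as below. This is only an illustration on a plain dictionary: `append_pass_role`, the sample policy, the account ID, and the role name are all assumptions, and the code does not call AWS.

```python
import copy
from typing import Any, Dict


def append_pass_role(policy: Dict[str, Any], account_id: str, role_name: str) -> Dict[str, Any]:
    """Return a copy of `policy` with an iam:PassRole statement appended."""
    updated = copy.deepcopy(policy)  # leave the caller's policy untouched
    statements = updated.get("Statement", [])
    if isinstance(statements, dict):
        # IAM also accepts a single statement object; normalize it to a list.
        statements = [statements]
    statements.append(
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
        }
    )
    updated["Statement"] = statements
    return updated


# Hypothetical existing policy and identifiers, for illustration only.
existing = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "ec2:DescribeInstances", "Resource": "*"}],
}
updated = append_pass_role(existing, "123456789123", "my-s3-access-role")
print(len(existing["Statement"]), len(updated["Statement"]))  # 1 2
```

Working on a copy keeps the existing statements intact, which mirrors the warning not to overwrite the existing policy.
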
6. Upload the following file to either DBFS (the final path will be dbfs:///FileStore/tables/) or S3. This file will be executed by the Spark driver node, overriding the default command of the Databricks job. The entrypoint file will:

  1. Download the inputs from S3 to the local filesystem.

  2. Execute the spark task.

  3. Upload the outputs from the local filesystem to S3 for the downstream tasks to consume.

A screenshot of dbfs.

import os
import sys
from typing import List

import click
import pandas
from flytekit.bin.entrypoint import fast_execute_task_cmd as _fast_execute_task_cmd
from flytekit.bin.entrypoint import execute_task_cmd as _execute_task_cmd
from flytekit.exceptions.user import FlyteUserException
from flytekit.tools.fast_registration import download_distribution

def fast_execute_task_cmd(additional_distribution: str, dest_dir: str, task_execute_cmd: List[str]):
    if additional_distribution is not None:
        if not dest_dir:
            dest_dir = os.getcwd()
        download_distribution(additional_distribution, dest_dir)

    # Insert the call to fast before the unbounded resolver args
    cmd = []
    for arg in task_execute_cmd:
        if arg == "--resolver":
            cmd.extend(["--dynamic-addl-distro", additional_distribution, "--dynamic-dest-dir", dest_dir])
        # Keep every original argument; otherwise the rebuilt command would drop them.
        cmd.append(arg)

    click_ctx = click.Context(click.Command("dummy"))
    parser = _execute_task_cmd.make_parser(click_ctx)
    args, _, _ = parser.parse_args(cmd[1:])
    _execute_task_cmd.callback(test=False, **args)

def main():
    args = sys.argv
    click_ctx = click.Context(click.Command("dummy"))
    if args[1] == "pyflyte-fast-execute":
        # Fast-registered task: download the code distribution, then run the task.
        parser = _fast_execute_task_cmd.make_parser(click_ctx)
        args, _, _ = parser.parse_args(args[2:])
        fast_execute_task_cmd(**args)
    elif args[1] == "pyflyte-execute":
        parser = _execute_task_cmd.make_parser(click_ctx)
        args, _, _ = parser.parse_args(args[2:])
        _execute_task_cmd.callback(test=False, dynamic_addl_distro=None, dynamic_dest_dir=None, **args)
    else:
        raise FlyteUserException(f"Unrecognized command: {args[1:]}")

if __name__ == '__main__':
    main()
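
To see what the argument rewriting in the fast-execute path is meant to do, the loop can be isolated into a pure function. This is a sketch that assumes every original argument is carried over into the rebuilt command; `insert_fast_args` and the sample command are illustrative and not part of flytekit.

```python
from typing import List


def insert_fast_args(task_execute_cmd: List[str], distro: str, dest_dir: str) -> List[str]:
    """Insert the dynamic-distribution flags immediately before `--resolver`."""
    cmd: List[str] = []
    for arg in task_execute_cmd:
        if arg == "--resolver":
            cmd.extend(["--dynamic-addl-distro", distro, "--dynamic-dest-dir", dest_dir])
        cmd.append(arg)  # every original argument is kept, in order
    return cmd


# Hypothetical command line, for illustration only.
original = ["pyflyte-execute", "--inputs", "s3://bucket/inputs.pb", "--resolver", "my.resolver", "task-name", "t1"]
rewritten = insert_fast_args(original, "s3://bucket/code.tar.gz", "/tmp/code")
print(rewritten[3:7])
# ['--dynamic-addl-distro', 's3://bucket/code.tar.gz', '--dynamic-dest-dir', '/tmp/code']
```

The flags are placed before `--resolver` because everything after it is treated as unbounded resolver arguments.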

Specify plugin configuration#


The demo cluster saves data to MinIO, but the Databricks job saves data to S3. You therefore need to update the AWS credentials for the single binary deployment so that the pod can access the S3 bucket that the Databricks job writes to.

Enable the Databricks plugin on the demo cluster by adding the following config to ~/.flyte/sandbox/config.yaml:

tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      spark: databricks
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - databricks
plugins:
  databricks:
    entrypointFile: dbfs:///FileStore/tables/
    databricksInstance: <DATABRICKS_ACCOUNT>
remoteData:
  region: <AWS_REGION>
  scheme: aws
  signedUrls:
    durationMinutes: 3
propeller:
  rawoutput-prefix: s3://<S3_BUCKET_NAME>/
storage:
  container: "<S3_BUCKET_NAME>"
  type: s3
  stow:
    kind: s3
    config:
      region: <AWS_REGION>
      disable_ssl: true
      v2_signing: false
      auth_type: accesskey
      access_key_id: <AWS_ACCESS_KEY_ID>
      secret_key: <AWS_SECRET_ACCESS_KEY>
      endpoint: ""

Substitute <DATABRICKS_ACCOUNT> with the name of your Databricks account, <AWS_REGION> with the region where you created your AWS bucket, <AWS_ACCESS_KEY_ID> with your AWS access key ID, <AWS_SECRET_ACCESS_KEY> with your AWS secret access key, and <S3_BUCKET_NAME> with the name of your S3 bucket.
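
If you prefer to fill in the placeholders with a script instead of by hand, one possible approach is sketched below. The helper `fill_placeholders`, the sample template, and the sample values are assumptions made for illustration; the function fails loudly when a placeholder has no value so that no `<...>` marker is left in the final config.

```python
import re
from typing import Dict

# Matches upper-case placeholders such as <AWS_REGION> or <S3_BUCKET_NAME>.
PLACEHOLDER = re.compile(r"<([A-Z0-9_]+)>")


def fill_placeholders(template: str, values: Dict[str, str]) -> str:
    """Replace every <NAME> placeholder in `template` with values[NAME]."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for <{name}>")
        return values[name]

    return PLACEHOLDER.sub(replace, template)


# Hypothetical values, for illustration only.
template = 'region: <AWS_REGION>\ncontainer: "<S3_BUCKET_NAME>"'
print(fill_placeholders(template, {"AWS_REGION": "us-east-2", "S3_BUCKET_NAME": "my-bucket"}))
```

Avoid committing the filled-in file to version control, since it contains your AWS secret access key.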

Add the Databricks access token#

Add the Databricks access token to FlytePropeller:

Add the access token as an environment variable to the flyte-sandbox deployment.

kubectl edit deploy flyte-sandbox -n flyte

Update the env configuration:

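- name: POD_NAME
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.name
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.namespace
- name: FLYTE_SECRET_FLYTE_DATABRICKS_API_TOKEN
  value: <ACCESS_TOKEN>
image: flyte-binary:sandbox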

Replace <ACCESS_TOKEN> with your access token.

Upgrade the deployment#

kubectl rollout restart deployment flyte-sandbox -n flyte

Wait for the upgrade to complete. You can check the status of the deployment pods by running the following command:

kubectl get pods -n flyte

For the Databricks plugin on a Flyte cluster, refer to the Databricks Plugin Example.