Databricks#

Tags: Spark, Integration, DistributedComputing, Data, Advanced

The Flyte backend can be connected to the Databricks service, which allows you to submit Spark jobs to the Databricks platform. This section explains how to use the Databricks plugin with Flytekit in Python.

Installation#

The Flytekit Databricks plugin is bundled into the Flytekit Spark plugin, so to use it, simply run the following:

pip install flytekitplugins-spark

How to Build Your Dockerfile for Spark on Databricks#

Using Spark on Databricks is straightforward and provides full versioning through a custom-built Spark container; the built container can also execute regular Spark tasks. For Spark on Databricks, the image must use a base image built by Databricks, and the workflow code must be copied to /databricks/driver:

FROM databricksruntime/standard:11.3-LTS
LABEL org.opencontainers.image.source=https://github.com/flyteorg/flytesnacks
# To build this dockerfile, run "make docker_build".

ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /databricks/driver
ENV PATH="/databricks/python3/bin:$PATH"
USER 0

RUN sudo apt-get update && sudo apt-get install -y make build-essential libssl-dev git

# Install custom package
RUN /databricks/python3/bin/pip install awscli
WORKDIR /opt
RUN curl https://sdk.cloud.google.com > install.sh
RUN bash /opt/install.sh --install-dir=/opt

# Install Python dependencies
COPY databricks/requirements.txt /databricks/driver/requirements.txt
RUN /databricks/python3/bin/pip install -r /databricks/driver/requirements.txt

WORKDIR /databricks/driver
# Copy the makefile targets to expose on the container. This makes it easier to register.
# Delete this after we update CI
COPY databricks/in_container.mk /databricks/driver/Makefile

# Delete this after we update CI to not serialize inside the container
COPY databricks/sandbox.config /databricks/driver

# Copy the actual code
COPY databricks/ /databricks/driver/databricks/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
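The comment at the top of the Dockerfile refers to a make docker_build target in the example repository. If you are building by hand, a plain docker build along the lines of the sketch below works; the registry and tag are placeholders, and the tag build argument populates the FLYTE_INTERNAL_IMAGE variable declared above:

docker build --build-arg tag=v1.0.0 -t <registry>/flyte-databricks-spark:v1.0.0 .
docker push <registry>/flyte-databricks-spark:v1.0.0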

Configuring the backend to get the Databricks plugin working#

  1. Make sure to add "databricks" to tasks.task-plugins.enabled-plugins in enabled_plugins.yaml (a sample snippet follows the secret configuration below).

  2. Add the Databricks access token to FlytePropeller. Refer to the Databricks documentation for details on how to create an access token.

kubectl edit secret -n flyte flyte-propeller-auth
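Kubernetes stores Secret values under the data field base64-encoded, so encode the raw token before pasting it in; for example:

echo -n '<ACCESS_TOKEN>' | base64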

The configuration will look like the following:

apiVersion: v1
data:
  FLYTE_DATABRICKS_API_TOKEN: <ACCESS_TOKEN>
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: flyte
...
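For step 1, the relevant section of enabled_plugins.yaml would look roughly like the sketch below. The other plugin names shown are common defaults, and mapping the spark task type to databricks (an assumption shown here) is what routes Spark tasks to the Databricks platform:

tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - databricks
    default-for-task-types:
      container: container
      sidecar: sidecar
      container_array: k8s-array
      spark: databricks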

Writing a PySpark Task#

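Below is a minimal sketch of a PySpark task that runs on Databricks via the plugin, assuming the backend configuration above is in place. All spark_conf and databricks_conf values (spark_version, node_type_id, num_workers, and so on) are illustrative; adjust them to your workspace.

import random
from operator import add

import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Databricks


@task(
    task_config=Databricks(
        # Standard Spark settings for the job.
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
        },
        # Passed through to the Databricks Jobs API; all values here are
        # placeholders -- pick a spark_version and node_type_id that are
        # valid in your workspace.
        databricks_conf={
            "run_name": "flytekit databricks plugin example",
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "r3.xlarge",
                "num_workers": 4,
            },
            "timeout_seconds": 3600,
            "max_retries": 1,
        },
    ),
)
def estimate_pi(partitions: int) -> float:
    # The plugin provides a Spark session through the Flyte context at runtime.
    sess = flytekit.current_context().spark_session
    n = 100000 * partitions

    def sample(_: int) -> int:
        # Draw a random point in the unit square; count it if it falls
        # inside the unit circle.
        x, y = random.random() * 2 - 1, random.random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = (
        sess.sparkContext.parallelize(range(1, n + 1), partitions)
        .map(sample)
        .reduce(add)
    )
    return 4.0 * count / n


@workflow
def my_databricks_wf(partitions: int = 10) -> float:
    return estimate_pi(partitions=partitions)

Aside from the Databricks task_config, the task body is identical to a regular flytekitplugins-spark task, so the same code can run on a plain Spark cluster by swapping out the config class.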