Databricks#

Tags: Spark, Integration, DistributedComputing, Data, Advanced

Flyte integrates with the Databricks service, allowing you to submit Spark jobs to the Databricks platform.

Install the plugin#

The Databricks plugin comes bundled with the Spark plugin. To install it, run the following command:

pip install flytekitplugins-spark

If you intend to run the plugin on the Flyte cluster, you must first set it up on the backend. Please refer to the Databricks plugin setup guide for detailed instructions.

Run the example on the Flyte cluster#

To run the provided example on the Flyte cluster, use the following command:

pyflyte run --remote \
  --image ghcr.io/flyteorg/flytecookbook:databricks_plugin-latest \
  https://raw.githubusercontent.com/flyteorg/flytesnacks/master/examples/databricks_plugin/databricks_plugin/databricks_job.py \
  my_databricks_job
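
The example file defines a workflow named my_databricks_job that wraps a Spark task configured to run on Databricks. Below is a minimal sketch of such a task, assuming the Databricks task config exported by flytekitplugins.spark; the cluster settings (Spark version, node type, worker count) are illustrative placeholders, not values taken from the example:

import random
from operator import add

import flytekit
from flytekit import task
from flytekitplugins.spark import Databricks

@task(
    task_config=Databricks(
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
        },
        # Job request forwarded to the Databricks Jobs API; placeholder values.
        databricks_conf={
            "run_name": "flytekit databricks plugin example",
            "new_cluster": {
                "spark_version": "12.2.x-scala2.12",
                "node_type_id": "n2-highmem-4",
                "num_workers": 1,
            },
            "timeout_seconds": 3600,
            "max_retries": 1,
        },
    ),
)
def hello_spark(partitions: int) -> float:
    # Estimate pi with a Monte Carlo simulation spread across the cluster.
    n = 100000 * partitions
    sess = flytekit.current_context().spark_session
    count = (
        sess.sparkContext.parallelize(range(1, n + 1), partitions)
        .map(lambda _: 1 if random.random() ** 2 + random.random() ** 2 <= 1 else 0)
        .reduce(add)
    )
    return 4.0 * count / n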

Note

Using Spark on Databricks is straightforward and offers comprehensive versioning through a custom-built Spark container. This custom container can also run standard Spark tasks.

To use Spark on Databricks, the image must be built on a Databricks-provided base image, and the workflow code must be copied to /databricks/driver.

FROM databricksruntime/standard:12.2-LTS
LABEL org.opencontainers.image.source=https://github.com/flyteorg/flytesnacks

ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
# Make the copied workflow code importable and use the Databricks-provided Python
ENV PYTHONPATH /databricks/driver
ENV PATH="/databricks/python3/bin:$PATH"
USER 0

# Install system packages needed to build Python dependencies
RUN apt-get update && apt-get install -y make build-essential libssl-dev git

# Install Python dependencies
COPY ./requirements.txt /databricks/driver/requirements.txt
RUN /databricks/python3/bin/pip install -r /databricks/driver/requirements.txt

WORKDIR /databricks/driver

# Copy the actual code
COPY . /databricks/driver/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows and launch plans.
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
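
Because the image is a complete Spark image, it can also back regular Spark tasks that run on the Flyte cluster rather than on Databricks. A minimal sketch, assuming the standard Spark task config from flytekitplugins.spark:

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
        },
    ),
)
def count_rows(n: int) -> int:
    # Runs as an ordinary Spark task on the Flyte cluster, using the same image.
    sess = flytekit.current_context().spark_session
    return sess.sparkContext.parallelize(range(n)).count()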