Databricks plugin

Tags: Spark, Integration, DistributedComputing, Data, Advanced

Note

This is a legacy implementation of the Databricks integration. We recommend using the Databricks agent instead.

Flyte integrates with Databricks, enabling you to submit Spark jobs directly to the Databricks platform.

Installation

The Databricks plugin comes bundled with the Spark plugin. To install the Spark plugin, run the following command:

pip install flytekitplugins-spark

Flyte deployment configuration

To run the Databricks plugin on a Flyte cluster, you must configure it in your Flyte deployment. For more information, see the Databricks plugin setup guide.
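
As a rough sketch, the FlytePropeller configuration typically enables the databricks task plugin, routes spark tasks to it, and points it at your workspace. The keys below follow the setup guide linked above; the instance name and entrypoint path are placeholders:

tasks:
  task-plugins:
    enabled-plugins:
      - container
      - databricks
    default-for-task-types:
      - container: container
      - spark: databricks
plugins:
  databricks:
    # Placeholders: your workspace host and the DBFS path of the uploaded
    # Flyte entrypoint script.
    databricksInstance: <workspace>.cloud.databricks.com
    entrypointFile: dbfs:///FileStore/tables/entrypoint.py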

Example usage

For a usage example, see the Databricks plugin example page.
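
For a feel for the API before opening the example page, the sketch below shows the general shape of a task backed by the legacy plugin. It uses the Databricks task config from flytekitplugins.spark; databricks_conf follows the Databricks Jobs API request schema, and the run name, Spark version, node type, and worker count shown are illustrative placeholders, not recommendations:

import flytekit
from flytekit import task
from flytekitplugins.spark import Databricks


@task(
    task_config=Databricks(
        # Spark settings applied to the session on the Databricks cluster.
        spark_conf={
            "spark.executor.instances": "2",
            "spark.executor.memory": "1000M",
        },
        # Follows the Databricks Jobs API request schema; all values below
        # are illustrative placeholders.
        databricks_conf={
            "run_name": "flyte databricks plugin example",
            "new_cluster": {
                "spark_version": "12.2.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "timeout_seconds": 3600,
        },
    )
)
def count_rows(n: int) -> int:
    # The Spark session is injected by flytekit at runtime.
    sess = flytekit.current_context().spark_session
    return sess.sparkContext.parallelize(range(n)).count()

When executed remotely, the backend submits this task to Databricks as a job run against the new_cluster spec; executed locally, flytekit falls back to a local Spark session.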

Run the example on the Flyte cluster

To run the provided example on a Flyte cluster, use the following command:

pyflyte run --remote \
  --image ghcr.io/flyteorg/flytecookbook:databricks_plugin-latest \
  https://raw.githubusercontent.com/flyteorg/flytesnacks/master/examples/databricks_plugin/databricks_plugin/databricks_job.py \
  my_databricks_job

Using Spark on Databricks lets you version your entire execution environment through a custom-built Spark container; the same container can also run standard (non-Databricks) Spark tasks.

To build such an image, start from a base image provided by Databricks and copy the workflow code to /databricks/driver, as in the Dockerfile below.

FROM databricksruntime/standard:12.2-LTS
LABEL org.opencontainers.image.source=https://github.com/flyteorg/flytesnacks

ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
# Make the copied workflow code importable and prefer the Databricks runtime's Python
ENV PYTHONPATH /databricks/driver
ENV PATH="/databricks/python3/bin:$PATH"
# Run as root so system packages can be installed
USER 0

RUN apt-get update && apt-get install -y make build-essential libssl-dev git

# Install Python dependencies with the Databricks runtime's pip so they land
# in the Python environment Databricks uses to run the job
COPY ./requirements.in /databricks/driver/requirements.in
RUN /databricks/python3/bin/pip install -r /databricks/driver/requirements.in

WORKDIR /databricks/driver

# Copy the actual code
COPY . /databricks/driver/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows and launch plans.
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
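
The tag can be supplied at build time with a --build-arg flag; the image name and version below are placeholders:

docker build --build-arg tag=ghcr.io/<org>/<image>:<version> -t ghcr.io/<org>/<image>:<version> .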