Run Spark on your Kubernetes Cluster

Tip

If you are just looking for examples of Spark on Flyte, refer to the Cookbook Spark Plugin

Flyte has an optional plugin that makes it possible to run Apache Spark jobs natively on your Kubernetes cluster. This plugin has been used extensively at Lyft and is battle-tested. It makes it extremely easy to run your PySpark (Scala/Java coming soon) code as a task. The plugin dynamically creates a new virtual cluster for each Spark execution, and Flyte manages the execution and auto-scaling of the Spark job.

Caution

This has been tested at scale: more than 100k Spark jobs have run through Flyte at Lyft. It still requires a large amount of Kubernetes capacity and careful configuration. We recommend using multi-cluster mode - How do I use multiple Kubernetes clusters?, and enabling resource quotas - How do I limit resources per project/domain? - for large and extremely frequent Spark jobs. For extremely short-running jobs, this is still not a recommended approach; a pre-spawned cluster might be a better fit.

Why use K8s Spark?

Managing Python dependencies is hard. Flyte makes it easy to version and manage dependencies using containers. The K8s Spark plugin brings all the benefits of containerization to Spark without needing to manage special Spark clusters.

Pros:

  1. Extremely easy to get started and get complete isolation between workloads

  2. Every job runs in isolation and has its own virtual cluster - no more nightmarish dependency management

  3. Flyte manages everything for you!

Cons:

  1. Short-running, bursty jobs are not a great fit, because of the container overhead

  2. No interactive Spark capabilities are available with Flyte K8s Spark, which is better suited for running ad hoc and/or scheduled jobs

How to enable Spark in the Flyte backend?

Flyte Spark uses the Spark On K8s Operator and a custom-built Flyte Spark plugin. This is a backend plugin, and you have to enable it in your deployment. To enable a plugin, follow the steps in How do I enable backend plugins?.

You can optionally configure the plugin as per the backend Config Structure; an example config is defined here.
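As an illustration, a Spark section of the backend plugin configuration might look like the following sketch. The specific property values below are assumptions for illustration only; refer to the linked example config for the authoritative structure and defaults:

```yaml
plugins:
  spark:
    # Spark properties applied to every Spark job by default,
    # unless overridden by the task's own spark_conf
    spark-config-default:
      - spark.driver.memory: "1000M"
      - spark.executor.memory: "1000M"
      - spark.kubernetes.allocation.batch.size: "50"
```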

Spark in Flytekit

For a more complete example refer to Cookbook Spark Plugin

  1. Ensure you have flytekit>=0.16.0

  2. Enable Spark in the backend, following the previous section.

  3. Install the flytekit spark plugin

    pip install flytekitplugins-spark
    
  4. Write regular PySpark code - with one change to the @task decorator. Refer to the example

    import flytekit
    from flytekit import task
    from flytekitplugins.spark import Spark

    @task(
        task_config=Spark(
            # this configuration is applied to the spark cluster
            spark_conf={
                "spark.driver.memory": "1000M",
                "spark.executor.instances": "2",
                "spark.driver.cores": "1",
            }
        ),
        cache_version="1",
        cache=True,
    )
    def hello_spark(partitions: int) -> float:
        ...
        sess = flytekit.current_context().spark_session
        # Regular PySpark code
        ...
    
  5. Run it locally

    hello_spark(partitions=10)
    
  6. Use it in a workflow (check cookbook)

  7. Run it on a remote cluster - To do this, you have to build the correct Dockerfile, as explained in How to build your Dockerfile for Spark on K8s. You can also use the standard Dockerfile recommended by Spark.