Run Spark on your Kubernetes Cluster
Tip
If you are just looking for examples of Spark on Flyte, refer to the Cookbook Spark Plugin.
Flyte has an optional plugin that makes it possible to run Apache Spark jobs natively on your Kubernetes cluster. This plugin has been used extensively at Lyft and is battle tested. It makes it extremely easy to run your PySpark code as a task (Scala/Java support is coming soon). The plugin dynamically creates a new virtual cluster for each Spark execution, and Flyte manages the execution and auto-scaling of the Spark job.
Caution
This has been tested at scale: more than 100k Spark jobs have run through Flyte at Lyft. It still requires large capacity on Kubernetes and careful configuration. We recommend using multi-cluster mode (How do I use multiple Kubernetes clusters?) and enabling resource limits (How do I limit resources per project/domain?) for large and extremely frequent Spark jobs. For extremely short-running jobs this is still not a recommended approach; a pre-spawned cluster might be a better fit.
Why use K8s Spark?
Managing Python dependencies is hard. Flyte makes it easy to version and manage dependencies using containers. The K8s Spark plugin brings all the benefits of containerization to Spark without requiring you to manage special Spark clusters.
Pros:
Extremely easy to get started and get complete isolation between workloads
Every job runs in isolation and has its own virtual cluster - no more nightmarish dependency management
Flyte manages everything for you!
Cons:
Short-running, bursty jobs are not a great fit, because of the container overhead
No interactive Spark capabilities are available; Flyte K8s Spark is better suited for ad hoc and/or scheduled jobs
How to enable Spark in the Flyte backend?
Flyte Spark uses the Spark On K8s Operator and a custom-built Flyte Spark plugin. This is a backend plugin, so you have to enable it in your deployment. To enable a plugin, follow the steps in How do I enable backend plugins?.
You can optionally configure the plugin as per the backend Config Structure; an example config is defined here, which looks like:
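As a rough illustration of the shape of such a config, the sketch below shows how default Spark properties might be set for the plugin. The exact keys and values vary between Flyte releases, so treat this as an assumption-laden example, not the canonical config:

```yaml
# Hedged sketch of a possible Flyte Spark plugin configuration.
plugins:
  spark:
    spark-config-default:
      # Applied to every Spark job unless overridden in the task's spark_conf
      - spark.driver.memory: "1000M"
      - spark.executor.memory: "1000M"
      - spark.eventLog.enabled: "false"
```

Consult the linked example config in your Flyte version for the authoritative structure.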
Spark in Flytekit
For a more complete example, refer to the Cookbook Spark Plugin.
Ensure you have flytekit>=0.16.0
Enable Spark in backend, following the previous section.
Install the flytekit spark plugin
pip install flytekitplugins-spark
Write regular PySpark code, with one change in the @task decorator. Refer to the example:

```python
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        # this configuration is applied to the Spark cluster
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.instances": "2",
            "spark.driver.cores": "1",
        }
    ),
    cache_version="1",
    cache=True,
)
def hello_spark(partitions: int) -> float:
    ...
    sess = flytekit.current_context().spark_session
    # Regular PySpark code
    ...
```
Run it locally
hello_spark(partitions=10)
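When run locally, flytekit substitutes a local Spark session. For intuition about what such a task might compute, the cookbook's hello_spark example estimates Pi by Monte Carlo sampling; a plain-Python sketch of that computation (function name and sample counts are illustrative, and no Spark is required) looks like:

```python
import random

def estimate_pi(partitions: int, samples_per_partition: int = 100_000) -> float:
    """Monte Carlo Pi estimate: the fraction of random points in the unit
    square that fall inside the quarter circle, multiplied by 4."""
    random.seed(0)  # seeded only to make this illustration deterministic
    total = partitions * samples_per_partition
    inside = sum(
        1
        for _ in range(total)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * inside / total
```

In the real task, the sampling loop is distributed across Spark executors via the session obtained from flytekit.current_context().spark_session.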
Use it in a workflow (see the cookbook)
Run it on a remote cluster. To do this, you have to build the correct Dockerfile, as explained in How to build your Dockerfile for Spark on K8s. You can also use the standard Dockerfile recommended by Spark.