Executing Spark Jobs Natively on a K8s Cluster

Flyte can execute Spark jobs natively on a Kubernetes cluster, managing the lifecycle of a virtual cluster, including its spin-up and tear-down. This leverages the open-source Spark on K8s operator and can be enabled without signing up for any service. To enable Spark for your Flyte cluster, refer to :std:ref:`plugins-spark-k8s`. Flytekit makes it possible to write PySpark code natively as a task, and the Spark cluster will be automatically configured using the decorated SparkConf. The examples in this section provide a hands-on tutorial for writing PySpark tasks.
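As a minimal sketch of what such a task looks like (the task name, Spark settings, and Monte Carlo pi computation here are illustrative, and assume ``flytekitplugins-spark`` is installed as described below):

```python
import random
from operator import add

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


def inside_unit_circle(_) -> int:
    # Sample a random point in the 2x2 square and test membership in the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0


@task(
    task_config=Spark(
        # This SparkConf is applied to the virtual cluster that Flyte
        # spins up for the task; the values are example settings only.
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.instances": "2",
        }
    )
)
def hello_spark(partitions: int) -> float:
    # The plugin exposes a ready-made SparkSession via the task context.
    sess = flytekit.current_context().spark_session
    n = 100000 * partitions
    count = (
        sess.sparkContext.parallelize(range(1, n + 1), partitions)
        .map(inside_unit_circle)
        .reduce(add)
    )
    # Monte Carlo estimate of pi.
    return 4.0 * count / n
```

When this task runs on a Flyte cluster with the Spark plugin enabled, the decorator's ``spark_conf`` is used to configure the driver and executors automatically.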

Prerequisites / Setup

  1. Install flytekitplugins-spark using pip in an environment that contains flytekit >= 0.16.0:

    pip install flytekitplugins-spark

  2. Build a Spark image correctly, as explained below.

  3. Enable the Spark plugin for Flyte by following :std:ref:`plugins-spark-k8s`. In addition, Flyte uses the Spark operator to run Spark jobs and creates a separate K8s service account/role per namespace. All of these are created as part of the standard Flyte deployment.

  4. Ensure you have enough resources on your K8s cluster. Based on the resources required for your Spark job (across the driver and executors), you might have to tweak the resource quotas for the namespace.
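As an illustrative example of tweaking a namespace's quota (the namespace name and limits below are placeholders, not values prescribed by Flyte), a Kubernetes ResourceQuota might look like:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: flytesnacks-development   # hypothetical project-domain namespace
spec:
  hard:
    # Caps on the total CPU/memory limits of all pods in the namespace;
    # size these to cover the driver plus all executors of your largest job.
    limits.cpu: "16"
    limits.memory: 32Gi
```

Apply it with ``kubectl apply -f quota.yaml`` and verify with ``kubectl describe resourcequota -n <namespace>``.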

How to build your Dockerfile for Spark on K8s

Using Spark on K8s is straightforward and provides full versioning through a custom-built Spark container; the same container can also execute regular (non-Spark) tasks. For Spark, the image must contain the Spark dependencies as well as the correct entrypoint for the Spark driver and executors. This can be achieved by using the flytekit_install_spark.sh script referenced in the Dockerfile included here.
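A rough sketch of such a Dockerfile follows. The base image, Java package, and script location are assumptions for illustration; consult the Dockerfile shipped with the examples for the authoritative version.

```dockerfile
# Illustrative sketch only -- base image, versions, and paths are assumptions.
FROM python:3.8-slim-buster

# Spark's driver and executors require a JRE.
RUN apt-get update && apt-get install -y openjdk-11-jre-headless wget \
    && rm -rf /var/lib/apt/lists/*

# Install flytekit together with the Spark plugin.
RUN pip install flytekit flytekitplugins-spark

# Install Spark and set up the entrypoint expected by the Spark operator,
# using the helper script mentioned above (copy it from wherever your
# project keeps it; the path here is hypothetical).
COPY scripts/flytekit_install_spark.sh /opt/
RUN /opt/flytekit_install_spark.sh

# Copy your workflow code into the image.
COPY . /root
WORKDIR /root
```

The key point is that the image serves double duty: it carries your versioned Python code for regular tasks and the Spark runtime plus entrypoint for driver/executor pods.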
