Flyte is designed to be highly extensible and can be customized in multiple ways.
Flytekit plugins are simple plugins that can be implemented purely in Python and unit tested locally, and they extend Flytekit's functionality. These plugins can be anything; for comparison, think of them as similar to Airflow Operators.
- Execute SQL queries as tasks.
- Validate data with Great Expectations.
- Execute Jupyter Notebooks with Papermill.
- Validate pandas DataFrames with Pandera.
- Scale pandas workflows with Modin.
- Version your SQL database with Dolt.
- Run and test your dbt pipelines in Flyte.
- whylogs: the open standard for data logging.
- mlflow: the open standard for model tracking.
- Convert ML models to ONNX models seamlessly.
- Run analytical queries using DuckDB.
Using flytekit plugins
Data is automatically marshalled and unmarshalled in and out of the plugin. Users should mostly implement the
PythonTask API defined in Flytekit.
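The marshalling flow described above can be sketched in plain Python. Note that `SimpleTask` and the `o0` output name below are hypothetical stand-ins to illustrate the pattern, not flytekit's actual PythonTask API:

```python
from typing import Any, Callable, Dict


class SimpleTask:
    """Minimal sketch of the pattern behind a plugin task: the framework
    unmarshals a dict of named input literals into typed Python arguments,
    calls the user's function, and marshals the result back into named
    output literals. Illustrative only, not the real flytekit API."""

    def __init__(self, fn: Callable[..., Any]):
        self.fn = fn

    def execute(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Unmarshal: named input literals -> keyword arguments.
        result = self.fn(**inputs)
        # Marshal: wrap the return value as a named output literal.
        return {"o0": result}


double = SimpleTask(lambda x: x * 2)
# double.execute({"x": 21}) -> {"o0": 42}
```

A real plugin implements this marshalling once, so task authors only ever see typed Python values.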
Flytekit plugins are lazily loaded and can be released independently, like libraries. We follow a convention of naming plugin packages flytekitplugins-*, where * indicates the package to be integrated into Flytekit. For example, flytekitplugins-papermill enables users to author Flytekit tasks using Papermill.
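As a sketch of what this naming convention implies (illustrative only, not flytekit's actual loading machinery): a distribution named flytekitplugins-papermill provides a flytekitplugins.papermill module, so a plugin can be imported on demand and its absence detected cleanly:

```python
import importlib
from types import ModuleType
from typing import Optional


def try_load_plugin(name: str) -> Optional[ModuleType]:
    """Attempt to import flytekitplugins.<name>, the module that a
    flytekitplugins-<name> distribution installs. Returns None when the
    plugin is not installed. (Sketch of the lazy-loading idea only.)"""
    try:
        return importlib.import_module(f"flytekitplugins.{name}")
    except ModuleNotFoundError:
        return None


# try_load_plugin("papermill") returns the module if
# flytekitplugins-papermill is installed, else None.
```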
You can find the plugins maintained by the core Flyte team here.
Native Backend Plugins
Native backend plugins are plugins that can be executed without any external service dependencies, because the compute is orchestrated by Flyte itself within its provisioned Kubernetes clusters.
- Execute K8s pods for arbitrary workloads.
- Run Dask jobs on a K8s cluster.
- Run Spark jobs on a K8s cluster.
- Run distributed PyTorch training jobs using Kubeflow.
- Run distributed TensorFlow training jobs using Kubeflow.
- Run distributed deep learning training jobs using Horovod and MPI.
- Run Ray jobs on a K8s cluster.
External Service Backend Plugins

As the term suggests, external service backend plugins rely on external services like AWS Sagemaker, Hive, or Snowflake to handle the workload defined in the Flyte tasks that use the respective plugin.
- Train models with AWS Sagemaker's built-in algorithms, or define your own.
- Train PyTorch models using Sagemaker, with support for distributed training.
- Execute queries using AWS Athena.
- Run tasks and workflows on AWS Batch.
- Run Hive jobs in your workflows.
- Run Snowflake jobs in your workflows.
- Run Databricks jobs in your workflows.
- Run BigQuery jobs in your workflows.
Enabling Backend Plugins
To enable a backend plugin, you have to add the ID of the plugin to the enabled-plugins list, which is available under the tasks > task-plugins section of FlytePropeller's configuration.
The plugin configuration structure is defined here. An example of the config follows:

tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
    default-for-task-types:
      container: container
      sidecar: sidecar
      container_array: k8s-array
Finding the ``ID`` of the Backend Plugin
This is a little tricky, since you have to look at the source code of the plugin to figure out the ID. In the case of Spark, for example, the value of the ID is defined as spark.
Enabling a Specific Backend Plugin in Your Own Kustomize Generator

Flyte uses Kustomize to generate the deployment configuration, which can be leveraged to kustomize your own deployment.
Custom Container Tasks

Because Flyte uses executable Docker containers as the smallest unit of compute, you can write custom tasks with flytekit.ContainerTask via the flytekit SDK.
Execute arbitrary containers: you can write C++ code, bash scripts, and any other containerized program.
SDKs for Writing Tasks and Workflows

The community would love to help you build a new SDK from your own ideas. Currently, the available SDKs are:

- flytekit: the Python SDK for Flyte.
- flytekit-java: the Java/Scala SDK for Flyte.
Flyte Airflow Provider

The Flyte Airflow Provider lets you call Flyte tasks and workflows from within Airflow.