Extend Data Persistence layer#

Flytekit provides a data persistence layer that records metadata shared with the Flyte backend. The same layer is used by various types to store raw user data, and it is designed to be cross-cloud compatible. It is also extensible: users can bring their own data persistence plugins by implementing the persistence interface.

Note

This layer will be extended to cover a wider variety of use cases, but the core set of APIs has been battle tested.

flytekit.core.data_persistence#

The data persistence module is used by core flytekit and most of the core TypeTransformers to fetch and store data between the durable backend store and the runtime environment. It is designed as a pluggable system, with a simple default implementation that ships with the core.

DataPersistence

Base abstract type for all DataPersistence operations.

DataPersistencePlugins

DataPersistencePlugins is the core plugin registry that stores all DataPersistence plugins.

DiskPersistence

Disk-based persistence, the simplest form of persistence available with default flytekit.

FileAccessProvider

This is the class that is available through the FlyteContext and can be used for persisting data to the remote durable store.

UnsupportedPersistenceOp

This exception is raised by any method that is not supported by the data persistence layer.
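The relationship between the base type, the plugin registry, and the unsupported-operation exception can be illustrated with a minimal sketch. This is not flytekit's implementation: the names `Persistence`, `PluginRegistry`, `MemoryPersistence`, and `UnsupportedOp`, the `mem://` prefix, and the method signatures are all hypothetical and chosen only to show the pattern of looking up a plugin by path prefix.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class UnsupportedOp(Exception):
    """Hypothetical stand-in for an 'operation not supported' error."""


class Persistence(ABC):
    """Hypothetical base type: every backend implements put and get."""

    @abstractmethod
    def put(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, path: str) -> bytes: ...


class PluginRegistry:
    """Maps a path prefix such as "mem://" to the class that handles it."""

    _plugins: Dict[str, Type[Persistence]] = {}

    @classmethod
    def register(cls, prefix: str, plugin: Type[Persistence]) -> None:
        # Registering the same prefix again replaces the earlier plugin.
        cls._plugins[prefix] = plugin

    @classmethod
    def find(cls, path: str) -> Type[Persistence]:
        # Pick the longest registered prefix that the path starts with.
        matches = [p for p in cls._plugins if path.startswith(p)]
        if not matches:
            raise UnsupportedOp(f"no plugin registered for {path!r}")
        return cls._plugins[max(matches, key=len)]


class MemoryPersistence(Persistence):
    """Toy backend that keeps blobs in a dict shared by all instances."""

    _blobs: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def get(self, path: str) -> bytes:
        return self._blobs[path]


PluginRegistry.register("mem://", MemoryPersistence)

plugin = PluginRegistry.find("mem://bucket/key")()
plugin.put("mem://bucket/key", b"hello")
print(plugin.get("mem://bucket/key"))  # b'hello'
```

In this sketch, registering a second class under an existing prefix replaces the first, which is one way a plugin package could override a default implementation.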

DataPersistence Extras#

This module provides some default implementations of flytekit.DataPersistence. These implementations use command-line clients to download and upload data. The actual binaries need to be installed for these extras to work. The binaries are not bundled with flytekit to keep it lightweight.

Persistence Extras#

GCSPersistence([default_prefix, data_config])

This DataPersistence plugin uses a preinstalled gsutil binary in the container to download and upload data.

HttpPersistence(*args, **kwargs)

DataPersistence implementation for the HTTP protocol.

S3Persistence([default_prefix, data_config])

DataPersistence plugin for AWS S3 (and Minio).
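The command-line approach used by these extras can be sketched roughly as follows. The helper names `build_copy_command` and `copy` are hypothetical, and the exact arguments flytekit passes to the binaries are not shown in this document. The sketch only assumes the standard `aws s3 cp` and `gsutil cp` commands, and it checks that the binary is installed before running it.

```python
import shutil
import subprocess
from typing import List


def build_copy_command(src: str, dst: str, recursive: bool = False) -> List[str]:
    """Return the CLI invocation that copies src to dst (hypothetical helper)."""
    # Whichever side carries a URL scheme decides which client to use.
    remote = src if "://" in src else dst
    if remote.startswith("s3://"):
        cmd = ["aws", "s3", "cp", src, dst]
        if recursive:
            cmd.append("--recursive")
    elif remote.startswith("gs://"):
        cmd = ["gsutil", "cp"]
        if recursive:
            cmd.append("-r")
        cmd += [src, dst]
    else:
        raise ValueError(f"unsupported path: {remote!r}")
    return cmd


def copy(src: str, dst: str, recursive: bool = False) -> None:
    """Run the copy, failing early if the required binary is missing."""
    cmd = build_copy_command(src, dst, recursive)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} is not installed or not on PATH")
    subprocess.run(cmd, check=True)
```

Separating command construction from execution keeps the first step easy to inspect without the binaries being present.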

The fsspec Data Plugin#

Flytekit ships with a default storage driver that uses aws-cli on AWS and gsutil on GCP. By default, Flyte uploads the task outputs to S3 or GCS using these storage drivers.

Why fsspec?#

The fsspec plugin provides an implementation of the data persistence layer in Flytekit, the layer that helps store logs of metadata and raw user data. With it, you can use any of the filesystems that fsspec supports with flytekit. For example, HDFS and FTP are supported in fsspec, so you can use them with flytekit too. Individual drivers are installed separately; the S3 driver, for instance, can be installed using pip install s3fs.

The plugin maps each protocol to the class that implements it.
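One way to see such a protocol-to-class mapping, assuming fsspec is installed, is to read fsspec's own registry, `fsspec.registry.known_implementations`. The helper `protocol_classes` below is a hypothetical name, and the sketch inspects fsspec directly rather than the flytekit plugin.

```python
import importlib.util
from typing import Dict, Iterable


def protocol_classes(protocols: Iterable[str]) -> Dict[str, str]:
    """Map each known protocol name to the dotted path of its fsspec class."""
    if importlib.util.find_spec("fsspec") is None:
        raise RuntimeError("fsspec is not installed; try `pip install fsspec`")
    # Imported lazily so the module loads even when fsspec is absent.
    from fsspec.registry import known_implementations

    return {
        protocol: known_implementations[protocol]["class"]
        for protocol in protocols
        if protocol in known_implementations
    }
```

Calling `protocol_classes(["file", "s3"])` returns the class path registered for each protocol. The registry lists drivers such as s3fs even when they are not installed, so an entry here does not guarantee the driver can be imported.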

Once you install the plugin, it overrides all default implementations of the DataPersistencePlugins and provides the ones supported by fsspec.

Note

This plugin installs fsspec core only. To install all the fsspec plugins, see here.