flytekitplugins.spark.ParquetToSparkDecodingHandler

class flytekitplugins.spark.ParquetToSparkDecodingHandler

Extend this abstract class, implement the decode function, and register your concrete class with the StructuredDatasetTransformerEngine so that the core flytekit type engine can handle your dataframe library. This is the decoder interface: it is used when a Flyte Literal value exists and a Python value must be produced from it. For the other direction, see StructuredDatasetEncoder.

Parameters
  • python_type – The dataframe class in question that you want to register this decoder with

  • protocol – A prefix representing the storage driver (e.g. ‘s3’, ‘gs’, ‘bq’, etc.). You can use either “s3” or “s3://”; they are equivalent, since the “://” is stripped by the constructor. If None, this decoder will be registered with all protocols that flytekit’s data persistence layer is capable of handling.

  • supported_format – Arbitrary string representing the format. If not supplied, an empty string is used. An empty string implies that the decoder works with any format. If the requested format does not exist, the transformer engine will look for the “” decoder instead and log a warning.
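To illustrate how these three parameters key a decoder in the registry, and how the empty-string format acts as a fallback, here is a minimal, library-free sketch. The names `DecoderRegistry`, `MyDataFrame`, and `MyParquetDecoder` are hypothetical stand-ins for illustration only; the real interfaces are flytekit's `StructuredDatasetDecoder` and `StructuredDatasetTransformerEngine`.

```python
# Simplified stand-in for the decoder-registration pattern described above.
# These classes are NOT flytekit's API; they only mirror the lookup behaviour
# implied by the python_type / protocol / supported_format parameters.

class DecoderRegistry:
    """Stand-in for the transformer engine's decoder registry."""
    _decoders = {}

    @classmethod
    def register(cls, decoder):
        # Decoders are keyed by (python_type, protocol, format).
        key = (decoder.python_type, decoder.protocol, decoder.supported_format)
        cls._decoders[key] = decoder

    @classmethod
    def find(cls, python_type, protocol, fmt):
        # If the requested format is not registered, fall back to the
        # "" (any-format) decoder, as the docstring above describes.
        return (cls._decoders.get((python_type, protocol, fmt))
                or cls._decoders.get((python_type, protocol, "")))


class MyDataFrame:
    """Hypothetical dataframe type a decoder might handle."""


class MyParquetDecoder:
    python_type = MyDataFrame
    protocol = "s3"            # "s3://" would be normalized to "s3"
    supported_format = "parquet"

    def decode(self, flyte_value):
        # A real decoder would read flyte_value.uri into a dataframe.
        return MyDataFrame()


decoder = MyParquetDecoder()
DecoderRegistry.register(decoder)
handler = DecoderRegistry.find(MyDataFrame, "s3", "parquet")
```

A decoder registered with `supported_format=""` would be returned for any format string that has no dedicated decoder, which is the fallback behaviour the warning refers to.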

Methods

decode(ctx, flyte_value, current_task_metadata)

This method is called by the structured dataset transformer engine to translate a Flyte Literal value into a Python instance.

Parameters
  • ctx (flytekit.core.context_manager.FlyteContext) – A FlyteContext, useful in accessing the filesystem and other attributes

  • flyte_value (flytekit.models.literals.StructuredDataset) – This will be a Flyte IDL StructuredDataset Literal - do not confuse this with the StructuredDataset class also defined in this module.

  • current_task_metadata (flytekit.models.literals.StructuredDatasetMetadata) – Metadata object containing the type (and columns if any) for the currently executing task. This type may have more or less information than the type information bundled inside the incoming flyte_value.

Returns

This function can either return an instance of the dataframe that this decoder handles, or an iterator of those dataframes.

Return type

pyspark.sql.dataframe.DataFrame
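As noted for current_task_metadata above, the currently executing task's type may carry a column subset that the decoded dataframe should be narrowed to. The sketch below shows that column-handling logic in isolation, with the Spark read mocked out as a plain dict of columns; `decode_like` and its arguments are hypothetical names for illustration, not flytekit's or PySpark's API.

```python
# Hedged sketch of the column-selection step a decode implementation applies.
# `loaded` stands in for the dataframe produced by reading flyte_value.uri;
# `requested_columns` stands in for the column names carried by
# current_task_metadata, if any.

def decode_like(loaded, requested_columns):
    """Return only the columns the task metadata asks for, or everything
    when the metadata carries no column information."""
    if requested_columns:
        return {name: loaded[name] for name in requested_columns}
    return loaded


full = {"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.9]}
subset = decode_like(full, ["id", "name"])   # metadata narrows the frame
everything = decode_like(full, None)         # no column info: pass through
```

In the real Spark decoder, the narrowing step would be a `select` on the dataframe read from the literal's URI rather than a dict comprehension.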

Attributes

protocol
python_type
supported_format