What is Data Catalog?

Tags: Advanced, Design

DataCatalog is a service that indexes parameterized, strongly-typed data artifacts across revisions. Clients can query artifacts by their metadata and tags.

How Flyte Memoizes Task Executions on Data Catalog

Flyte memoizes task executions by creating artifacts in DataCatalog and associating metadata about the execution with each artifact. Let’s walk through what happens when a task execution is cached in DataCatalog.

Every task instance is represented as a Dataset:

Dataset {
   project: Flyte project the task was registered in
   domain: Flyte domain for the task execution
   name: flyte_task-<taskName>
   version: <cache_version>-<hash(input params)>-<hash(output params)>
}

Every task execution is represented as an Artifact in the Dataset above:

Artifact {
   id: uuid
   Metadata: [executionName, executionVersion]
   ArtifactData: [List of ArtifactData]
}


ArtifactData {
   Name: <output-name>
   value: <offloaded storage location of the literal>
}

So that the Artifact can be retrieved later, it is tagged with a hash of the input values of the memoized task execution:

ArtifactTag {
   Name: flyte_cached-<unique hash of the input values>
}
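The structures above can be sketched as plain Python dataclasses. This is only an illustrative model, not DataCatalog's actual schema or API: the class layout mirrors the pseudo-definitions in this section, and `cache_tag` is a hypothetical helper. In particular, hashing the inputs with SHA-256 over a key-sorted JSON encoding is an assumption; this page does not specify the hashing scheme Flyte uses.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    project: str   # Flyte project the task was registered in
    domain: str    # Flyte domain for the task execution
    name: str      # flyte_task-<taskName>
    version: str   # <cache_version>-<hash(input params)>-<hash(output params)>


@dataclass
class ArtifactData:
    name: str    # output name
    value: str   # offloaded storage location of the literal


@dataclass
class Artifact:
    dataset: Dataset
    metadata: dict                             # e.g. executionName, executionVersion
    data: list = field(default_factory=list)   # list of ArtifactData
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def cache_tag(inputs: dict) -> str:
    """Hypothetical tag name: SHA-256 over the inputs serialized with sorted keys."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"flyte_cached-{digest}"
```

Sorting the keys before hashing makes the tag independent of the order in which inputs are supplied, which is the property a lookup by input values needs.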

When caching an execution, FlytePropeller will:

  1. Create a dataset for the task.

  2. Create an artifact that represents the execution, along with the artifact data that represents the execution output.

  3. Tag the artifact with a unique hash of the input values.
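The three steps above can be sketched against an in-memory stand-in for the catalog. Everything here is hypothetical: `InMemoryCatalog`, `cache_execution`, the tuple-shaped dataset key, and the SHA-256-over-JSON input hash are assumptions made for illustration and do not reflect the real DataCatalog client or FlytePropeller code.

```python
import hashlib
import json
import uuid


class InMemoryCatalog:
    """Hypothetical stand-in for DataCatalog; not its real API."""

    def __init__(self):
        self.datasets = set()   # dataset keys
        self.artifacts = {}     # artifact id -> artifact record
        self.tags = {}          # (dataset key, tag name) -> artifact id


def cache_execution(catalog, dataset_key, inputs, output_locations, metadata):
    # 1. Create a dataset for the task (a no-op if it already exists).
    catalog.datasets.add(dataset_key)

    # 2. Create an artifact for the execution, with one data entry per output.
    artifact_id = str(uuid.uuid4())
    catalog.artifacts[artifact_id] = {
        "dataset": dataset_key,
        "metadata": metadata,
        "data": [{"name": n, "value": loc} for n, loc in output_locations.items()],
    }

    # 3. Tag the artifact with a hash of the input values (hash scheme assumed).
    encoded = json.dumps(inputs, sort_keys=True).encode("utf-8")
    tag = f"flyte_cached-{hashlib.sha256(encoded).hexdigest()}"
    catalog.tags[(dataset_key, tag)] = artifact_id
    return artifact_id, tag
```

In this sketch the tag is scoped to the dataset key, so identical input values for two different tasks (or two cache versions of the same task) do not collide.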

To check whether a task execution is already memoized, FlytePropeller will:

  1. Compute the tag by hashing the input values.

  2. Check if a tagged artifact exists with that hash.

    • If it exists, this is a cache hit and FlytePropeller can skip the task execution.

    • If no artifact is associated with the tag, FlytePropeller needs to run the task.
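The lookup flow above can be illustrated with a minimal sketch. `run_with_cache` is a hypothetical function, the catalog is reduced to a dictionary from tag name to cached outputs, and the SHA-256-over-JSON hash is an assumption; the real lookup goes through DataCatalog and is scoped to the task's dataset.

```python
import hashlib
import json


def run_with_cache(tagged_outputs, inputs, run_task):
    """Return (outputs, cache_hit); tagged_outputs maps tag name -> cached outputs."""
    # 1. Compute the tag from a hash of the input values (hash scheme assumed).
    encoded = json.dumps(inputs, sort_keys=True).encode("utf-8")
    tag = f"flyte_cached-{hashlib.sha256(encoded).hexdigest()}"

    # 2. Check whether an artifact is associated with that tag.
    if tag in tagged_outputs:
        return tagged_outputs[tag], True   # cache hit: skip the execution

    outputs = run_task(inputs)             # cache miss: run the task
    tagged_outputs[tag] = outputs          # record it so the next lookup hits
    return outputs, False
```

Calling `run_with_cache` twice with the same inputs invokes `run_task` only on the first call; the second call returns the stored outputs with `cache_hit` set to `True`.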