Understanding the State Transition in a Workflow

High-Level Overview of How a Workflow Progresses to Success

Happy case for a workflow with one node and one task.

This state diagram gives a deliberately simplified, high-level view of the state transitions that a workflow with a single node and a single task goes through on its way to success, as seen by an observer.

The following sections explain the various observable (and some hidden) states for workflow, node, and task state transitions.

Workflow States

The state diagram above illustrates the various states through which a workflow transitions. This is the core finite state machine of a workflow.

A workflow always starts in the Ready state and ends in one of the Failed, Succeeded, or Aborted states. Any system error within a state causes that state to be retried. These retries are capped by the system retry limit; if the failure persists once the limit is reached, the workflow moves to the Aborted state.
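
The retry-then-abort behaviour described above can be sketched in Python. This is a simplified illustration, not Flyte's implementation: it models only the states named in this section, and `SystemFailure`, `run_workflow`, `handlers`, and `max_system_retries` are hypothetical names. It also assumes the retry count is shared across the whole execution.

```python
from enum import Enum


class WorkflowPhase(Enum):
    READY = "Ready"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    ABORTED = "Aborted"


TERMINAL = {WorkflowPhase.SUCCEEDED, WorkflowPhase.FAILED, WorkflowPhase.ABORTED}


class SystemFailure(Exception):
    """Hypothetical marker for a system error (as opposed to a user error)."""


def run_workflow(handlers, max_system_retries):
    """Drive a workflow from READY to a terminal phase.

    `handlers` maps each non-terminal phase to a callable returning the next
    phase. A SystemFailure retries the same phase; once more than
    `max_system_retries` retries would be needed, the workflow is aborted.
    """
    phase = WorkflowPhase.READY
    history = [phase]
    retries = 0  # assumption: one counter for the whole execution
    while phase not in TERMINAL:
        try:
            next_phase = handlers[phase]()
        except SystemFailure:
            retries += 1
            if retries <= max_system_retries:
                continue  # stay in the same phase and try again
            next_phase = WorkflowPhase.ABORTED
        phase = next_phase
        history.append(phase)
    return history
```

With a handler for the Running state that keeps raising `SystemFailure`, the returned history ends in `ABORTED`; with a handler that eventually succeeds within the limit, it ends in `SUCCEEDED`.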

Note

System retry can be of two types:

  • Downstream System Retry: When a downstream system (or service) fails, or a remote service cannot be reached, the operation is retried up to the number of retries set here. This performs an end-to-end system retry of the node whenever the task fails with a system error. It is useful when, for example, the downstream service returns a 500 error or the network fails abruptly.

  • Transient Failure Retry: This retry mechanism provides resilience against transient failures, which are opaque to the user. The retry count is tracked across the entire execution, for its whole duration. It helps Flyte entities and the additional services connected to Flyte, such as S3, continue operating despite a system failure. Indeed, all transient failures are handled gracefully by Flyte! Moreover, a transient failure retry does not necessarily rerun the entire task. Retrying an entire task would mean rerunning the whole pod associated with the Flyte task from a clean slate; here, Flyte retries only the atomic operation that failed. For example, it keeps trying to persist the state until it succeeds or until it exhausts the maximum retries and backs off. To set a transient failure retry:
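
Whatever value the retry limit is set to, the difference between rerunning a whole task and retrying one atomic operation can be sketched in Python. This is an illustration only, not Flyte's implementation: `TransientError`, `retry_operation`, and the exponential backoff schedule are assumptions made for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a failure that may succeed when retried."""


def retry_operation(operation, max_retries, base_delay=0.1, sleep=time.sleep):
    """Retry a single atomic operation, backing off between attempts.

    Only `operation` is invoked again; nothing else is restarted.
    The last TransientError is re-raised once `max_retries` retries are used up.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                raise
            # Assumed schedule: the delay doubles after every failed attempt.
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Here `operation` stands in for something like "persist the state". The `sleep` parameter exists only so the delay can be replaced in tests.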

Every transition between states is recorded in FlyteAdmin using workflowexecutionevent.

The phases in the state diagram above are captured in the admin database, as specified in workflowexecution.phase, and are sent as part of the Execution event.
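
As a rough illustration of "one event per transition", the following Python sketch records an event each time the phase changes. The names `TransitionEvent`, `EventRecorder`, and `transition` are hypothetical, the recorder is an in-memory stand-in for the service that stores events, and the real event message carries more fields than shown here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TransitionEvent:
    """Hypothetical, reduced event shape for illustration only."""
    execution_id: str
    phase: str
    occurred_at: datetime


@dataclass
class EventRecorder:
    """In-memory stand-in for the service that stores events."""
    events: list = field(default_factory=list)

    def record(self, execution_id, phase):
        event = TransitionEvent(execution_id, phase, datetime.now(timezone.utc))
        self.events.append(event)
        return event


def transition(recorder, execution_id, current_phase, new_phase):
    """Move to `new_phase`, recording an event only if the phase changed."""
    if new_phase != current_phase:
        recorder.record(execution_id, new_phase)
    return new_phase
```

Calling `transition` repeatedly with the same phase records nothing, so the stored events mirror the edges of the state diagram rather than every evaluation.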

The state machine specification for the illustration can be found here.

Node States

The state diagram above illustrates the various states through which a node transitions. This is the core finite state machine for a node. From the user’s point of view, a workflow is simply a sequence of tasks. Internally, however, Flyte creates a meta entity called a node.

Once a workflow enters the Running state, it triggers the workflow's phantom start node. The start node is always the entry node of a workflow. From there, Flyte recursively executes all of the start node's child nodes using a modified depth-first search (DFS) algorithm.
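
The traversal can be pictured with a small Python sketch. The text does not say how the depth-first search is modified, so the sketch assumes one plausible rule: a node is executed only after all of its upstream nodes have finished, and at most once. The function name, the node ids, and the graph representation are all hypothetical.

```python
def execute_from_start(graph, start="start-node"):
    """Visit nodes depth-first, beginning at the start node.

    `graph` maps a node id to the list of its downstream node ids.
    Returns the order in which nodes were "executed".
    """
    # Build the reverse edges so each node knows what it depends on.
    upstream = {node: set() for node in graph}
    for node, children in graph.items():
        for child in children:
            upstream.setdefault(child, set()).add(node)

    done = []  # execution order

    def visit(node):
        if node in done:
            return  # already executed
        if not upstream.get(node, set()).issubset(done):
            return  # an upstream node is still pending; a later visit retries
        done.append(node)  # stand-in for actually executing the node
        for child in graph.get(node, []):
            visit(child)

    visit(start)
    return done
```

For a diamond-shaped graph, where `c` depends on both `a` and `b`, the first visit to `c` (reached through `a`) is skipped and `c` runs only when it is reached again through `b`.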

A node can be one of the following types, but all nodes go through the same transitions:

  1. Start Node - Only exists during the execution and is not modeled in the core spec

  2. Task Node

  3. Branch Node

  4. Workflow Node

  5. Dynamic Node - A task node that does not return output but instead constitutes a dynamic workflow. While the task runs, the node stays in the RUNNING state. Once the task completes and Flyte starts executing the dynamic workflow, the overarching node, which contains both the original task and the dynamic workflow, enters the DYNAMIC_RUNNING state.

  6. End Node - Only exists during the execution and is not modeled in the core spec
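
The two running phases of a dynamic node can be summarised in a few lines of Python. This is a sketch under stated assumptions: `dynamic_node_phase` and its two boolean inputs are hypothetical, and the final `SUCCEEDED` value is assumed for the happy path only; failure handling is left out.

```python
def dynamic_node_phase(task_finished, dynamic_workflow_finished):
    """Return the phase of the overarching dynamic node (happy path only)."""
    if not task_finished:
        # The original task is still producing the dynamic workflow.
        return "RUNNING"
    if not dynamic_workflow_finished:
        # The task is done and the generated workflow is executing.
        return "DYNAMIC_RUNNING"
    return "SUCCEEDED"  # assumed terminal phase for this sketch
```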

Every transition between states is recorded in FlyteAdmin using nodeexecutionevent.

Every NodeExecutionEvent can have any nodeexecution.phase.

Note

TODO: Add explanation for each phase.

The state machine specification for the illustration can be found here.

Task States

The state diagram above illustrates the various states through which a task transitions. This is the core finite state machine for a task.

Every transition between states is recorded in FlyteAdmin using taskexecutionevent.

Every TaskExecutionEvent can have any taskexecution.phase.

Note

TODO: Add explanation for each phase.

The state machine specification for the illustration can be found here.