Understanding the State Transition in a Workflow

Tags: Basic, Design

High Level Overview of How a Workflow Progresses to Success

flowchart TD
    id1(( ))
    id1 --> Ready
    Ready --> Running
    subgraph Running
        id2(( ))
        id2 --> NodeQueued
        NodeQueued --> NodeRunning
        subgraph NodeRunning
            id3(( ))
            id3 --> TaskQueued
            TaskQueued --> TaskRunning
            TaskRunning --> TaskSuccess
        end
        TaskSuccess --> NodeSuccess
    end
    NodeSuccess --> Success

This state diagram gives a simplified, high-level view of the state transitions that a workflow with a single node and a single task goes through on its way to success, as observed by the user.

The following sections explain the various observable (and some hidden) states for workflow, node, and task state transitions.

Workflow States

flowchart TD
    Queued -->|On system errors more than threshold| Aborted
    Queued --> Ready
    Ready -->|Write inputs to workflow| Running
    Running -->|On system error| Running
    Running -->|On all Nodes Success| Succeeding
    Succeeding -->|On successful event send to Admin| Succeeded
    Succeeding -->|On system error| Succeeding
    Ready -->|On precondition failure| Failing
    Running -->|On any Node Failure| Failing
    Ready -->|On user initiated abort| Aborting
    Running -->|On user initiated abort| Aborting
    Succeeding -->|On user initiated abort| Aborting
    Failing -->|If Failure node exists| HandleFailureNode
    Failing -->|On user initiated abort| Aborting
    HandleFailureNode -->|On completing failure node| Failed
    HandleFailureNode -->|On user initiated abort| Aborting
    Failing -->|On successful send of Failure node| Failed
    Aborting -->|On successful event send to Admin| Aborted

A workflow always starts in the Ready state and ends in the Failed, Succeeded, or Aborted state. Any system error within a state causes that state to be retried. These retries are capped by a system retry limit; if the failure persists past that limit, the workflow ends in the Aborted state.
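The workflow transitions and the retry cap described above can be sketched as a small transition table. This is a minimal illustration, not Flyte's implementation: the names `WorkflowPhase`, `TRANSITIONS`, `can_transition`, `on_system_error`, and `max_system_retries` are hypothetical, the edges are copied from the diagram (self-loops on system error are handled by the retry helper instead), and jumping straight to Aborted once the cap is exceeded is a simplifying assumption.

```python
from enum import Enum


class WorkflowPhase(Enum):
    QUEUED = "Queued"
    READY = "Ready"
    RUNNING = "Running"
    SUCCEEDING = "Succeeding"
    SUCCEEDED = "Succeeded"
    FAILING = "Failing"
    HANDLE_FAILURE_NODE = "HandleFailureNode"
    FAILED = "Failed"
    ABORTING = "Aborting"
    ABORTED = "Aborted"


P = WorkflowPhase

# Edges copied from the workflow state diagram (hypothetical table, not Flyte code).
TRANSITIONS = {
    P.QUEUED: {P.READY, P.ABORTED},
    P.READY: {P.RUNNING, P.FAILING, P.ABORTING},
    P.RUNNING: {P.SUCCEEDING, P.FAILING, P.ABORTING},
    P.SUCCEEDING: {P.SUCCEEDED, P.ABORTING},
    P.FAILING: {P.HANDLE_FAILURE_NODE, P.FAILED, P.ABORTING},
    P.HANDLE_FAILURE_NODE: {P.FAILED, P.ABORTING},
    P.ABORTING: {P.ABORTED},
}

# Phases with no outgoing edges in the diagram.
TERMINAL = {P.SUCCEEDED, P.FAILED, P.ABORTED}


def can_transition(src, dst):
    """Return True if the diagram has an edge from src to dst."""
    return dst in TRANSITIONS.get(src, set())


def on_system_error(phase, error_count, max_system_retries):
    """Retry the same phase on a system error until the cap is exceeded.

    Assumption: once the error count exceeds max_system_retries, the
    workflow is treated as Aborted.
    """
    error_count += 1
    if error_count > max_system_retries:
        return P.ABORTED, error_count
    return phase, error_count
```

For example, `on_system_error(P.RUNNING, 0, 2)` keeps the workflow in Running with one recorded error, while a third consecutive error under the same cap returns Aborted.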

Every transition between states is recorded in FlyteAdmin using workflowexecutionevent.

The phases in the state diagram above are stored in the admin database as workflowexecution.phase values and are sent as part of the Execution event.

The state machine specification for the illustration can be found here.

Node States

flowchart TD
    id1(( ))
    id1 --> NotYetStarted
    id1 -->|Will stop the node execution| Aborted
    NotYetStarted -->|If all upstream nodes are ready, i.e, inputs are ready| Queued
    NotYetStarted -->|If the branch was not taken| Skipped
    Queued -->|Start task execution- attempt 0| Running
    Running -->|If task timeout has elapsed and retry_attempts >= max_retries| TimingOut
    Running -->|Internal state| Succeeding
    Running -->|For dynamic nodes generating workflows| DynamicRunning
    DynamicRunning --> TimingOut
    DynamicRunning --> RetryableFailure
    TimingOut -->|If total node timeout has elapsed| TimedOut
    DynamicRunning --> Succeeding
    Succeeding -->|User observes the task as succeeded| Succeeded
    Running -->|on retryable failure| RetryableFailure
    RetryableFailure -->|if retry_attempts < max_retries| Running
    RetryableFailure -->|retry_attempts >= max_retries| Failing
    Failing --> Failed
    Succeeded --> id2(( ))
    Failed --> id2(( ))

This state diagram illustrates how a node transitions through its various states. This is the core finite state machine for a node. From the user's perspective, a workflow simply consists of a sequence of tasks. Internally, however, Flyte represents each step of a workflow as a meta entity known as a node.
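The retry branch of the node state machine (RetryableFailure going back to Running while `retry_attempts < max_retries`, and on to Failing otherwise) can be sketched as follows. This is an illustrative sketch only: the function names and the `"ok"` / `"retryable"` outcome strings are hypothetical, and the point at which `retry_attempts` is incremented is an assumption, not something the diagram specifies.

```python
def next_phase_after_retryable_failure(retry_attempts, max_retries):
    """Pick the phase that follows RetryableFailure, per the diagram's edge labels."""
    if retry_attempts < max_retries:
        return "Running"
    return "Failing"


def run_node(attempt_outcomes, max_retries):
    """Walk a node through its phases for a list of per-attempt outcomes.

    Each outcome is "ok" or "retryable" (hypothetical labels). The counter is
    assumed to be incremented each time the node goes back to Running.
    """
    phases = ["NotYetStarted", "Queued", "Running"]
    retry_attempts = 0
    for outcome in attempt_outcomes:
        if outcome == "ok":
            phases += ["Succeeding", "Succeeded"]
            return phases
        phases.append("RetryableFailure")
        if next_phase_after_retryable_failure(retry_attempts, max_retries) == "Failing":
            phases += ["Failing", "Failed"]
            return phases
        retry_attempts += 1
        phases.append("Running")
    return phases
```

With `max_retries=1`, one retryable failure followed by a success ends in Succeeded, while two retryable failures in a row end in Failed.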

Once a workflow enters the Running state, it triggers the phantom start node of the workflow. The start node is considered the entry node of any workflow. It begins by recursively executing all of its child nodes using a modified depth-first search algorithm.
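The traversal from the start node can be pictured with a small recursive sketch. The text does not say how the depth-first search is modified, so the readiness check used here (a node is visited only once all of its upstream nodes are done) is an assumption borrowed from the "all upstream nodes are ready" edge in the node diagram. The graph, the node names, and the `traverse` helper are hypothetical, and visiting a node stands in for running it to completion, which a real engine would do asynchronously.

```python
def traverse(node, downstream, upstream, done, order):
    """Depth-first walk that visits a node only after all its upstream nodes are done."""
    if node in done:
        return
    if any(parent not in done for parent in upstream.get(node, [])):
        # Not ready yet; a later visit through another parent will pick it up.
        return
    done.add(node)
    order.append(node)
    for child in downstream.get(node, []):
        traverse(child, downstream, upstream, done, order)


# Hypothetical diamond-shaped workflow: a and b both feed c.
downstream = {
    "start-node": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["end-node"],
}

# Derive the upstream map from the downstream edges.
upstream = {}
for parent, children in downstream.items():
    for child in children:
        upstream.setdefault(child, []).append(parent)

order = []
traverse("start-node", downstream, upstream, set(), order)
print(order)
```

Here the walk reaches `c` through `a` first, skips it because `b` is not done, and visits it on the second arrival through `b`, giving the order `start-node, a, b, c, end-node`.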

Nodes come in the types listed below, and all of them go through the same transitions:

  1. Start Node - Only exists during the execution and is not modeled in the core spec.

  2. Task Node

  3. Branch Node

  4. Workflow Node

  5. Dynamic Node - A task node that does not return outputs directly but instead constitutes a dynamic workflow. While the task runs, the node remains in the RUNNING state. Once the task completes and Flyte starts executing the dynamic workflow, the overarching node that contains both the original task and the dynamic workflow enters the DYNAMIC_RUNNING state.

  6. End Node - Only exists during the execution and is not modeled in the core spec.

Every transition between states is recorded in FlyteAdmin using nodeexecutionevent.

Every NodeExecutionEvent can have any nodeexecution.phase.

Note

TODO: Add explanation for each phase.

The state machine specification for the illustration can be found here.

Task States

flowchart TD
    id1(( ))
    id1 -->|Aborted by NodeHandler- timeouts, external abort, etc,.| NotReady
    id1 --> Aborted
    NotReady -->|Optional-Blocked on resource quota or resource pool| WaitingForResources
    WaitingForResources -->|Optional- Has been submitted, but hasn't started| Queued
    Queued -->|Optional- Prestart initialization| Initializing
    Initializing -->|Actual execution of user code has started| Running
    Running -->|Successful execution| Success
    Running -->|Failed with a retryable error| RetryableFailure
    Running -->|Unrecoverable failure, will stop all execution| PermanentFailure
    Success --> id2(( ))
    RetryableFailure --> id2(( ))
    PermanentFailure --> id2(( ))

The state diagram above illustrates the various states through which a task transitions. This is the core finite state machine for a task.
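One way to read the task diagram is as a forward-only pipeline in which the phases marked "Optional" may be skipped. The sketch below checks a recorded phase history against that reading. It is an illustration under stated assumptions: the helper `is_valid_task_history` and the constants are hypothetical, treating "Optional" as "may be skipped" is an interpretation of the edge labels, and the Aborted edge is left out for brevity.

```python
# Non-terminal phases in diagram order. WaitingForResources, Queued, and
# Initializing are labeled optional in the diagram; this sketch assumes that
# means they may be skipped.
PIPELINE = ["NotReady", "WaitingForResources", "Queued", "Initializing", "Running"]
TERMINAL = {"Success", "RetryableFailure", "PermanentFailure"}


def is_valid_task_history(history):
    """Check that a phase history starts at NotReady, only moves forward
    through PIPELINE, and reaches a terminal phase directly from Running."""
    if len(history) < 3 or history[0] != "NotReady" or history[-1] not in TERMINAL:
        return False
    body = history[:-1]
    if body[-1] != "Running":
        return False
    if any(phase not in PIPELINE for phase in body):
        return False
    positions = [PIPELINE.index(phase) for phase in body]
    # Strictly increasing positions means no phase repeats or goes backwards.
    return all(a < b for a, b in zip(positions, positions[1:]))
```

Under these assumptions, `["NotReady", "Running", "Success"]` is accepted because every optional phase was skipped, while a history that moves from Running back to Queued is rejected.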

Every transition between states is recorded in FlyteAdmin using taskexecutionevent.

Every TaskExecutionEvent can have any taskexecution.phase.

Note

TODO: Add explanation for each phase.

The state machine specification for the illustration can be found here.