Understanding the State Transition in a Workflow

Tags: Basic, Design

High Level Overview of How a Workflow Progresses to Success

        flowchart TD
  id1(( ))
  id1 --> Ready
  Ready --> Running
  subgraph Running
  id2(( ))
  id2 --> NodeQueued
  NodeQueued --> NodeRunning
  subgraph NodeRunning
  id3(( ))
  id3 --> TaskQueued
  TaskQueued --> TaskRunning
  TaskRunning --> TaskSuccess
  end
  TaskSuccess --> NodeSuccess
  end
  NodeSuccess --> Success
    

This state diagram gives a simplified, high-level view of the state transitions that a workflow with a single node and a single task goes through on its way to success, as observed by the user.

The following sections explain the various observable (and some hidden) states for workflow, node, and task state transitions.

Workflow States

        flowchart TD
  Queued -->|On system errors more than threshold| Aborted
  Queued --> Ready
  Ready--> |Write inputs to workflow| Running
  Running--> |On system error| Running
  Running--> |On all Nodes Success| Succeeding
  Succeeding--> |On successful event send to Admin| Succeeded
  Succeeding--> |On system error| Succeeding
  Ready--> |On precondition failure| Failing
  Running--> |On any Node Failure| Failing
  Ready--> |On user initiated abort| Aborting
  Running--> |On user initiated abort| Aborting
  Succeeding--> |On user initiated abort| Aborting
  Failing--> |If Failure node exists| HandleFailureNode
  Failing--> |On user initiated abort| Aborting
  HandleFailureNode--> |On completing failure node| Failed
  HandleFailureNode--> |On user initiated abort| Aborting
  Failing--> |On successful send of Failure node| Failed
  Aborting--> |On successful event send to Admin| Aborted
    

A workflow always starts in the Ready state and ends in the Failed, Succeeded, or Aborted state. Any system error within a state causes a retry of that state. These retries are capped by a system retry threshold; if the failure persists beyond it, the workflow eventually moves to the Aborted state.
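As an illustration only, the edges of the diagram above can be written down as a transition table. This is a minimal sketch, not FlytePropeller's implementation: the `WorkflowPhase` enum, the `can_transition` helper, and the `next_phase_on_system_error` retry policy (including its `max_system_retries` parameter) are hypothetical names introduced here, and the retry cap is an assumption based on the paragraph above.

```python
from enum import Enum


class WorkflowPhase(Enum):
    QUEUED = "Queued"
    READY = "Ready"
    RUNNING = "Running"
    SUCCEEDING = "Succeeding"
    SUCCEEDED = "Succeeded"
    FAILING = "Failing"
    HANDLE_FAILURE_NODE = "HandleFailureNode"
    FAILED = "Failed"
    ABORTING = "Aborting"
    ABORTED = "Aborted"


P = WorkflowPhase

# Edges copied from the workflow diagram; self-loops model "retry on system error".
TRANSITIONS = {
    P.QUEUED: {P.READY, P.ABORTED},
    P.READY: {P.RUNNING, P.FAILING, P.ABORTING},
    P.RUNNING: {P.RUNNING, P.SUCCEEDING, P.FAILING, P.ABORTING},
    P.SUCCEEDING: {P.SUCCEEDING, P.SUCCEEDED, P.ABORTING},
    P.FAILING: {P.HANDLE_FAILURE_NODE, P.FAILED, P.ABORTING},
    P.HANDLE_FAILURE_NODE: {P.FAILED, P.ABORTING},
    P.ABORTING: {P.ABORTED},
    P.SUCCEEDED: set(),
    P.FAILED: set(),
    P.ABORTED: set(),
}

# Phases with no outgoing edges are the end states named in the text.
TERMINAL = {phase for phase, targets in TRANSITIONS.items() if not targets}


def can_transition(src, dst):
    """Return True if the diagram contains an edge from src to dst."""
    return dst in TRANSITIONS[src]


def next_phase_on_system_error(phase, failures, max_system_retries):
    """Hypothetical retry policy: retry the same phase until the cap is exceeded."""
    if failures > max_system_retries:
        return P.ABORTED
    return phase
```

For example, `can_transition(P.READY, P.RUNNING)` holds, while nothing can leave `P.SUCCEEDED`. Keeping the edges in one table makes it easy to compare the sketch line by line against the diagram.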

Every transition between states is recorded in FlyteAdmin using workflowexecutionevent.

The phases in the state diagram above are captured in the admin database as specified in workflowexecution.phase and are sent as part of the Execution event.

The state machine specification for the illustration can be found here.

Node States

        flowchart TD
  id1(( ))
  id1-->NotYetStarted
  id1-->|Will stop the node execution |Aborted
  NotYetStarted-->|If all upstream nodes are ready, i.e., inputs are ready | Queued
  NotYetStarted--> |If the branch was not taken |Skipped
  Queued-->|Start task execution- attempt 0 | Running
  Running-->|If task timeout has elapsed and retry_attempts >= max_retries|TimingOut
  Running-->|Internal state|Succeeding
  Running-->|For dynamic nodes generating workflows| DynamicRunning
  DynamicRunning-->TimingOut
  DynamicRunning-->RetryableFailure
  TimingOut-->|If total node timeout has elapsed|TimedOut
  DynamicRunning-->Succeeding
  Succeeding-->|User observes the task as succeeded| Succeeded
  Running-->|on retryable failure| RetryableFailure
  RetryableFailure-->|if retry_attempts < max_retries|Running
  RetryableFailure-->|retry_attempts >= max_retries|Failing
  Failing-->Failed
  Succeeded-->id2(( ))
  Failed-->id2(( ))
    

This state diagram illustrates how a node transitions through its various states. This is the core finite state machine for a node. From the user's perspective, a workflow simply consists of a sequence of tasks. Internally, however, Flyte creates a meta entity for the workflow known as a node.

Once a workflow enters the Running state, it triggers the phantom start node of the workflow. The start node is considered the entry node of any workflow. The start node begins by executing all its child nodes recursively, using a modified depth-first search algorithm.
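The traversal described above can be pictured with a small recursive depth-first walk. This is a sketch under stated assumptions, not Flyte's actual algorithm: the source does not spell out how the search is modified, so the readiness check used here (a node runs only after all of its upstream nodes have finished) is an assumption borrowed from the "If all upstream nodes are ready" condition in the node diagram. The `traverse` function, the `graph` layout, and the node names are hypothetical.

```python
def traverse(graph, node, done=None, order=None):
    """Recursive depth-first walk that runs a node only after all its upstream nodes finished."""
    if done is None:
        done, order = set(), []
    if node in done:
        return order
    # Upstream nodes are those that list this node among their children.
    upstream = [u for u, children in graph.items() if node in children]
    if any(u not in done for u in upstream):
        return order  # not ready yet; another branch will reach it again
    done.add(node)
    order.append(node)
    for child in graph.get(node, []):
        traverse(graph, child, done, order)
    return order


# Hypothetical workflow: node "c" depends on both "a" and "b".
graph = {
    "start-node": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["end-node"],
    "end-node": [],
}

execution_order = traverse(graph, "start-node")
```

In this example the walk reaches `c` first through `a`, finds that `b` has not finished, backs off, and only runs `c` once it is reached again through `b`.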

Nodes can be of different types as listed below, but all the nodes traverse through the same transitions:

  1. Start Node - Only exists during the execution and is not modeled in the core spec.

  2. Task Node

  3. Branch Node

  4. Workflow Node

  5. Dynamic Node - A task node that, instead of returning outputs, produces a dynamic workflow. While the task itself runs, the node remains in the RUNNING state. Once the task completes and Flyte starts executing the dynamic workflow, the overarching node that contains both the original task and the dynamic workflow enters the DYNAMIC_RUNNING state.

  6. End Node - Only exists during the execution and is not modeled in the core spec.
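Since every node type follows the same transitions, the retry edges in the node diagram can be illustrated with one small simulation. This is a minimal sketch: the `after_retryable_failure` and `simulate` helpers are hypothetical, the phase names are taken from the diagram, and the point at which `retry_attempts` is incremented is an assumption that the source does not specify. Timeouts and dynamic nodes are left out.

```python
def after_retryable_failure(retry_attempts, max_retries):
    """Apply the two RetryableFailure edges from the node diagram."""
    return "Running" if retry_attempts < max_retries else "Failing"


def simulate(outcomes, max_retries):
    """Walk a node through attempt outcomes ('ok' or 'retryable') and list visited phases."""
    phases = ["NotYetStarted", "Queued", "Running"]
    retry_attempts = 0
    for outcome in outcomes:
        if outcome == "ok":
            phases += ["Succeeding", "Succeeded"]
            return phases
        phases.append("RetryableFailure")
        next_phase = after_retryable_failure(retry_attempts, max_retries)
        phases.append(next_phase)
        if next_phase == "Failing":
            phases.append("Failed")
            return phases
        # Assumption: the counter increases each time the node goes back to Running.
        retry_attempts += 1
    return phases


recovered = simulate(["retryable", "ok"], max_retries=1)
exhausted = simulate(["retryable", "retryable"], max_retries=1)
```

With `max_retries=1`, the first call ends in `Succeeded` after one retry, while the second exhausts its retries and ends in `Failed`.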

Every transition between states is recorded in FlyteAdmin using nodeexecutionevent.

Every NodeExecutionEvent can have any nodeexecution.phase.

Note

TODO: Add explanation for each phase.

The state machine specification for the illustration can be found here.

Task States

        flowchart TD
  id1(( ))
  id1-->NotReady
  id1-->|Aborted by NodeHandler- timeouts, external abort, etc.| Aborted
  NotReady-->|Optional-Blocked on resource quota or resource pool | WaitingForResources
  WaitingForResources--> |Optional- Has been submitted, but hasn't started |Queued
  Queued-->|Optional- Prestart initialization | Initializing
  Initializing-->|Actual execution of user code has started|Running
  Running-->|Successful execution|Success
  Running-->|Failed with a retryable error|RetryableFailure
  Running-->|Unrecoverable failure, will stop all execution|PermanentFailure
  Success-->id2(( ))
  RetryableFailure-->id2(( ))
  PermanentFailure-->id2(( ))
    

The state diagram above illustrates the various states through which a task transitions. This is the core finite state machine for a task.
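One way to read the task diagram is as a rule for which phase histories are possible. The checker below is a sketch under assumptions, not part of Flyte: it treats the phases labelled "Optional" as skippable, treats an abort as possible at any point, and treats Success, RetryableFailure, PermanentFailure, and Aborted as end states. The names `ORDER`, `OPTIONAL`, `TERMINAL`, and `is_valid_history` are hypothetical.

```python
# Non-terminal task phases in the order the diagram connects them.
ORDER = ["NotReady", "WaitingForResources", "Queued", "Initializing", "Running"]
# Phases the diagram labels as "Optional" (assumed to be skippable).
OPTIONAL = {"WaitingForResources", "Queued", "Initializing"}
TERMINAL = {"Success", "RetryableFailure", "PermanentFailure", "Aborted"}


def is_valid_history(history):
    """Check that a list of observed task phases is consistent with the diagram."""
    if not history or history[-1] not in TERMINAL:
        return False
    *progress, last = history
    if any(phase not in ORDER for phase in progress):
        return False
    positions = [ORDER.index(phase) for phase in progress]
    if positions != sorted(set(positions)):
        return False  # a phase was repeated or the task moved backwards
    if last != "Aborted" and (not progress or progress[-1] != "Running"):
        return False  # Success and failures are only reached from Running
    reached = ORDER[: positions[-1] + 1] if positions else []
    skipped = set(reached) - set(progress)
    return skipped <= OPTIONAL  # only optional phases may be skipped
```

For example, `["NotReady", "Running", "Success"]` passes because only optional phases were skipped, whereas a history that visits `Queued` before `WaitingForResources` is rejected.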

Every transition between states is recorded in FlyteAdmin using taskexecutionevent.

Every TaskExecutionEvent can have any taskexecution.phase.

Note

TODO: Add explanation for each phase.

The state machine specification for the illustration can be found here.