flytekitplugins.kfpytorch.Elastic

class flytekitplugins.kfpytorch.Elastic(nnodes=1, nproc_per_node=1, start_method='spawn', monitor_interval=5, max_restarts=0, rdzv_configs=<factory>)

Configuration for torch elastic training.

Use this to run single-node or multi-node distributed PyTorch elastic training on Kubernetes.

When nnodes is set to 1, training runs as single-node elastic training in a single Kubernetes pod. For any other value, training runs as multi-node training using a PyTorch Job.

Parameters:

nnodes (Union[int, str]) – Number of nodes, or the range of nodes in form <minimum_nodes>:<maximum_nodes>.
nproc_per_node (int) – Number of workers per node.
start_method (str) – Multiprocessing start method to use when creating workers.
monitor_interval (int) – Interval, in seconds, to monitor the state of workers.
max_restarts (int) – Maximum number of worker group restarts before failing.
rdzv_configs (Dict[str, Any]) – Additional rendezvous configs to pass to torch elastic, e.g. {"timeout": 1200, "join_timeout": 900}. See torch.distributed.launcher.api.LaunchConfig and torch.distributed.elastic.rendezvous.dynamic_rendezvous.create_handler.
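
The two accepted forms of nnodes (a plain integer or a <minimum_nodes>:<maximum_nodes> string) can be illustrated with a small standalone sketch. The helpers parse_nnodes and worker_range below are hypothetical names introduced only for illustration and are not part of the plugin. The sketch assumes the total worker count is the node count multiplied by nproc_per_node.

```python
from typing import Tuple, Union


def parse_nnodes(nnodes: Union[int, str]) -> Tuple[int, int]:
    """Return (min_nodes, max_nodes) for an int or a "<min>:<max>" string.

    Hypothetical helper for illustration; not the plugin's own parser.
    """
    parts = str(nnodes).split(":")
    if len(parts) == 1:
        lo = hi = int(parts[0])  # fixed size: minimum and maximum coincide
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"nnodes={nnodes!r} is not in <min>:<max> form")
    if lo < 1 or hi < lo:
        raise ValueError(f"nnodes={nnodes!r} must satisfy 1 <= min <= max")
    return lo, hi


def worker_range(nnodes: Union[int, str], nproc_per_node: int) -> Tuple[int, int]:
    """Smallest and largest total worker count for a configuration."""
    lo, hi = parse_nnodes(nnodes)
    return lo * nproc_per_node, hi * nproc_per_node
```

For example, under these assumptions worker_range("2:4", 8) spans 16 to 32 workers, while worker_range(1, 1) describes the default single-worker setup.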

Attributes

max_restarts: int = 0

monitor_interval: int = 5

nnodes: int | str = 1

nproc_per_node: int = 1

start_method: str = 'spawn'

rdzv_configs: Dict[str, Any]
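
The <factory> default shown for rdzv_configs in the signature suggests the dictionary is built per instance rather than shared between instances. The sketch below shows that pattern with a stand-in dataclass. ElasticSketch is a hypothetical name that only mirrors the documented fields and defaults, and its empty-dict factory is an assumption, since the real default contents are not shown on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class ElasticSketch:
    """Hypothetical stand-in mirroring the documented fields; not the plugin class."""

    nnodes: Union[int, str] = 1
    nproc_per_node: int = 1
    start_method: str = "spawn"
    monitor_interval: int = 5
    max_restarts: int = 0
    # Assumption: an empty dict. The real factory's contents are not documented here.
    rdzv_configs: Dict[str, Any] = field(default_factory=dict)


# Each instance gets its own dict, so changing one does not affect another.
a = ElasticSketch()
b = ElasticSketch(nnodes="2:4", rdzv_configs={"timeout": 1200, "join_timeout": 900})
a.rdzv_configs["timeout"] = 60  # affects only a
```

A default factory avoids the shared-mutable-default problem: a later ElasticSketch() still starts with a fresh dictionary, regardless of what was written into a.rdzv_configs.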