class flytekitplugins.kfpytorch.Elastic(nnodes=1, nproc_per_node=1, start_method='spawn', monitor_interval=5, max_restarts=0, rdzv_configs=<factory>)

Configuration for torch elastic training.

Use this to run single- or multi-node distributed PyTorch elastic training on Kubernetes.

Single-node elastic training is executed in a single k8s pod when nnodes is set to 1. Otherwise, multi-node training is executed using a Kubeflow PyTorchJob.

  • nnodes (Union[int, str]) – Number of nodes, or the range of nodes in form <minimum_nodes>:<maximum_nodes>.

  • nproc_per_node (int) – Number of workers per node.

  • start_method (str) – Multiprocessing start method to use when creating workers.

  • monitor_interval (int) – Interval, in seconds, to monitor the state of workers.

  • max_restarts (int) – Maximum number of worker group restarts before failing.

  • rdzv_configs (Dict[str, Any]) – Additional rendezvous configs to pass to torch elastic, e.g. {"timeout": 1200, "join_timeout": 900}. See torch.distributed.launcher.api.LaunchConfig and torch.distributed.elastic.rendezvous.dynamic_rendezvous.create_handler.

Return type

None
max_restarts: int = 0
monitor_interval: int = 5
nnodes: Union[int, str] = 1
nproc_per_node: int = 1
start_method: str = 'spawn'
rdzv_configs: Dict[str, Any]
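A minimal sketch of how the fields above fit together, using a plain stdlib dataclass as a stand-in for the plugin class (the real class lives in the flytekitplugins-kfpytorch package; the field names and defaults below are taken from the signature above, and the stand-in omits any validation the real class may do):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class Elastic:
    # Stand-in mirroring the documented fields and defaults of
    # flytekitplugins.kfpytorch.Elastic.
    nnodes: Union[int, str] = 1          # int, or "min:max" range as a string
    nproc_per_node: int = 1              # workers per node
    start_method: str = "spawn"          # multiprocessing start method
    monitor_interval: int = 5            # seconds between worker-state checks
    max_restarts: int = 0                # worker group restarts before failing
    rdzv_configs: Dict[str, Any] = field(default_factory=dict)


# Single-node elastic training: runs in a single k8s pod.
single_node = Elastic(nnodes=1, nproc_per_node=4)

# Multi-node elastic training: an elastic range of 2 to 4 nodes,
# with extended rendezvous timeouts passed through to torch elastic.
multi_node = Elastic(
    nnodes="2:4",
    nproc_per_node=8,
    max_restarts=3,
    rdzv_configs={"timeout": 1200, "join_timeout": 900},
)
```

In Flyte, a config like this is typically attached to a task via its task_config argument, e.g. `@task(task_config=Elastic(nnodes=2, nproc_per_node=4))`.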