flytekitplugins.kfpytorch.Elastic
- class flytekitplugins.kfpytorch.Elastic(nnodes=1, nproc_per_node=1, start_method='spawn', monitor_interval=5, max_restarts=0, rdzv_configs=<factory>)
Configuration for torch elastic training.
Use this to run single- or multi-node distributed PyTorch elastic training on Kubernetes.
When nnodes is 1, elastic training is executed in a single Kubernetes pod; otherwise it is executed as a PyTorchJob.
- Parameters:
nnodes (Union[int, str]) – Number of nodes, or the range of nodes in form <minimum_nodes>:<maximum_nodes>.
nproc_per_node (int) – Number of workers per node.
start_method (str) – Multiprocessing start method to use when creating workers.
monitor_interval (int) – Interval, in seconds, to monitor the state of workers.
max_restarts (int) – Maximum number of worker group restarts before failing.
rdzv_configs (Dict[str, Any]) – Additional rendezvous configs to pass to torch elastic, e.g. {"timeout": 1200, "join_timeout": 900}. See torch.distributed.launcher.api.LaunchConfig and torch.distributed.elastic.rendezvous.dynamic_rendezvous.create_handler.
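A minimal usage sketch, assuming flytekit and flytekitplugins-kfpytorch are installed; the task name train_model and its body are hypothetical:

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

# Two-node elastic training, four workers per node; rendezvous
# timeouts are passed through to torch elastic via rdzv_configs.
@task(
    task_config=Elastic(
        nnodes=2,  # or a range such as "1:2" for min:max nodes
        nproc_per_node=4,
        max_restarts=3,
        rdzv_configs={"timeout": 1200, "join_timeout": 900},
    )
)
def train_model() -> None:
    # hypothetical training body; torch elastic sets up the
    # distributed environment (RANK, WORLD_SIZE, etc.) per worker
    ...
```

With nnodes=1 the same task would run inside a single pod, so no PyTorchJob (and no training operator) is required on the cluster.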
Attributes
- max_restarts: int = 0
- monitor_interval: int = 5
- nproc_per_node: int = 1
- start_method: str = 'spawn'