13 April, 2025
What is torch.distributed.elastic?
torch.distributed.elastic is PyTorch's framework for fault-tolerant, elastic distributed training that adapts automatically to changes in the cluster. Unlike static distributed training, elastic training:
- Handles node failures gracefully – Automatically recovers from worker crashes
- Supports dynamic scaling – Adjusts to adding/removing workers mid-training
- Maintains training continuity – Preserves progress across interruptions
- Works with cloud environments – Ideal for spot instances and preemptible VMs (see the launch sketch below)
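In practice, the usual entry point to all of this is the torchrun launcher, which wraps the elastic agent described below. A minimal sketch of an elastic launch; the script name train.py, the host, and the port are placeholders:

```bash
# Elastic job: anywhere from 1 to 4 nodes, 4 workers (GPUs) per node,
# and up to 3 restarts of the worker group on failure.
torchrun \
    --nnodes=1:4 \
    --nproc_per_node=4 \
    --max_restarts=3 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HOST_NODE:29400 \
    --rdzv_id=my_training_job \
    train.py
```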
Key Components
- Agents (coordinator processes)
- Worker groups (elastic groups of processes)
- Rendezvous (dynamic worker discovery)
- Failure handlers (automatic recovery)
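These components map onto concrete modules in the torch.distributed.elastic package. A quick import sketch of where each piece lives (paths as in recent PyTorch releases; exact locations can shift between versions):

```python
# Where the key components live (module paths may vary slightly across versions).
from torch.distributed.elastic.agent.server.api import WorkerSpec                         # worker-group spec
from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent  # per-node agent
from torch.distributed.elastic.rendezvous import RendezvousHandler                        # worker discovery
from torch.distributed.elastic.multiprocessing.errors import record                       # failure handling hook
```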
Code Examples
1. Basic Elastic Training Setup
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.elastic.agent.server.api import WorkerSpec
from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous import registry as rdzv_registry

def train_loop():
    # The agent exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK,
    # so the standard env:// initialization is already elastic-aware.
    dist.init_process_group(backend="nccl")
    model = DDP(MyModel().cuda())            # MyModel: your model class
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(epochs):              # epochs, dataloader: your training setup
        for batch in dataloader:
            ...                              # standard forward/backward/optimizer step

if __name__ == "__main__":
    # Launch through an elastic agent (torchrun builds this machinery for you;
    # it is spelled out here to show the moving parts).
    rdzv_handler = rdzv_registry.get_rendezvous_handler(
        RendezvousParameters(
            backend="c10d",                  # built-in rendezvous backend
            endpoint="localhost:29400",
            run_id="basic_example",
            min_nodes=1,
            max_nodes=1,
        )
    )
    spec = WorkerSpec(
        role="trainer",
        local_world_size=4,                  # processes (GPUs) per node
        entrypoint=train_loop,
        rdzv_handler=rdzv_handler,
        max_restarts=3,
    )
    # Depending on the PyTorch version, LocalElasticAgent may also require a
    # logs_specs argument (DefaultLogsSpecs from torch.distributed.elastic.multiprocessing).
    agent = LocalElasticAgent(spec, start_method="spawn")
    agent.run()
```
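Each worker started by the agent (or by torchrun) receives its coordinates through environment variables, which is why the plain init_process_group call above needs no elastic-specific arguments. A quick way to inspect them from inside train_loop:

```python
import os

# Set by the elastic agent / torchrun for every worker process.
rank = int(os.environ["RANK"])              # global rank
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node (use it to pick the CUDA device)
world_size = int(os.environ["WORLD_SIZE"])  # current number of workers (can change across restarts)
master = f'{os.environ["MASTER_ADDR"]}:{os.environ["MASTER_PORT"]}'
print(f"worker {rank}/{world_size} (local rank {local_rank}), master at {master}")
```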
2. Custom Rendezvous Backend
```python
from torch.distributed.elastic.rendezvous import RendezvousHandler, RendezvousParameters, rendezvous_handler_registry

class CustomRendezvous(RendezvousHandler):
    def next_rendezvous(self):
        # Custom worker discovery: block until enough workers have joined,
        # then return the agreed-upon (store, rank, world_size).
        return store, rank, world_size
    # The remaining abstract methods (get_backend, is_closed, set_closed,
    # num_nodes_waiting, get_run_id, shutdown) must be implemented as well.

# Register a creator so "custom" can be used as a rendezvous backend name.
def _create_custom_handler(params: RendezvousParameters) -> RendezvousHandler:
    return CustomRendezvous()

rendezvous_handler_registry.register("custom", _create_custom_handler)
```
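Once registered, the custom backend can be resolved the same way the built-in ones are. A sketch assuming the registration above has already run; the endpoint and run_id values are placeholders:

```python
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous import registry as rdzv_registry

params = RendezvousParameters(
    backend="custom",            # the backend name registered above
    endpoint="localhost:29400",  # placeholder endpoint
    run_id="my_job",             # placeholder job id
    min_nodes=1,
    max_nodes=4,
)
rdzv_handler = rdzv_registry.get_rendezvous_handler(params)
# The handler can now be passed to WorkerSpec(..., rdzv_handler=rdzv_handler).
```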
3. Handling Worker Failures
```python
from torch.distributed.elastic.multiprocessing.errors import record

# @record captures the worker's exception/traceback in an error file so the
# elastic agent can report the root-cause failure and decide whether to
# restart the worker group.
@record
def train_loop():
    ...  # training logic as in example 1

# The restart policy itself is configured on the agent side:
#   WorkerSpec(max_restarts=3, monitor_interval=5)  # check health every 5s, restart up to 3 times
# or, when launching with torchrun: --max_restarts=3 --monitor_interval=5.
# (Std, from torch.distributed.elastic.multiprocessing, is an enum for redirecting
# worker stdout/stderr, e.g. WorkerSpec(redirects=Std.ALL).)
```
Common Methods & Components
Component/Method | Purpose |
---|---|
LocalElasticAgent | Coordinates the worker processes on a single node |
WorkerSpec | Defines the worker group configuration (role, processes per node, restart policy) |
RendezvousHandler | Manages dynamic worker discovery |
init_process_group() | Standard process group initialization; elastic-aware because the agent sets the required environment variables |
record() | Captures worker errors/tracebacks so the agent can report failures and trigger recovery |
Errors & Debugging Tips
Common Issues
- Rendezvous timeouts – Workers fail to discover each other; check that every node can reach the rendezvous endpoint and that all nodes use the same rendezvous id
- Version mismatches – Different PyTorch versions across nodes; keep the software environment identical on every node
- Partial failures – Some workers crash while others continue; the agent's health monitor detects this and restarts the whole worker group (up to max_restarts)
- Checkpoint conflicts – Multiple workers trying to save simultaneously; see the rank-0 checkpointing sketch below
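For the last issue, a common pattern is to let only rank 0 write the checkpoint while every other worker waits at a barrier. A minimal sketch; the file path and the exact state being saved are placeholders:

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):  # path is a placeholder
    # Only one worker writes, so concurrent saves cannot clobber each other.
    if dist.get_rank() == 0:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
            path,
        )
    # Everyone waits until the checkpoint is fully written before continuing
    # (and before a restarted run might try to load it).
    dist.barrier()
```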