What is torch.distributed?

`torch.distributed` is PyTorch's built-in module for distributed training, enabling parallel processing across multiple GPUs or machines. It supports different communication backends (such as NCCL and Gloo) and provides tools for synchronizing gradients, data parallelism, and multi-node training.
Key Features:
- Multi-GPU & Multi-Node Training – Scale training across multiple devices.
- Communication Backends – Supports NCCL (optimized for NVIDIA GPUs) and Gloo (CPU-focused).
- Collective Operations – Includes `all_reduce`, `broadcast`, and `barrier` for synchronization.
Code Examples
1. Initializing Distributed Training
```python
import torch
import torch.distributed as dist
import os


def setup(backend='gloo'):
    dist.init_process_group(
        backend=backend,
        init_method='env://',
        world_size=int(os.environ['WORLD_SIZE']),
        rank=int(os.environ['RANK'])
    )


# Example usage:
if __name__ == "__main__":
    setup(backend='nccl')  # Use 'gloo' for CPU training
```
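The snippet above reads `RANK` and `WORLD_SIZE` from the environment, which launchers such as `torchrun` set for you. For a quick single-machine test without a launcher, a rough sketch like the following (assuming it lives in the same file as `setup()` above; the `worker` function and the port are placeholders) can populate those variables itself and tear the process group down afterwards:

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Provide the variables that setup() expects to find in the environment.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'   # any free port
    os.environ['RANK'] = str(rank)
    os.environ['WORLD_SIZE'] = str(world_size)

    setup(backend='gloo')                 # setup() from the example above
    # ... training code would go here ...
    dist.destroy_process_group()          # clean shutdown releases the port


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```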
2. Data Parallelism with DistributedDataParallel (DDP)
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 10).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])
```
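Data parallelism usually also involves sharding the input across ranks with `DistributedSampler`. The following is a minimal sketch of one possible training loop, assuming the process group is already initialized (example 1) and each process owns one GPU; the dataset, batch size, and learning rate are placeholders:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes dist.init_process_group(...) has already been called (see example 1).
device = torch.device('cuda', torch.cuda.current_device())
model = nn.Linear(10, 1).to(device)
ddp_model = DDP(model, device_ids=[device.index])

# DistributedSampler gives each rank a disjoint shard of the dataset.
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(2):
    sampler.set_epoch(epoch)          # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()               # DDP all-reduces gradients here
        optimizer.step()
```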
3. Synchronization with dist.barrier()
```python
dist.barrier()  # Waits for all processes to reach this point
print("All processes synchronized!")
```
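A common pattern, sketched below under the assumption that a process group is already initialized: rank 0 performs a one-time task while the other ranks wait at the barrier before using the result. The `prepare_data()` helper is hypothetical.

```python
import torch.distributed as dist


def prepare_data():
    # Hypothetical placeholder for a one-time task such as downloading
    # or preprocessing a dataset.
    pass


if dist.get_rank() == 0:
    prepare_data()        # only rank 0 does the expensive one-time work
dist.barrier()            # every rank waits here until rank 0 is done
# From this point on, all ranks can safely read the prepared data.
```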
Common Methods in torch.distributed
| Method | Description |
|---|---|
| `init_process_group()` | Initializes the distributed backend. |
| `all_reduce(tensor, op)` | Aggregates tensors across all processes. |
| `broadcast(tensor, src)` | Sends a tensor from `src` to all other processes. |
| `barrier()` | Synchronizes all processes. |
| `is_initialized()` | Checks if distributed training is set up. |
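To make the first three rows concrete, here is a short sketch assuming an initialized process group; with the Gloo backend the tensors can stay on the CPU as shown, while NCCL would require moving them to each rank's GPU:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
rank = dist.get_rank()

# all_reduce: every rank contributes its tensor; all ranks receive the sum.
t = torch.tensor([float(rank)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: sum over all ranks = {t.item()}")

# broadcast: rank 0's value overwrites the tensor on every other rank.
msg = torch.tensor([42.0]) if rank == 0 else torch.zeros(1)
dist.broadcast(msg, src=0)
print(f"rank {rank}: received {msg.item()} from rank 0")
```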
Errors & Debugging Tips
Common Errors:
- “Address already in use” → Fix: Set a different port or ensure proper cleanup.
- NCCL errors → Occur if GPUs are not properly synchronized.
- Deadlocks → Caused by mismatched `barrier()` calls.
Debugging Tips:
✔ Use `torch.distributed.is_initialized()` to verify setup (see the sketch after this list).
✔ Check environment variables (`RANK`, `WORLD_SIZE`).
✔ Start with `backend='gloo'` for CPU debugging before switching to NCCL.
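A small sanity check along the lines of the first two tips (the variables printed are the standard ones consumed by the `env://` init method; `LOCAL_RANK` is set by launchers such as `torchrun`):

```python
import os

import torch.distributed as dist

# Print the rendezvous-related environment variables each rank sees.
for var in ('RANK', 'WORLD_SIZE', 'MASTER_ADDR', 'MASTER_PORT', 'LOCAL_RANK'):
    print(f"{var} = {os.environ.get(var, '<not set>')}")

# Confirm the process group is actually up before any collective call.
if dist.is_initialized():
    print(f"Initialized: rank {dist.get_rank()} of {dist.get_world_size()}")
else:
    print("torch.distributed is not initialized yet")
```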
✅ People Also Ask (FAQ)
1. What is PyTorch Distributed?
PyTorch Distributed is a module for parallel training across multiple GPUs/machines, supporting backends like NCCL and Gloo.
2. Is PyTorch a Frontend or Backend?
PyTorch is a frontend (deep learning framework), while backends like NCCL and Gloo handle communication.
3. NCCL vs. Gloo: What’s the Difference?
| Backend | Best For | Key Features |
|---|---|---|
| NCCL | Multi-GPU (NVIDIA) | Optimized for GPU-to-GPU communication. |
| Gloo | CPU & multi-node | Works on CPUs and supports basic collective ops. |
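In practice the backend can be chosen at runtime. A minimal sketch, assuming `RANK` and `WORLD_SIZE` are supplied by the launcher:

```python
import torch
import torch.distributed as dist

# Prefer NCCL when CUDA GPUs are available, otherwise fall back to Gloo.
backend = 'nccl' if torch.cuda.is_available() else 'gloo'
dist.init_process_group(backend=backend, init_method='env://')
print(f"Using backend: {dist.get_backend()}")
```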
4. What Does torch.distributed.barrier() Do?
It blocks each process until every process has reached the barrier, ensuring synchronization before proceeding.
Conclusion
`torch.distributed` is essential for scaling deep learning models. By mastering initialization, synchronization, and debugging, you can efficiently train models on multiple GPUs or nodes.