What is torch.distributed?
torch.distributed is PyTorch’s built-in module for distributed training, enabling parallel processing across multiple GPUs or machines. It supports multiple communication backends (such as NCCL and Gloo) and provides primitives for gradient synchronization, data parallelism, and multi-node training.
Key Features:
- Multi-GPU & Multi-Node Training – Scale training across multiple devices.
- Communication Backends – Supports NCCL (optimized for NVIDIA GPUs) and Gloo (CPU-focused).
- Collective Operations – Includes all_reduce, broadcast, and barrier for synchronization.
Code Examples
1. Initializing Distributed Training
import os

import torch
import torch.distributed as dist

def setup(backend='gloo'):
    # RANK and WORLD_SIZE are read from the environment (init_method='env://').
    dist.init_process_group(
        backend=backend,
        init_method='env://',
        world_size=int(os.environ['WORLD_SIZE']),
        rank=int(os.environ['RANK'])
    )

# Example usage:
if __name__ == "__main__":
    setup(backend='nccl')  # Use 'gloo' for CPU training
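Because setup() reads RANK and WORLD_SIZE (and, for env:// initialization, MASTER_ADDR and MASTER_PORT) from the environment, it is normally launched with a tool such as torchrun, which sets those variables for every process. A minimal cleanup counterpart and an example launch command might look like this sketch (train.py is just a placeholder script name):

# Launch example (torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT):
#   torchrun --nproc_per_node=2 train.py
import torch.distributed as dist

def cleanup():
    # Tear down the process group once training is finished.
    dist.destroy_process_group()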
2. Data Parallelism with DistributedDataParallel (DDP)
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 10).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])
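Once the model is wrapped, a training step looks like ordinary PyTorch; DDP averages gradients across processes during the backward pass. The optimizer, loss, and dummy batch below are illustrative assumptions, not part of the original snippet:

import torch

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(32, 10).cuda()    # dummy batch for illustration
targets = torch.randn(32, 10).cuda()

optimizer.zero_grad()
loss = nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()    # DDP synchronizes (averages) gradients across processes here
optimizer.step()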
3. Synchronization with dist.barrier()
dist.barrier() # Waits for all processes to reach this point
print("All processes synchronized!")
Common Methods in torch.distributed
| Method | Description |
|---|---|
| init_process_group() | Initializes the distributed backend. |
| all_reduce(tensor, op) | Aggregates tensors across all processes. |
| broadcast(tensor, src) | Sends a tensor from src to all other processes. |
| barrier() | Synchronizes all processes. |
| is_initialized() | Checks if distributed training is set up. |
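As a rough sketch of how the collective methods fit together (assuming the process group is already initialized; CPU tensors work with Gloo, while NCCL requires CUDA tensors):

import torch
import torch.distributed as dist

rank = dist.get_rank()

# all_reduce: every rank ends up with the sum of all ranks' tensors.
t = torch.tensor([float(rank)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)

# broadcast: copy rank 0's tensor to every other rank.
flag = torch.tensor([42.0]) if rank == 0 else torch.zeros(1)
dist.broadcast(flag, src=0)

dist.barrier()  # ensure all ranks reach this point before continuing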
Errors & Debugging Tips
Common Errors:
- “Address already in use” → Fix: set a different MASTER_PORT or make sure the previous run’s process group was shut down cleanly.
- NCCL errors → Often occur when ranks are mapped to the wrong GPU or the process group is misconfigured.
- Deadlocks → Caused by mismatched barrier() calls, i.e. some ranks reach the barrier while others never do.
Debugging Tips:
✔ Use torch.distributed.is_initialized() to verify setup.
✔ Check environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); a quick sanity-check sketch follows this list.
✔ Start with backend='gloo' for CPU debugging before switching to NCCL.
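A small per-rank sanity check along these lines can catch most setup mistakes:

import os
import torch.distributed as dist

print("initialized:", dist.is_initialized())
print("RANK:", os.environ.get("RANK"), "WORLD_SIZE:", os.environ.get("WORLD_SIZE"))
if dist.is_initialized():
    print("backend:", dist.get_backend(),
          "rank:", dist.get_rank(), "of", dist.get_world_size())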
✅ People Also Ask (FAQ)
1. What is PyTorch Distributed?
PyTorch Distributed is a module for parallel training across multiple GPUs/machines, supporting backends like NCCL and Gloo.
2. Is PyTorch a Frontend or Backend?
PyTorch is a frontend (deep learning framework), while backends like NCCL and Gloo handle communication.
3. NCCL vs. Gloo: What’s the Difference?
| Backend | Best For | Key Features |
|---|---|---|
| NCCL | Multi-GPU (NVIDIA) | Optimized for GPU-to-GPU communication. |
| Gloo | CPU & multi-node | Works on CPUs and supports basic collective ops. |
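In practice the backend is often picked at runtime based on whether CUDA is available, along the lines of this sketch (reusing the setup() helper from the initialization example):

import torch

backend = 'nccl' if torch.cuda.is_available() else 'gloo'
setup(backend=backend)  # setup() as defined earlier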
4. What Does torch.distributed.barrier() Do?
It blocks all processes until every one reaches the barrier, ensuring synchronization before proceeding.
Conclusion
torch.distributed is essential for scaling deep learning models. By mastering initialization, synchronization, and debugging, you can efficiently train models on multiple GPUs or nodes.