What is `torch.distributed.tensor`?
`torch.distributed.tensor` (also known as Distributed Tensor) is a PyTorch feature that enables efficient tensor operations across multiple GPUs or machines. It allows large tensors to be split (sharded) and processed in parallel, optimizing memory usage and computation speed in distributed training setups.
Key Features:
- Sharded Tensors – Splits a tensor across devices for parallel processing.
- Collective Operations – Supports distributed `matmul`, `sum`, `all_reduce`, etc.
- Compatibility – Works with PyTorch’s `DistributedDataParallel` (DDP) and `FSDP` (Fully Sharded Data Parallel).
Code Examples
1. Creating a Distributed Tensor
```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Initialize distributed environment
dist.init_process_group(backend="nccl")

# Create a device mesh (logical view of devices)
device_mesh = DeviceMesh("cuda", torch.arange(dist.get_world_size()))

# Define a global tensor and shard it across devices
global_tensor = torch.randn(4, 4)
sharded_tensor = distribute_tensor(
    global_tensor,
    device_mesh,
    placements=[Shard(0)]  # Shard along the first dimension
)

print(f"Sharded Tensor: {sharded_tensor}")
```
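Each rank stores only its own slice of the 4×4 tensor. As a quick sanity check, here is a minimal follow-up sketch reusing the names defined above (and assuming the script is launched with one process per GPU, e.g. via `torchrun`):

```python
# to_local() returns this rank's shard as a plain torch.Tensor, with no communication.
local_shard = sharded_tensor.to_local()
print(f"[rank {dist.get_rank()}] local shard shape: {tuple(local_shard.shape)}")
```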
2. Reshaping & Redistributing Tensors
```python
# Resharding: change the distribution strategy
resharded_tensor = sharded_tensor.redistribute(
    device_mesh,
    placements=[Shard(1)]  # Now shard along the second dimension
)

print(f"Resharded Tensor: {resharded_tensor}")
```
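Redistribution moves data between ranks but leaves the logical tensor unchanged. A small hedged check of that invariant, reusing the tensors from the examples above:

```python
# Both DTensors describe the same global 4x4 tensor, just laid out differently;
# full_tensor() gathers the shards into a regular torch.Tensor on every rank.
assert torch.allclose(sharded_tensor.full_tensor(), resharded_tensor.full_tensor())
print(f"[rank {dist.get_rank()}] placements: {sharded_tensor.placements} -> {resharded_tensor.placements}")
```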
3. Distributed Matrix Multiplication
```python
# Create another sharded tensor
tensor2 = distribute_tensor(
    torch.randn(4, 4),
    device_mesh,
    placements=[Shard(1)]
)

# Distributed matmul
result = torch.matmul(sharded_tensor, tensor2)
print(f"Distributed Matmul Result: {result}")
```
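The matmul output is itself a distributed tensor. A short sketch (same assumptions as above) of the two usual ways to materialize it, depending on whether you need the global result or only this rank’s portion:

```python
print(f"Result placements: {result.placements}")  # how the output ended up distributed

full_result = result.full_tensor()  # gathers shards: identical 4x4 tensor on every rank
local_result = result.to_local()    # no communication: only this rank's piece of the result
print(f"[rank {dist.get_rank()}] full {tuple(full_result.shape)}, local {tuple(local_result.shape)}")
```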
Common Methods
| Method | Description |
| --- | --- |
| `distribute_tensor()` | Splits a tensor across devices according to the given placements. |
| `redistribute()` | Changes the sharding strategy of an existing distributed tensor. |
| `full_tensor()` | Gathers all shards back into a single global tensor. |
| `all_reduce()` | Aggregates values across processes (a `torch.distributed` collective). |
| `to_local()` | Retrieves the local shard held by the current rank. |
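Note that `all_reduce()` in the table is a collective from `torch.distributed` that operates on regular (local) tensors rather than a method on the distributed tensor itself. A minimal hedged sketch, reusing the process group initialized in the examples above and assuming one GPU per rank:

```python
# Sum a per-rank value across all processes; every rank receives the total.
dev = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
value = torch.tensor([float(dist.get_rank())], device=dev)
dist.all_reduce(value, op=dist.ReduceOp.SUM)  # value is now 0 + 1 + ... + (world_size - 1)
print(f"[rank {dist.get_rank()}] sum of ranks: {value.item()}")
```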
Errors & Debugging Tips
Common Errors:
- “Sharding dimension mismatch” → Ensure tensors are sharded along compatible axes.
- Deadlocks in collective ops → Check for mismatched `barrier()` calls.
- NCCL errors → Verify GPU connectivity and backend initialization.
Debugging Tips:
✔ Use `to_local()` to inspect shard values.
✔ Start with `backend='gloo'` for CPU debugging.
✔ Check the `device_mesh` setup – ensure ranks match available devices.
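Putting the gloo tip into practice, here is a minimal CPU-only sketch (the file name and `--nproc_per_node` count are placeholders) for debugging sharding logic without any GPUs:

```python
# Run with: torchrun --nproc_per_node=2 debug_dtensor.py
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group(backend="gloo")                        # gloo works on CPU-only machines
mesh = DeviceMesh("cpu", torch.arange(dist.get_world_size()))  # mesh ranks must match launched processes

dt = distribute_tensor(torch.arange(16.0).reshape(4, 4), mesh, placements=[Shard(0)])
print(f"[rank {dist.get_rank()}] local shard:\n{dt.to_local()}")

dist.destroy_process_group()
```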
✅ People Also Ask (FAQ)
1. What is a Distributed Tensor in PyTorch?
A Distributed Tensor is a tensor split across multiple devices (GPUs/machines) for parallel computation.
2. How is `torch.distributed.tensor` different from DDP?
- DDP replicates the entire model on each GPU.
- Distributed Tensor splits individual tensors across devices for memory efficiency (see the sketch below).
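As a rough illustration of the memory difference (a hedged sketch with a hypothetical 4096×4096 weight, reusing the `device_mesh` from the examples above): under DDP every rank would hold the full matrix, while a sharded distributed tensor stores only a slice per rank.

```python
weight = torch.randn(4096, 4096)
sharded_weight = distribute_tensor(weight, device_mesh, placements=[Shard(0)])
# Each rank now stores only ~4096 / world_size rows instead of the whole matrix.
print(f"[rank {dist.get_rank()}] local weight shape: {tuple(sharded_weight.to_local().shape)}")
```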
3. Can I use Distributed Tensors with FSDP?
Yes! Fully Sharded Data Parallel (FSDP) uses similar sharding principles.
4. What backends support Distributed Tensors?
- NCCL (best for GPU)
- Gloo (CPU-friendly)
5. How do I convert a sharded tensor back to normal?
Use `full_tensor()` to gather every shard into one global tensor, or `to_local()` to work with just the shard held by the current rank.
Conclusion
`torch.distributed.tensor` unlocks scalable deep learning by optimizing memory and computation across devices. Mastering sharding, redistribution, and collective ops is key to efficient distributed training.