What is torch.distributed.tensor?

torch.distributed.tensor (also known as Distributed Tensor) is a PyTorch feature that enables efficient tensor operations across multiple GPUs or machines. It allows large tensors to be split (sharded) and processed in parallel, optimizing memory usage and computation speed in distributed training setups.

Key Features:

  • Sharded Tensors – Splits a tensor across devices for parallel processing (see the placement sketch after this list).
  • Collective Operations – Supports distributed matmul, sum, all_reduce, and other collective operations.
  • Compatibility – Works with PyTorch’s DistributedDataParallel (DDP) and FSDP (Fully Sharded Data Parallel).
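To make the first two bullets concrete, here is a minimal placement sketch, assuming one process per GPU (launched with torchrun) and a 1-D device mesh over all ranks: Shard(0) splits rows across ranks, while Replicate() keeps a full copy on every rank. Replicate is not used in the numbered examples below.

import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, Replicate, distribute_tensor

# Assumes one process per GPU, e.g. launched with torchrun --nproc_per_node=<num_gpus>
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = DeviceMesh("cuda", torch.arange(dist.get_world_size()))

weight = torch.randn(8, 8)
row_sharded = distribute_tensor(weight, mesh, placements=[Shard(0)])    # each rank stores 8 / world_size rows
replicated = distribute_tensor(weight, mesh, placements=[Replicate()])  # each rank stores the full 8 x 8 tensor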

Code Examples

1. Creating a Distributed Tensor

import torch
import torch.distributed as dist
# Note: newer PyTorch releases also expose this API under the public
# torch.distributed.tensor module; the private path below works on older versions.
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Initialize the distributed environment (assumes launch via torchrun,
# which sets RANK/WORLD_SIZE/MASTER_ADDR for each process)
dist.init_process_group(backend="nccl")

# Bind each process to its own GPU (required for the NCCL backend)
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Create a device mesh (a logical view of all participating devices)
device_mesh = DeviceMesh("cuda", torch.arange(dist.get_world_size()))

# Define a global tensor and shard it across devices
global_tensor = torch.randn(4, 4)
sharded_tensor = distribute_tensor(
    global_tensor,
    device_mesh,
    placements=[Shard(0)]  # Shard along the first dimension
)

print(f"Sharded Tensor: {sharded_tensor}")

2. Redistributing (Resharding) Tensors

# Resharding: change the distribution strategy (this moves data between ranks)
resharded_tensor = sharded_tensor.redistribute(
    device_mesh,
    placements=[Shard(1)]  # Now shard along the second dimension
)

print(f"Resharded Tensor: {resharded_tensor}")

3. Distributed Matrix Multiplication

# Create another sharded tensor
tensor2 = distribute_tensor(
    torch.randn(4, 4),
    device_mesh,
    placements=[Shard(1)]
)

# Distributed matmul: DTensor propagates the sharding and inserts any
# communication needed to produce a valid result
result = torch.matmul(sharded_tensor, tensor2)
print(f"Distributed Matmul Result: {result}")

Common Methods

  • distribute_tensor() – Splits a tensor across devices according to the given placements.
  • redistribute() – Changes the sharding strategy of an existing distributed tensor.
  • full_tensor() – Gathers all shards and returns the full global tensor.
  • all_reduce() – Aggregates values across processes (a torch.distributed collective).
  • to_local() – Retrieves the local shard held by the current rank (see the sketch after this list).
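To illustrate the last three entries, here is a short sketch that continues from Example 1 (it assumes the initialized process group plus the sharded_tensor and device_mesh defined there):

from torch.distributed._tensor import Replicate

# The local shard owned by this rank: a plain torch.Tensor
local_shard = sharded_tensor.to_local()
print(f"Rank {dist.get_rank()} local shard shape: {local_shard.shape}")

# Gather every shard back into the full 4 x 4 tensor on each rank (this communicates)
full = sharded_tensor.full_tensor()
print(f"Full tensor shape: {full.shape}")

# The same idea expressed as a redistribute to a replicated placement
replicated = sharded_tensor.redistribute(device_mesh, placements=[Replicate()])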

Errors & Debugging Tips

Common Errors:

  1. “Sharding dimension mismatch” → Ensure tensors are sharded along compatible axes.
  2. Deadlocks in collective ops → Check for mismatched barrier() calls.
  3. NCCL errors → Verify GPU connectivity and backend initialization.

Debugging Tips:

✔ Use to_local() to inspect shard values.
✔ Start with backend='gloo' for CPU debugging (see the sketch below).
✔ Check device_mesh setup – Ensure ranks match available devices.
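For the second tip, here is a minimal CPU-only script you can adapt for debugging (a sketch assuming a hypothetical file debug_dtensor.py launched with torchrun --nproc_per_node=2 debug_dtensor.py); it simply swaps the nccl backend for gloo and "cuda" for "cpu":

import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group(backend="gloo")  # gloo runs on CPU, no GPUs required
mesh = DeviceMesh("cpu", torch.arange(dist.get_world_size()))

dt = distribute_tensor(torch.randn(4, 4), mesh, placements=[Shard(0)])
print(f"Rank {dist.get_rank()}: local shard =\n{dt.to_local()}")

dist.destroy_process_group()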


✅ People Also Ask (FAQ)

1. What is a Distributed Tensor in PyTorch?

Distributed Tensor is a tensor split across multiple devices (GPUs/machines) for parallel computation.

2. How is torch.distributed.tensor different from DDP?

  • DDP replicates the entire model on each GPU.
  • Distributed Tensor splits individual tensors across devices for memory efficiency (see the sketch below).
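A rough sketch of that difference (the Linear layer is just a stand-in module; the sketch assumes an initialized process group and one GPU per rank):

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# DDP: every rank keeps a full replica of the model; gradients are all-reduced
model = nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model)

# Distributed Tensor: one large tensor is split across ranks, so no single
# rank ever has to hold the whole thing
mesh = DeviceMesh("cuda", torch.arange(dist.get_world_size()))
big_weight = distribute_tensor(torch.randn(8192, 8192), mesh, placements=[Shard(0)])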

3. Can I use Distributed Tensors with FSDP?

Yes! Fully Sharded Data Parallel (FSDP) uses similar sharding principles.
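For context, wrapping a model with FSDP looks roughly like this (a minimal sketch, assuming an initialized NCCL process group and one GPU per rank):

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
fsdp_model = FSDP(model)  # parameters are sharded across ranks instead of replicated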

4. What backends support Distributed Tensors?

  • NCCL (best for GPU)
  • Gloo (CPU-friendly)

5. How do I convert a sharded tensor back to normal?

Use full_tensor() to reassemble the global tensor, or to_local() to get just the shard held by the current rank.


Conclusion

torch.distributed.tensor unlocks scalable deep learning by optimizing memory and computation across devices. Mastering sharding, redistribution, and collective ops is key to efficient distributed training.
