
Introduction: What Is torch.distributed.tensor.parallel?

torch.distributed.tensor.parallel is a module in PyTorch that provides tools to implement tensor parallelism—a technique used to split large model tensors (e.g., weights) across multiple GPUs. Unlike data parallelism, where each GPU holds a copy of the model and processes different inputs, tensor parallelism divides the model itself across devices to handle larger model sizes and speed up computation.

Tensor parallelism is especially useful in training large-scale transformer models like GPT and BERT, where matrix multiplications in layers (e.g., attention heads, MLPs) can be distributed across devices.
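
To make the idea concrete, here is a tiny single-process sketch (illustrative only, with made-up shapes) showing how splitting a weight matrix column-wise across two hypothetical devices reproduces the full matrix multiplication:

import torch

# Full weight of a linear layer: y = x @ W
x = torch.randn(4, 8)      # batch of 4, in_features = 8
W = torch.randn(8, 6)      # out_features = 6

# Column-wise split across 2 hypothetical devices:
# each shard holds half of the output columns.
W_shard0, W_shard1 = W.chunk(2, dim=1)

y_shard0 = x @ W_shard0    # would run on GPU 0
y_shard1 = x @ W_shard1    # would run on GPU 1

# Concatenating the partial outputs matches the unsharded computation.
y_full = torch.cat([y_shard0, y_shard1], dim=1)
assert torch.allclose(y_full, x @ W)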


🚀 Code Examples: Creating and Reshaping Tensors with Tensor Parallelism

Here’s a basic setup for using tensor parallelism with PyTorch’s distributed framework:

🛠️ Step 1: Initialize the Distributed Environment

import torch
import torch.distributed as dist
from torch.distributed.tensor.parallel import parallelize_module

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(dist.get_rank())

💡 Make sure to set environment variables like RANK, WORLD_SIZE, and MASTER_ADDR before running the script.

🧠 Step 2: Define a Simple Model

import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

🔄 Step 3: Apply Tensor Parallelism

from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

model = SimpleMLP(hidden_size=1024).cuda()

# One-dimensional device mesh spanning all ranks in the process group
tp_mesh = init_device_mesh("cuda", (dist.get_world_size(),))

tp_model = parallelize_module(
    model,
    device_mesh=tp_mesh,
    parallelize_plan={
        "fc1": ColwiseParallel(),
        "fc2": RowwiseParallel(),
    },
)

Now, fc1 and fc2 are sharded across GPUs according to column-wise and row-wise splitting strategies.
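
From here the sharded model is used like any ordinary nn.Module. A minimal usage sketch (the batch size and the sum-based loss are arbitrary choices for illustration):

x = torch.randn(32, 1024, device=torch.cuda.current_device())  # same replicated input on every rank
out = tp_model(x)    # fc1 runs column-sharded; fc2's row-sharded output is combined across GPUs
loss = out.sum()
loss.backward()      # gradients flow through the sharded parameters as usual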


🧰 Common Methods in torch.distributed.tensor.parallel

  • ColwiseParallel(): Splits a linear layer's weight column-wise across devices.
  • RowwiseParallel(): Splits a linear layer's weight row-wise across devices.
  • parallelize_module(): The main API to apply parallel styles to submodules according to a parallelize_plan.
  • dist.gather() / dist.scatter(): Plain torch.distributed collectives (not part of tensor.parallel itself) used to collect outputs on one rank or distribute tensors across ranks, as shown in the sketch below.
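
For reference, here is a hedged sketch of those two collectives (the tensor shapes and the choice of rank 0 as source/destination are arbitrary):

import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.cuda.current_device()

# scatter: rank 0 hands one chunk of data to every rank
recv = torch.empty(4, device=device)
chunks = [torch.full((4,), float(r), device=device) for r in range(world_size)] if rank == 0 else None
dist.scatter(recv, scatter_list=chunks, src=0)

# gather: every rank sends its tensor back to rank 0
out = [torch.empty(4, device=device) for _ in range(world_size)] if rank == 0 else None
dist.gather(recv, gather_list=out, dst=0)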

❗ Common Errors and Debugging Tips

🔧 1. Mismatch in Tensor Shapes

Error: RuntimeError: mat1 and mat2 shapes cannot be multiplied

Fix: Ensure that the input and output dimensions still align after sharding. In a two-layer block, pair ColwiseParallel on the first linear layer with RowwiseParallel on the second so the intermediate activation stays sharded and the final output is recombined correctly.


🔧 2. CUDA Device Not Set

Error: Expected all tensors to be on the same device

Fix: Use torch.cuda.set_device(rank) to ensure each process works on the correct GPU.


🔧 3. Inconsistent Parallel Plan

Error: KeyError: 'fc1' not found in model

Fix: Make sure your module names match exactly when defining the parallelize_plan.
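
A quick way to see the exact submodule names available as keys is to list them:

for name, module in model.named_modules():
    print(name, "->", type(module).__name__)
# For the SimpleMLP above, the usable keys are "fc1" and "fc2"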


🔧 4. Process Group Issues

Error: RuntimeError: NCCL error

Fix: Ensure you have the correct environment variables for the distributed setup (or launch with torchrun, which sets them for you):

export RANK=0
export WORLD_SIZE=2
export MASTER_ADDR=localhost
export MASTER_PORT=12355

❓ People Also Ask (FAQ)

✅ Is FSDP Tensor Parallelism?

No, FSDP (Fully Sharded Data Parallel) and Tensor Parallelism are different.

  • FSDP shards entire model layers (parameters, gradients, and optimizer states) across devices, focusing on memory efficiency.
  • Tensor Parallelism splits large individual tensors (e.g., weight matrices) across devices to allow larger models to be trained efficiently.

They can be combined in advanced training setups to get the best of both worlds.


✅ How Does PyTorch Distributed Data Parallel Work?

DistributedDataParallel (DDP) replicates the model on each GPU and trains it with a different mini-batch of data. Gradients are averaged across GPUs at each step to ensure consistency.

  • Simple to use: wrap the model with DDP(model) (see the sketch after this list)
  • Works well for data-parallel tasks
  • Does not inherently scale model size like tensor parallelism does
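
For comparison, a minimal DDP setup looks like this (assuming the process group from Step 1 and the SimpleMLP class from Step 2):

from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()
model = SimpleMLP(hidden_size=1024).cuda(rank)
ddp_model = DDP(model, device_ids=[rank])  # full replica per GPU; gradients are averaged automatically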

✅ How to Implement Tensor Parallelism in PyTorch?

Here’s a step-by-step summary:

  1. Initialize the distributed environment: dist.init_process_group(backend="nccl")
  2. Define your model and move it to the correct device.
  3. Apply tensor parallelism using parallelize_module(): parallelize_module(model, device_mesh, parallelize_plan={...})
  4. Train as normal, but remember that your model is now split across GPUs (see the end-to-end sketch below).
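
Putting the steps together, here is a hedged end-to-end sketch (it reuses the SimpleMLP class from Step 2; the optimizer, learning rate, and dummy data are placeholders chosen for illustration):

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# 1D device mesh over all ranks, used for tensor parallelism
tp_mesh = init_device_mesh("cuda", (dist.get_world_size(),))

model = SimpleMLP(hidden_size=1024).cuda()
tp_model = parallelize_module(model, tp_mesh, {"fc1": ColwiseParallel(), "fc2": RowwiseParallel()})

optimizer = torch.optim.SGD(tp_model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")  # dummy batch, replicated on every rank
    loss = tp_model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

dist.destroy_process_group()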

📌 Final Thoughts

torch.distributed.tensor.parallel is a powerful tool for scaling up deep learning models in PyTorch. It provides fine-grained control over how model weights are distributed across devices, making it ideal for large model training where memory or compute limits would otherwise be a bottleneck.

Whether you’re working on transformer models or other architectures with large weight matrices, learning tensor parallelism gives you an edge in performance and scalability.
