
🧠 Introduction: What Is torch.distributed.checkpoint?

In large-scale distributed training, saving and restoring model state is not as simple as calling torch.save() and torch.load(). When training across many GPUs or nodes, the traditional approach of gathering the full state onto one rank and writing a single file becomes slow and memory-intensive, and it can fail outright due to I/O bottlenecks or inconsistent state across ranks.

That’s where torch.distributed.checkpoint comes in.

torch.distributed.checkpoint is a PyTorch module designed to efficiently save and load model checkpoints during distributed training. It supports sharded checkpoints, async I/O, tensor parallelism, and checkpointing across multiple ranks—making it ideal for large model training.

This module plays a crucial role in fault tolerance and resuming long-running training jobs.


🛠️ Code Examples: Saving and Loading Checkpoints with torch.distributed.checkpoint

Let’s go through a practical example using torch.distributed.checkpoint.

Step 1: Define a Distributed Environment

import os

import torch
import torch.distributed as dist

# Expects torchrun-style environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK).
dist.init_process_group(backend="nccl", init_method="env://")

# Bind each process to its local GPU (LOCAL_RANK, not the global rank, on multi-node jobs).
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

Step 2: Define a Model and Optimizer

import torch.nn as nn

model = nn.Linear(512, 256).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Step 3: Saving the Checkpoint with torch.distributed.checkpoint

from torch.distributed.checkpoint import FileSystemWriter, save_state_dict

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": torch.tensor([1000]),  # example global step
}

# storage_writer must be a StorageWriter instance; FileSystemWriter writes
# per-rank shard files plus a .metadata file into the directory.
save_state_dict(state_dict, storage_writer=FileSystemWriter("checkpoint_dir/"))

Step 4: Loading the Checkpoint

from torch.distributed.checkpoint import FileSystemReader, load_state_dict

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": torch.tensor([0]),  # placeholder that will be overwritten in place
}

# storage_reader must be a StorageReader instance; values are loaded in place
# into the tensors already present in state_dict.
load_state_dict(state_dict, storage_reader=FileSystemReader("checkpoint_dir/"))

✅ Note: This assumes checkpoint_dir/ already contains a checkpoint written by save_state_dict; DCP stores a sharded layout with per-rank shard files plus a .metadata file.


📚 Common Methods in torch.distributed.checkpoint

Method / Function and what it does:

  • save_state_dict(state_dict, ...): saves a model/optimizer state dict as a sharded, distributed checkpoint.
  • load_state_dict(state_dict, ...): loads a distributed checkpoint in place into an existing state dict.
  • FileSystemWriter, FileSystemReader: built-in storage backends for writing/reading checkpoint shards on a local or network file system.
  • StorageWriter, StorageReader: abstract I/O interfaces you can subclass for custom backends such as cloud buckets.
  • StateDictType (from torch.distributed.fsdp): lets you request FSDP-aware state dicts such as SHARDED_STATE_DICT.
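
If you are on a recent PyTorch release (roughly 2.2 or later; worth checking against your installed version), the module also ships simpler save/load entry points that accept a checkpoint_id path instead of an explicit writer/reader. A minimal sketch:

import torch.distributed.checkpoint as dcp

# Convenience wrappers around save_state_dict/load_state_dict: the
# checkpoint_id path replaces an explicit FileSystemWriter/FileSystemReader.
dcp.save(state_dict, checkpoint_id="checkpoint_dir/")
dcp.load(state_dict, checkpoint_id="checkpoint_dir/")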

🧩 Optional: Use with Fully Sharded Data Parallel (FSDP)

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType
from torch.distributed.checkpoint import FileSystemWriter, save_state_dict

fsdp_model = FSDP(model)

# Request a sharded state dict so each rank only materializes its own shards.
with FSDP.state_dict_type(fsdp_model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": fsdp_model.state_dict()}

save_state_dict(state_dict, storage_writer=FileSystemWriter("checkpoint_dir/"))

FSDP integration keeps checkpointing memory-efficient: each rank writes only its own shards instead of gathering the full model onto a single device, which matters most for massive models.
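
For completeness, here is a hedged sketch of the load side for the same sharded FSDP checkpoint (it assumes the job restarts with the same world size that wrote checkpoint_dir/):

from torch.distributed.checkpoint import FileSystemReader, load_state_dict

with FSDP.state_dict_type(fsdp_model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": fsdp_model.state_dict()}  # sharded placeholders to fill
    load_state_dict(state_dict, storage_reader=FileSystemReader("checkpoint_dir/"))
    fsdp_model.load_state_dict(state_dict["model"])  # push restored shards back into FSDP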


❗ Errors and Debugging Tips

❌ Error 1: RuntimeError: No storage writer provided

Fix: storage_writer must be a StorageWriter instance rather than a bare path string, e.g. a FileSystemWriter pointing at a directory:

from torch.distributed.checkpoint import FileSystemWriter

save_state_dict(..., storage_writer=FileSystemWriter("my_checkpoints/"))

❌ Error 2: Mismatch in model/optimizer keys

Cause: The keys in your saved checkpoint don’t match the current model.

Fix: Make sure both your training and loading scripts use the same model architecture.


❌ Error 3: Distributed process group not initialized

Fix: Always ensure your distributed process group is initialized before saving or loading checkpoints.

dist.init_process_group(backend='nccl', init_method='env://')

❌ Error 4: File not found or corrupt

Fix: Check the path to checkpoint_dir/ and make sure the .metadata file and the per-rank shard files are present and intact. Note that a DCP checkpoint is a directory, not a single torch.save file, so torch.load() will not open it directly.
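
If you need to inspect a suspect checkpoint offline, recent releases (roughly PyTorch 2.2+, an assumption worth verifying) include a conversion helper; checkpoint_full.pt below is just an illustrative output path:

import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Convert the sharded DCP checkpoint into a single torch.save file, then open
# it with plain torch.load for inspection.
dcp_to_torch_save("checkpoint_dir/", "checkpoint_full.pt")
print(torch.load("checkpoint_full.pt").keys())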


✅ People Also Ask (FAQ)

🔹 What Is torch.distributed.checkpoint?

It’s a PyTorch module used for saving and loading checkpoints in distributed training environments. It handles multi-GPU and multi-node checkpointing efficiently with support for sharding and async I/O.
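
On the async side, newer releases (roughly PyTorch 2.3+; treat the exact version as an assumption) expose dcp.async_save, which hands the write off to the background and returns a future:

import torch.distributed.checkpoint as dcp

# Kick off the checkpoint write and keep training; wait on the future before
# starting the next save or shutting the job down.
future = dcp.async_save(state_dict, checkpoint_id="checkpoint_dir/")
# ... training continues ...
future.result()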


🔹 Can I Use torch.distributed.checkpoint Without FSDP?

Yes! While it’s commonly used with FSDP for large models, it can be used standalone with nn.Module or DDP (DistributedDataParallel).
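
A minimal sketch of that standalone path with DDP (it reuses the process group and model from the steps above; checkpoint_dir_ddp/ is just an illustrative path):

from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.checkpoint import FileSystemWriter, save_state_dict

ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

save_state_dict(
    {"model": ddp_model.module.state_dict()},  # unwrap to avoid the "module." key prefix
    storage_writer=FileSystemWriter("checkpoint_dir_ddp/"),
)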


🔹 What’s the Difference Between torch.save and torch.distributed.checkpoint.save_state_dict?

  • torch.save: Saves a full model or optimizer to disk in a single file.
  • torch.distributed.checkpoint.save_state_dict: Saves checkpoints in a sharded, distributed format suited to large-scale, multi-GPU training (see the sketch below).
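
A compact sketch of the contrast (both paths, single_file.pt and sharded_dir/, are purely illustrative):

import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

# torch.save: one process serializes the whole state into one file.
torch.save({"model": model.state_dict()}, "single_file.pt")

# save_state_dict: every rank participates and writes its own shard(s) plus
# shared metadata into a directory.
dcp.save_state_dict({"model": model.state_dict()},
                    storage_writer=FileSystemWriter("sharded_dir/"))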

🔹 Does torch.distributed.checkpoint Support Cloud Storage?

Yes. The StorageWriter and StorageReader interfaces are designed to be subclassed, so you can plug in backends for S3, Azure, or Google Cloud Storage (or simply mount the bucket and point FileSystemWriter/FileSystemReader at the mounted path).


🔹 How Can I Resume Training After a Crash?

Use load_state_dict() to restore the model and optimizer states, and reload any scalar values like the training step.

step = state_dict["step"].item()

Then continue training from step.
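
Putting it together, a minimal resume sketch (it reuses the model, optimizer, and checkpoint_dir/ from the steps above):

from torch.distributed.checkpoint import FileSystemReader, load_state_dict

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": torch.tensor([0]),  # placeholder overwritten in place by the load
}
load_state_dict(state_dict, storage_reader=FileSystemReader("checkpoint_dir/"))

# Push the restored values back into the live objects, then recover the step.
model.load_state_dict(state_dict["model"])
optimizer.load_state_dict(state_dict["optimizer"])
step = state_dict["step"].item()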


🎯 Final Thoughts

If you’re training large models across multiple GPUs or nodes, using torch.distributed.checkpoint is no longer optional—it’s essential. It enables:

  • Faster I/O through sharding
  • Lower memory consumption
  • Resilient checkpointing
  • Seamless integration with FSDP and PyTorch’s distributed ecosystem

Whether you’re building a massive transformer model or scaling a custom architecture to a GPU cluster, torch.distributed.checkpoint is the reliable and efficient way to handle checkpoints.
