🧠 Introduction: What Is torch.distributed.checkpoint?
In large-scale distributed training, saving and restoring model state is not as simple as calling torch.save() and torch.load(). When training across many GPUs or nodes, traditional checkpointing becomes slow, memory-intensive, and often fails due to inconsistency between ranks or I/O bottlenecks. That’s where torch.distributed.checkpoint comes in.
torch.distributed.checkpoint is a PyTorch module designed to efficiently save and load model checkpoints during distributed training. It supports sharded checkpoints, async I/O, tensor parallelism, and checkpointing across multiple ranks, making it ideal for large model training.
This module plays a crucial role in fault tolerance and resuming long-running training jobs.
🛠️ Code Examples: Saving and Loading Checkpoints with torch.distributed.checkpoint
Let’s go through a practical example using torch.distributed.checkpoint.
Step 1: Initialize the Distributed Environment
```python
import torch
import torch.distributed as dist

# init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE,
# which launchers such as torchrun set for each process
dist.init_process_group(backend="nccl", init_method="env://")
# Bind each process to a GPU (single-node setup; use the local rank on multi-node jobs)
torch.cuda.set_device(dist.get_rank())
```
Step 2: Define a Model and Optimizer
```python
import torch.nn as nn

# A toy model; in practice this would be your (possibly DDP/FSDP-wrapped) network
model = nn.Linear(512, 256).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```
Step 3: Saving the Checkpoint with torch.distributed.checkpoint
```python
from torch.distributed.checkpoint import FileSystemWriter, save_state_dict

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": torch.tensor([1000]),  # example global step
}

# storage_writer expects a StorageWriter instance; FileSystemWriter writes
# the sharded checkpoint files into the target directory
save_state_dict(state_dict, storage_writer=FileSystemWriter("checkpoint_dir/"))
```
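Recent PyTorch releases (roughly 2.2 and later, so check your version) also expose a save() convenience entry point that accepts a checkpoint_id path directly; the following is a hedged sketch of that variant:

```python
import torch.distributed.checkpoint as dcp

# Passing a filesystem path as checkpoint_id implies a FileSystemWriter
dcp.save(state_dict, checkpoint_id="checkpoint_dir/")
```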
Step 4: Loading the Checkpoint
```python
from torch.distributed.checkpoint import FileSystemReader, load_state_dict

# DCP loads "in place": pre-populate the dict with correctly shaped tensors,
# then load_state_dict overwrites their values from the checkpoint
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": torch.tensor([0]),  # placeholder that will be overwritten
}
load_state_dict(state_dict, storage_reader=FileSystemReader("checkpoint_dir/"))
# Push the restored values back into the live model and optimizer
model.load_state_dict(state_dict["model"])
optimizer.load_state_dict(state_dict["optimizer"])
```
✅ Note: This assumes you’ve prepared a sharded storage format in the folder
checkpoint_dir/
.
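The module also supports asynchronous checkpointing so the write overlaps with training. As a hedged sketch (async_save ships in recent PyTorch releases, roughly 2.3+, so verify it exists in your version):

```python
import torch.distributed.checkpoint as dcp

# Stage the state dict and write it in the background; training can continue
future = dcp.async_save(state_dict, checkpoint_id="checkpoint_dir/step_1000")

# ... run more training steps ...

future.result()  # block only when the write must be finished (e.g., before the next save)
```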
📚 Common Methods in torch.distributed.checkpoint
| Method / Function | Description |
|---|---|
| save_state_dict(state_dict, ...) | Saves a model/optimizer state dict as a distributed (sharded) checkpoint. |
| load_state_dict(state_dict, ...) | Loads a distributed checkpoint into a pre-populated state dict, in place. |
| FileSystemReader, FileSystemWriter | Built-in StorageReader/StorageWriter implementations for filesystem I/O. |
| StateDictType (from torch.distributed.fsdp) | Selects FSDP-aware state dict formats such as SHARDED_STATE_DICT. |
| StorageWriter, StorageReader | Abstract interfaces for plugging in custom storage backends (e.g., cloud buckets). |
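Tying the table together, a minimal round trip with the built-in filesystem backend looks like this (directory name illustrative, reusing the state_dict from the steps above):

```python
from torch.distributed.checkpoint import (
    FileSystemReader,
    FileSystemWriter,
    load_state_dict,
    save_state_dict,
)

# Write the sharded checkpoint, one directory per checkpoint
save_state_dict(state_dict, storage_writer=FileSystemWriter("checkpoint_dir/"))

# Later (or in another job), read it back into a pre-populated state dict
load_state_dict(state_dict, storage_reader=FileSystemReader("checkpoint_dir/"))
```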
🧩 Optional: Use with Fully Sharded Data Parallel (FSDP)
```python
from torch.distributed.checkpoint import FileSystemWriter, save_state_dict
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

fsdp_model = FSDP(model)

# Ask FSDP for a sharded state dict so each rank only materializes its own shards
with FSDP.state_dict_type(fsdp_model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": fsdp_model.state_dict()}

save_state_dict(state_dict, storage_writer=FileSystemWriter("checkpoint_dir/"))
```
FSDP integration allows efficient memory use during checkpointing, particularly for massive models.
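Loading follows the same pattern in reverse. A hedged sketch, assuming the same checkpoint_dir/ and the same FSDP wrapping at load time:

```python
from torch.distributed.checkpoint import FileSystemReader, load_state_dict
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

with FSDP.state_dict_type(fsdp_model, StateDictType.SHARDED_STATE_DICT):
    # Pre-populate with this rank's shards, load in place, then hand back to FSDP
    state_dict = {"model": fsdp_model.state_dict()}
    load_state_dict(state_dict, storage_reader=FileSystemReader("checkpoint_dir/"))
    fsdp_model.load_state_dict(state_dict["model"])
```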
❗ Errors and Debugging Tips
❌ Error 1: RuntimeError: No storage writer provided
Fix: Pass a valid StorageWriter instance, for example a FileSystemWriter pointed at a directory:

```python
from torch.distributed.checkpoint import FileSystemWriter

save_state_dict(..., storage_writer=FileSystemWriter("my_checkpoints/"))
```
❌ Error 2: Mismatch in model/optimizer keys
Cause: The keys in your saved checkpoint don’t match the current model.
Fix: Make sure your training and loading scripts build the same model architecture with the same wrapping, since wrappers like DDP change the key prefixes in the state dict.
❌ Error 3: Distributed process group not initialized
Fix: Always initialize the distributed process group before saving or loading checkpoints.

```python
dist.init_process_group(backend="nccl", init_method="env://")
```
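A small guard at the top of your checkpoint helpers avoids this class of error (ensure_dist_initialized is a hypothetical helper name):

```python
import torch.distributed as dist

def ensure_dist_initialized() -> None:
    # Hypothetical helper: only initialize if no process group exists yet
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", init_method="env://")
```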
❌ Error 4: File not found or corrupt
Fix: Check the path to checkpoint_dir/ and make sure the files are intact and visible to every rank (on multi-node jobs the directory generally needs to live on shared storage). Note that the sharded files are not plain torch.save() files, so torch.load() will not read them directly.
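If you want a single-file checkpoint you can inspect offline, recent PyTorch releases include a conversion helper in torch.distributed.checkpoint.format_utils; availability depends on your version, so treat this as a sketch:

```python
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Convert a sharded DCP checkpoint into a single torch.save() file
dcp_to_torch_save("checkpoint_dir/", "checkpoint_single.pt")
```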
✅ People Also Ask (FAQ)
🔹 What Is torch.distributed.checkpoint?
It’s a PyTorch module used for saving and loading checkpoints in distributed training environments. It handles multi-GPU and multi-node checkpointing efficiently with support for sharding and async I/O.
🔹 Can I Use torch.distributed.checkpoint Without FSDP?
Yes! While it’s commonly used with FSDP for large models, it can also be used standalone with a plain nn.Module or with DDP (DistributedDataParallel).
🔹 What’s the Difference Between torch.save and torch.distributed.checkpoint.save_state_dict?
- torch.save: Saves a full model or optimizer to disk in a single file.
- torch.distributed.checkpoint.save_state_dict: Saves checkpoints in a sharded, distributed format suitable for large-scale, multi-GPU training (see the sketch below).
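In code, the contrast looks roughly like this, reusing the model from the examples above (paths illustrative):

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

# Single-process, single-file checkpoint
torch.save(model.state_dict(), "model.pt")

# Distributed checkpoint: every rank writes its own shards into one directory
dcp.save_state_dict({"model": model.state_dict()},
                    storage_writer=FileSystemWriter("checkpoint_dir/"))
```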
🔹 Does torch.distributed.checkpoint Support Cloud Storage?
Yes. Via the StorageWriter and StorageReader interfaces you can write plugins for S3, Azure, or Google Cloud and save checkpoints directly to cloud storage.
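A low-effort alternative, sketched below under the assumption that the bucket is mounted into the local filesystem (e.g., via a FUSE mount at a hypothetical path like /mnt/my-bucket/), is to reuse FileSystemWriter; a true cloud backend means subclassing StorageWriter/StorageReader.

```python
from torch.distributed.checkpoint import FileSystemWriter, save_state_dict

# Hypothetical mount point where an object-store bucket appears as a directory
writer = FileSystemWriter("/mnt/my-bucket/checkpoints/run-42/")
save_state_dict(state_dict, storage_writer=writer)
```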
🔹 How Can I Resume Training After a Crash?
Use load_state_dict() to restore the model and optimizer states, and reload any scalar values such as the training step.

```python
step = state_dict["step"].item()
```

Then continue training from step.
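Putting it together, a resume routine might look like this hedged sketch (train_one_step and total_steps are placeholders for your own training code):

```python
import torch
from torch.distributed.checkpoint import FileSystemReader, load_state_dict

def resume(model, optimizer, checkpoint_dir="checkpoint_dir/"):
    # Pre-populate with correctly shaped tensors, then load in place
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": torch.tensor([0]),
    }
    load_state_dict(state_dict, storage_reader=FileSystemReader(checkpoint_dir))
    model.load_state_dict(state_dict["model"])
    optimizer.load_state_dict(state_dict["optimizer"])
    return state_dict["step"].item()

start_step = resume(model, optimizer)
for step in range(start_step, total_steps):   # total_steps: placeholder
    train_one_step(model, optimizer)          # placeholder training step
```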
🎯 Final Thoughts
If you’re training large models across multiple GPUs or nodes, using torch.distributed.checkpoint is no longer optional; it’s essential. It enables:
- Faster I/O through sharding
- Lower memory consumption
- Resilient checkpointing
- Seamless integration with FSDP and PyTorch’s distributed ecosystem
Whether you’re building a massive transformer model or scaling a custom architecture to a GPU cluster, torch.distributed.checkpoint is the reliable and efficient way to handle checkpoints.