What is torch.amp in PyTorch?
`torch.amp` (Automatic Mixed Precision) is a PyTorch module that speeds up neural network training while maintaining accuracy by strategically using different numerical precisions:
- FP16 (16-bit floats) for faster computations
- FP32 (32-bit floats) for precision-critical operations
Key benefits:
- ✅ 1.5-3x faster training on compatible GPUs (NVIDIA Tensor Cores)
- ✅ Reduced memory usage (smaller model footprints)
- ✅ Minimal accuracy loss when configured properly
Code Examples: Using torch.amp
1. Basic Autocast Usage
```python
import torch
from torch.cuda.amp import autocast

# Create model and optimizer
model = torch.nn.Linear(100, 50).cuda()
optimizer = torch.optim.Adam(model.parameters())

# Training loop with AMP
for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()

    with autocast():  # Automatic precision selection
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
2. Gradient Scaling (Preventing Underflow)
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # Prevents gradient underflow in FP16

for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()

    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    optimizer.zero_grad()
    scaler.scale(loss).backward()  # Scaled backward pass
    scaler.step(optimizer)         # Unscales gradients, then steps the optimizer
    scaler.update()                # Adjusts the scale factor for the next iteration
```
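If you also clip gradients, call `scaler.unscale_(optimizer)` first so the clipping threshold applies to the true gradient magnitudes rather than the scaled ones. A minimal sketch (the linear model and toy tensors below are just for illustration):
```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(100, 50).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

inputs = torch.randn(32, 100, device="cuda")
targets = torch.randint(0, 50, (32,), device="cuda")

with autocast():
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)

optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # Gradients are now back at their true scale
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)      # Skips the update if inf/NaN gradients were detected
scaler.update()
```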
3. Mixed Precision Inference
```python
from torch.cuda.amp import autocast

@torch.inference_mode()  # Disables autograd tracking for faster inference
def predict(inputs):
    with autocast():  # Mixed precision forward pass
        return model(inputs.cuda())
```
Common Methods in torch.amp
| Method/Class | Purpose | When to Use |
|---|---|---|
| `autocast()` | Automatic precision selection | Wrapping forward passes |
| `GradScaler()` | Manages gradient scaling | Required for most FP16 training |
| `custom_fwd()` | Custom forward precision rules | When overriding autograd with custom Functions |
| `custom_bwd()` | Custom backward precision rules | Advanced gradient control |
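For `custom_fwd()`/`custom_bwd()` from the table above, a minimal sketch of how they decorate a `torch.autograd.Function` (the `SafeExp` op is a made-up example, not a library API):
```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class SafeExp(torch.autograd.Function):
    """Toy op forced to run in FP32 even inside autocast regions."""

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # Inputs are cast to FP32 before forward
    def forward(ctx, x):
        y = torch.exp(x)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    @custom_bwd  # Backward runs under the same autocast state as forward
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y
```
Inside an `autocast()` block, `SafeExp.apply(x)` then receives FP32 inputs regardless of the surrounding precision.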
Errors & Debugging Tips
Common Errors
- “CUDA error: operation not permitted”
  - Cause: Using non-CUDA tensors inside an autocast region
  - Fix: Ensure all tensors are on the GPU (`.cuda()`)
- NaN losses appearing suddenly
  - Cause: Gradient underflow without scaling
  - Fix: Always use `GradScaler` with FP16
- “RuntimeError: expected scalar type Float”
  - Cause: Manual dtype mismatches inside autocast
  - Fix: Let autocast handle dtype conversions
Debugging Checklist
- ✔️ Verify GPU compatibility (NVIDIA Volta+ recommended)
- ✔️ Check `scaler.is_enabled()` status
- ✔️ Monitor the loss scale with `scaler.get_scale()` (see the helper sketch below)
- ✔️ Compare FP32 vs. AMP validation accuracy
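A small helper for the checklist above (a sketch; `log_amp_health` is a hypothetical name, and it assumes your training loop already has a `scaler` and a `loss` tensor):
```python
import math

def log_amp_health(scaler, loss, step):
    # GradScaler exposes its enabled state and current loss scale.
    if scaler.is_enabled():
        print(f"step {step}: loss_scale={scaler.get_scale():.1f}")
    # A non-finite loss is the usual symptom of scaling problems;
    # GradScaler shrinks the scale automatically after skipped steps.
    if not math.isfinite(loss.item()):
        print(f"step {step}: non-finite loss detected")
```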
✅ People Also Ask (FAQ)
1. How does PyTorch AMP work?
PyTorch AMP automatically:
- Runs forward passes in FP16 where safe
- Keeps critical ops (softmax, reductions) in FP32
- Scales gradients to prevent underflow
- Updates weights in FP32
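You can observe this per-op behavior directly by checking output dtypes (a minimal sketch, assuming a CUDA GPU):
```python
import torch
from torch.cuda.amp import autocast

model = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

with autocast():
    y = model(x)                  # Linear/matmul: autocast runs it in FP16
    p = torch.softmax(y, dim=-1)  # Softmax: autocast keeps it in FP32

print(y.dtype)  # torch.float16
print(p.dtype)  # torch.float32
```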
2. What is AMP in deep learning?
AMP (Automatic Mixed Precision) is a technique that:
- Combines FP16 speed with FP32 stability
- Requires no manual dtype management
- Works best on modern NVIDIA GPUs
3. What is a Torch device?
A compute target for tensors:
- `torch.device('cpu')` for CPU execution
- `torch.device('cuda:0')` for GPU 0
- The FP16 AMP path requires CUDA devices
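A common pattern for picking a device (a minimal sketch):
```python
import torch

# Fall back to CPU when no GPU is present; FP16 AMP needs the CUDA branch.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, 8, device=device)
model = torch.nn.Linear(8, 2).to(device)
print(model(x).device)
```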
4. What is Torch Autocast?
A context manager that:
- Automatically selects FP16/FP32 per operation
- Handles dtype conversions transparently
- Should wrap forward passes only
5. Does AMP work on CPUs?
Mostly no. The FP16 + `GradScaler` workflow described here requires:
- An NVIDIA GPU with Tensor Cores
- CUDA enabled in PyTorch
- Compute capability 7.0+ (Volta or newer)
(`torch.amp.autocast` does offer a CPU bfloat16 mode, shown below, but the speedups discussed in this article target CUDA GPUs.)
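For completeness, a sketch of the CPU bfloat16 path mentioned above (no `GradScaler` is needed, since bfloat16 has the same exponent range as FP32):
```python
import torch

model = torch.nn.Linear(100, 50)  # stays on CPU
x = torch.randn(8, 100)

with torch.amp.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```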
6. When should I NOT use AMP?
Avoid when:
- Using custom ops without FP16 support
- Training extremely small networks
- Working on CPU-only systems
7. How much speedup can I expect?
Typical results:
- 1.5-2x faster on Volta GPUs
- 2-3x faster on Ampere GPUs
- 30-50% memory reduction
Best Practices for torch.amp
- Always use GradScaler for stable training
- Benchmark first – compare FP32 vs AMP accuracy
- Profile your model – identify AMP-friendly layers
- Watch for NaN values – indicates scaling issues
- Use `torch.backends.cuda.matmul.allow_tf32 = True` for extra speed on Ampere+ GPUs (sketch below)
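The TF32 flags are plain module-level switches; a sketch (the cuDNN flag is an additional toggle beyond the one named above):
```python
import torch

# On Ampere (compute capability 8.0+) GPUs, TF32 accelerates FP32 matmuls
# and convolutions with negligible accuracy impact for most training runs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```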
Conclusion
`torch.amp` is a game-changer for PyTorch performance, offering near-free speedups through intelligent precision management. By combining `autocast` with `GradScaler`, developers can achieve significant training acceleration while maintaining model accuracy.
Pro Tip: Start with AMP disabled to establish a baseline, then incrementally enable features while monitoring validation metrics. Many modern PyTorch models (like HuggingFace Transformers) now include built-in AMP support for out-of-the-box optimization.
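One way to follow that tip: both `autocast` and `GradScaler` accept an `enabled` flag, so the same loop can run the FP32 baseline and the AMP version. A minimal sketch with toy tensors (assumes a CUDA GPU):
```python
import torch
from torch.cuda.amp import autocast, GradScaler

use_amp = False  # Flip to True once the FP32 baseline looks healthy

model = torch.nn.Linear(100, 50).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler(enabled=use_amp)  # Acts as a pass-through when disabled

inputs = torch.randn(32, 100, device="cuda")
targets = torch.randint(0, 50, (32,), device="cuda")

with autocast(enabled=use_amp):       # Plain FP32 when disabled
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)

optimizer.zero_grad()
scaler.scale(loss).backward()         # No scaling applied when disabled
scaler.step(optimizer)
scaler.update()
```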