Distributed training using DeepSpeed
This feature is in Beta.
This page provides notebook examples for distributed training using DeepSpeed on Serverless GPU compute. DeepSpeed provides advanced memory optimization through its ZeRO (Zero Redundancy Optimizer) stages, enabling efficient training of large models.
When to use DeepSpeed
Use DeepSpeed when:
- You need advanced memory optimization beyond standard FSDP
- You want fine-grained control over optimizer state sharding (ZeRO Stage 1, 2, or 3)
- You need additional features like gradient accumulation fusion or CPU offloading
- You're working with large language models (1B to 100B+ parameters)
For simpler use cases, consider DDP. For PyTorch-native large model training, see FSDP.
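The ZeRO stage you pick determines which pieces of training state get partitioned across data-parallel workers. As a rough sketch (following the stage semantics documented by DeepSpeed; the helper name is illustrative, not part of any API):

```python
# What each DeepSpeed ZeRO stage partitions across data-parallel ranks.
# Each stage shards everything the previous stage does, plus one more component.
ZERO_STAGE_SHARDS = {
    1: ["optimizer states"],
    2: ["optimizer states", "gradients"],
    3: ["optimizer states", "gradients", "model parameters"],
}


def sharded_at(stage: int) -> list:
    """Return which training states ZeRO partitions at the given stage."""
    if stage not in ZERO_STAGE_SHARDS:
        raise ValueError(f"ZeRO stage must be 1, 2, or 3, got {stage}")
    return ZERO_STAGE_SHARDS[stage]
```

Higher stages save more memory per GPU at the cost of additional communication, which is why Stage 3 is the usual choice for models that do not fit on a single device.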
Supervised fine-tuning using TRL and DeepSpeed ZeRO Stage 3
This notebook demonstrates how to use the Serverless GPU Python API to run supervised fine-tuning (SFT) with the Transformer Reinforcement Learning (TRL) library and DeepSpeed ZeRO Stage 3 optimization on a single-node A10 GPU. The approach extends to multi-node setups.
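The wiring in the notebook can be sketched as follows. This is a minimal, hedged example, not the notebook's exact code: a small ZeRO Stage 3 config is written to a JSON file and handed to TRL's `SFTTrainer` through the `deepspeed` field of `SFTConfig` (which inherits it from `transformers.TrainingArguments`). The model name, dataset, and `build_trainer` helper are illustrative assumptions; substitute your own.

```python
import json


def zero3_config(micro_batch: int = 1, grad_accum: int = 4) -> dict:
    """Minimal DeepSpeed ZeRO Stage 3 config; "auto" defers values to the Trainer."""
    return {
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
    }


def build_trainer(config_path: str = "ds_zero3.json"):
    # Imports kept local so the config helper above works without TRL installed.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    with open(config_path, "w") as f:
        json.dump(zero3_config(), f)

    args = SFTConfig(
        output_dir="./sft-out",
        deepspeed=config_path,  # hands the ZeRO-3 config to the Trainer
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        bf16=True,
    )
    return SFTTrainer(
        model="Qwen/Qwen2-0.5B",  # assumed example model, not from the notebook
        train_dataset=load_dataset("trl-lib/Capybara", split="train"),
        args=args,
    )
```

On Serverless GPU compute the training entry point is launched through the Serverless GPU Python API rather than a manual `deepspeed` CLI invocation; see the notebook for the exact launch pattern.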