Distributed training using DeepSpeed
This feature is in Beta.
This page provides notebook examples for distributed training using DeepSpeed on Serverless GPU compute. DeepSpeed provides advanced memory optimization through its ZeRO (Zero Redundancy Optimizer) stages, enabling efficient training of large models.
When to use DeepSpeed
Use DeepSpeed when:
- You need advanced memory optimization beyond standard FSDP
- You want fine-grained control over optimizer state sharding (ZeRO Stage 1, 2, or 3)
- You need additional features like gradient accumulation fusion or CPU offloading
- You're working with large language models (1B to 100B+ parameters)
For simpler use cases, consider DDP. For PyTorch-native large model training, see FSDP.
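The ZeRO stage you pick determines which pieces of training state get partitioned across data-parallel workers. As a rough sketch (following the stage semantics documented by DeepSpeed; the helper name is illustrative, not part of any API):

```python
# What each DeepSpeed ZeRO stage partitions across data-parallel ranks.
# Each stage shards everything the previous stage does, plus one more component.
ZERO_STAGE_SHARDS = {
    1: ["optimizer states"],
    2: ["optimizer states", "gradients"],
    3: ["optimizer states", "gradients", "model parameters"],
}


def sharded_at(stage: int) -> list:
    """Return which training states ZeRO partitions at the given stage."""
    if stage not in ZERO_STAGE_SHARDS:
        raise ValueError(f"ZeRO stage must be 1, 2, or 3, got {stage}")
    return ZERO_STAGE_SHARDS[stage]
```

Higher stages save more memory per GPU at the cost of additional communication, which is why Stage 3 is the usual choice for models that do not fit on a single device.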
Supervised fine-tuning using TRL and DeepSpeed ZeRO Stage 3
This notebook demonstrates how to use the Serverless GPU Python API to run supervised fine-tuning (SFT) with the Transformer Reinforcement Learning (TRL) library and DeepSpeed ZeRO Stage 3 optimization on a single-node A10 GPU. The approach extends to multi-node setups.
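The wiring in the notebook can be sketched as follows. This is a minimal, hedged example, not the notebook's exact code: a small ZeRO Stage 3 config is written to a JSON file and handed to TRL's `SFTTrainer` through the `deepspeed` field of `SFTConfig` (which inherits it from `transformers.TrainingArguments`). The model name, dataset, and `build_trainer` helper are illustrative assumptions; substitute your own.

```python
import json


def zero3_config(micro_batch: int = 1, grad_accum: int = 4) -> dict:
    """Minimal DeepSpeed ZeRO Stage 3 config; "auto" defers values to the Trainer."""
    return {
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
    }


def build_trainer(config_path: str = "ds_zero3.json"):
    # Imports kept local so the config helper above works without TRL installed.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    with open(config_path, "w") as f:
        json.dump(zero3_config(), f)

    args = SFTConfig(
        output_dir="./sft-out",
        deepspeed=config_path,  # hands the ZeRO-3 config to the Trainer
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        bf16=True,
    )
    return SFTTrainer(
        model="Qwen/Qwen2-0.5B",  # assumed example model, not from the notebook
        train_dataset=load_dataset("trl-lib/Capybara", split="train"),
        args=args,
    )
```

On Serverless GPU compute the training entry point is launched through the Serverless GPU Python API rather than a manual `deepspeed` CLI invocation; see the notebook for the exact launch pattern.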