Fully Sharded Data Parallel (FSDP) training
This feature is in Beta.
This page provides notebook examples of Fully Sharded Data Parallel (FSDP) training on Serverless GPU compute. FSDP shards model parameters, gradients, and optimizer states across GPUs, enabling training of very large models that don't fit in a single GPU's memory.
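To make the sharding concrete, the following is a minimal sketch using PyTorch's FSDP2 `fully_shard` API (exposed as `torch.distributed.fsdp.fully_shard` in recent PyTorch releases). The toy model, sizes, and hyperparameters are illustrative only, and the script assumes it is launched with `torchrun` so that one process runs per GPU.

```python
# Minimal FSDP2 sketch (illustrative).
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 API in recent PyTorch releases

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# A toy model; in practice this would be a model too large for a single GPU.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

# fully_shard replaces each parameter with a sharded DTensor, so every rank
# holds only 1/world_size of the parameters (and later of the gradients).
fully_shard(model)

# The optimizer is created after sharding, so its state is sharded as well.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for _ in range(10):
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).sum()
    loss.backward()       # gradients are reduce-scattered into per-rank shards
    optimizer.step()      # each rank updates only its own parameter shard
    optimizer.zero_grad()

dist.destroy_process_group()
```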
When to use FSDP
Use FSDP when:
- Your model is too large to fit in a single GPU's memory
- You need to train models in the 20B to 120B+ parameter range
- You want better memory efficiency than Distributed Data Parallel (DDP) provides
For smaller models that fit in a single GPU's memory, consider DDP for simplicity. For advanced memory optimization features, see DeepSpeed.
Training a 10-million-parameter Transformer model using FSDP2
The following notebook demonstrates distributed training of a 10-million-parameter Transformer model using the PyTorch FSDP2 API.
PyTorch FSDP
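The notebook runs on Serverless GPU compute, but its core pattern can be sketched as follows: build a small Transformer, apply `fully_shard` to each block and then to the root module, and train as usual. The architecture, dimensions, and training loop below are illustrative stand-ins for the notebook's code, not a copy of it.

```python
# Sketch of FSDP2 applied per Transformer block (illustrative).
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


class TinyTransformer(nn.Module):
    """Roughly 10 million parameters with these illustrative sizes."""

    def __init__(self, vocab=8000, d_model=256, nhead=8, layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
            )
            for _ in range(layers)
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        h = self.embed(tokens)
        for block in self.blocks:
            h = block(h)
        return self.head(h)


model = TinyTransformer().cuda()

# Shard each Transformer block separately, then the remaining root parameters.
# Per-block sharding lets FSDP overlap the all-gather of the next block's
# parameters with the current block's computation.
for block in model.blocks:
    fully_shard(block)
fully_shard(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    tokens = torch.randint(0, 8000, (16, 128), device="cuda")
    logits = model(tokens[:, :-1])                      # next-token prediction
    loss = loss_fn(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```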
Training the OpenAI GPT-OSS 120B model using TRL and FSDP
This notebook demonstrates how to run supervised fine-tuning (SFT) on the GPT-OSS 120B model using FSDP2 and the Transformer Reinforcement Learning (TRL) library. The example uses FSDP to reduce memory consumption and DDP to scale the global batch size across 8 H100 GPUs.
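At a high level, the TRL side of such a run reduces to an `SFTTrainer` invocation launched with one process per GPU and an FSDP-enabled Accelerate configuration. The sketch below shows only that shape; the dataset, output path, and hyperparameters are placeholders and do not reproduce the notebook's exact configuration.

```python
# Sketch of TRL supervised fine-tuning under FSDP (illustrative).
# Typically launched with one process per GPU, for example:
#   accelerate launch --config_file fsdp_config.yaml sft.py
# where fsdp_config.yaml enables FSDP sharding across the 8 GPUs.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; the notebook uses its own data.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="gpt-oss-120b-sft",      # illustrative output path
    per_device_train_batch_size=1,      # keep per-GPU activation memory low
    gradient_accumulation_steps=4,      # grow the effective global batch size
    gradient_checkpointing=True,        # trade compute for activation memory
    bf16=True,
    learning_rate=2e-5,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="openai/gpt-oss-120b",        # Hugging Face model ID for GPT-OSS 120B
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```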