
Fully Sharded Data Parallel (FSDP) training


This feature is in Beta.

This page provides notebook examples for Fully Sharded Data Parallel (FSDP) training on Serverless GPU compute. FSDP shards model parameters, gradients, and optimizer states across GPUs, enabling training of very large models that don't fit in a single GPU's memory.

When to use FSDP

Use FSDP when:

  • Your model is too large to fit in a single GPU's memory
  • You need to train models in the 20B to 120B+ parameter range
  • You want more memory efficiency than DDP provides

For smaller models that fit in a single GPU's memory, consider DDP for simplicity. For advanced memory optimization features, see DeepSpeed.

Training a 10-million-parameter Transformer model using FSDP2

The following notebook demonstrates distributed training of a 10-million-parameter Transformer model using the FSDP2 library.

PyTorch FSDP

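
The snippet below is a minimal sketch of the core FSDP2 sharding pattern, not the notebook itself. It assumes a recent PyTorch release that exposes `fully_shard` under `torch.distributed.fsdp` and a launch with `torchrun` so the process group environment variables are set; the model dimensions are illustrative placeholders.

```python
# Minimal FSDP2 sketch: shard a small Transformer across all visible GPUs.
# Assumes `torchrun --nproc-per-node=<num_gpus> train.py` and a PyTorch
# version that exposes fully_shard under torch.distributed.fsdp.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard


class TransformerBlock(torch.nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = torch.nn.Sequential(*[TransformerBlock() for _ in range(8)]).cuda()

    # Shard each block, then the root module, so parameters, gradients,
    # and optimizer states are all partitioned across ranks.
    for block in model:
        fully_shard(block)
    fully_shard(model)

    # Build the optimizer after sharding so it tracks the sharded parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    x = torch.randn(8, 128, 256, device="cuda")
    loss = model(x).float().pow(2).mean()  # placeholder loss for illustration
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```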

Training the OpenAI GPT-OSS 120B model using TRL and FSDP

This notebook demonstrates supervised fine-tuning (SFT) of the GPT-OSS 120B model using FSDP2 and the Transformer Reinforcement Learning (TRL) library. The example uses FSDP to reduce per-GPU memory consumption and DDP to scale the global batch size across 8 H100 GPUs.

TRL FSDP

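
The sketch below shows one way SFT with TRL and FSDP can be wired together; it is not the notebook's exact configuration. The model identifier, dataset, transformer layer class name, and FSDP settings are illustrative assumptions, and it assumes the `trl`, `transformers`, and `datasets` packages plus a distributed launch with `accelerate launch` or `torchrun`.

```python
# Hedged sketch of supervised fine-tuning with TRL + FSDP.
# The model name, dataset, and FSDP wrap settings below are placeholders,
# not taken from the notebook; a real 120B run also needs scale-specific
# settings (sharded checkpoints, memory-efficient loading, etc.).
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model_name = "openai/gpt-oss-120b"  # assumed Hugging Face model identifier

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16")
dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset

training_args = SFTConfig(
    output_dir="gpt-oss-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    # Ask the Trainer to shard the model with FSDP; the wrap policy can also
    # be supplied through an accelerate FSDP config file instead.
    fsdp="full_shard auto_wrap",
    fsdp_config={
        # Assumed decoder layer class name for GPT-OSS in transformers.
        "transformer_layer_cls_to_wrap": ["GptOssDecoderLayer"],
    },
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```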