Fully Sharded Data Parallel (FSDP) training

This feature is in Beta.

This page has notebook examples for using Fully Sharded Data Parallel (FSDP) training on Serverless GPU compute. These examples demonstrate how to scale training across multiple GPUs and nodes for improved performance.

Training a Transformer model with 10 million parameters using FSDP2

The following notebook demonstrates distributed training of a 10-million-parameter Transformer model using the FSDP2 library.
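
For orientation before opening the notebook, the overall shape of an FSDP2 training script looks roughly like the sketch below. This is a minimal illustration, not the notebook's code: it assumes PyTorch 2.6 or later (where `fully_shard` is exposed under `torch.distributed.fsdp`), and the model dimensions, hyperparameters, and synthetic data are placeholders.

```python
# Minimal FSDP2 sketch (assumes PyTorch >= 2.6 and a torchrun launch, e.g.
# `torchrun --nproc_per_node=<num_gpus> train.py`). Sizes are illustrative.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard


def main():
    # torchrun sets LOCAL_RANK; pin each process to its GPU.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Small Transformer stand-in for the ~10M-parameter model in the notebook.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
        num_layers=6,
    ).cuda()

    # FSDP2 style: shard each layer, then the root module.
    for layer in model.layers:
        fully_shard(layer)
    fully_shard(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # One synthetic training step to show the loop structure.
    inputs = torch.randn(8, 128, 256, device="cuda")
    loss = model(inputs).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The notebook covers the full workflow, including the dataset, training loop, and how to run it on Serverless GPU compute.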

Notebook

Open notebook in new tab