Multi-GPU and multi-node distributed training

Beta

This feature is in Beta.

This page has notebook examples for multi-node and multi-GPU distributed training using Serverless GPU compute. These examples demonstrate how to scale training across multiple GPUs and nodes for improved performance.

note

Multi-node distributed training is currently only supported on A10 GPUs. Multi-GPU distributed training is supported on both A10 and H100 GPUs.

Choose your parallelism technique

When scaling your model training across multiple GPUs, choosing the right parallelism technique depends on your model size, available GPU memory, and performance requirements.

Technique	When to use
DDP (Distributed Data Parallel)	Full model fits in single GPU memory; need to scale data throughput
FSDP (Fully Sharded Data Parallel)	Very large models that don't fit in single GPU memory
DeepSpeed ZeRO	Large models with advanced memory optimization needs

For detailed information about each technique, see DDP, FSDP, and DeepSpeed.

Example notebooks by technique and framework

The following table organizes example notebooks by the framework/library you're using and the parallelism technique applied. Multiple notebooks may appear in a single cell.

Framework/Library	DDP examples	FSDP examples	DeepSpeed examples
PyTorch (native)	Simple MLP neural network	10M parameter transformer	—
Huggingface TRL	Fine-tune Gpt OSS 20B	Fine-tune Gpt OSS 120B	Fine-tune Llama 3.2 1B
Unsloth	Fine-tune Llama 3.2 3B	—	—
Ray Train	ResNet18 on FashionMNIST (computer vision)	—	—
Lightning	Two-tower recommender system	—	—

Get started

The following notebook has a basic example of how to use the Serverless GPU Python API to launch multiple A10 GPUs for distributed training.

Serverless GPU API: A10 starter

Open notebook in new tab

The following notebook has a basic example of how to use the Serverless GPU Python API to launch multiple H100 GPUs for distributed training.

Serverless GPU API: H100 starter

Open notebook in new tab

Choose your parallelism technique​

Example notebooks by technique and framework​

Get started​

Serverless GPU API: A10 starter

Serverless GPU API: H100 starter

Choose your parallelism technique

Example notebooks by technique and framework

Get started