Multi-GPU distributed training

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

These notebooks scale model training across multiple GPUs and nodes on AI Runtime. They cover the three main parallelism techniques, DDP, FSDP, and DeepSpeed ZeRO, using the serverless_gpu Python API on H100 GPUs.

note

Multi-GPU distributed training is supported on H100 GPUs.

Choose your parallelism technique

When scaling your model training across multiple GPUs, choosing the right parallelism technique depends on your model size, available GPU memory, and performance requirements.

Technique	When to use
DDP (Distributed Data Parallel)	Full model fits in single GPU memory; need to scale data throughput
FSDP (Fully Sharded Data Parallel)	Very large models that don't fit in single GPU memory
DeepSpeed ZeRO	Large models with advanced memory optimization needs

Technique	When to use
DDP (Distributed Data Parallel)	Full model fits in single GPU memory; need to scale data throughput
FSDP (Fully Sharded Data Parallel)	Very large models that don't fit in single GPU memory
DeepSpeed ZeRO	Large models with advanced memory optimization needs

For detailed information about each technique, see DDP, FSDP, and DeepSpeed.

Example notebooks by technique and framework

The following table organizes example notebooks by the framework/library you're using and the parallelism technique applied. Multiple notebooks may appear in a single cell.

Framework/Library	DDP examples	FSDP examples	DeepSpeed examples
PyTorch (native)	Simple MLP neural network RetinaNet image detection	10M parameter transformer	—
Huggingface TRL	Fine-tune Gpt OSS 20B	Fine-tune Gpt OSS 120B	Fine-tune Llama 3.2 1B
Unsloth	Fine-tune Llama 3.2 3B	—	—
Axolotl	Fine-tune Olmo3 7B	—	—
Mosaic LLM Foundry	Fine-tune Llama 3.2 8B	—	—
Lightning	Two-tower recommender system	—	—

Framework/Library	DDP examples	FSDP examples	DeepSpeed examples
PyTorch (native)	Simple MLP neural network RetinaNet image detection	10M parameter transformer	—
Huggingface TRL	Fine-tune Gpt OSS 20B	Fine-tune Gpt OSS 120B	Fine-tune Llama 3.2 1B
Unsloth	Fine-tune Llama 3.2 3B	—	—
Axolotl	Fine-tune Olmo3 7B	—	—
Mosaic LLM Foundry	Fine-tune Llama 3.2 8B	—	—
Lightning	Two-tower recommender system	—	—

Get started

Use the following tutorials to get started with the serverless GPU Python library for distributed training:

Tutorial	Description
AI Runtime with H100 GPUs	Learn how to use Databricks AI Runtime with H100 accelerators to run distributed GPU workloads using the serverless_gpu Python library.

Choose your parallelism technique​

Example notebooks by technique and framework​

Get started​

Choose your parallelism technique

Example notebooks by technique and framework

Get started