
Distributed Data Parallel (DDP) training

This feature is in Beta.

This page provides notebook examples for using Distributed Data Parallel (DDP) training on Serverless GPU compute. DDP is the most common parallelism technique for distributed training: the full model is replicated on each GPU, and data batches are split across the GPUs.

When to use DDP

Use DDP when:

  • Your model fits completely in a single GPU's memory
  • You want to scale training by increasing data throughput
  • You need the simplest distributed training approach with automatic support in most frameworks

For larger models that don't fit in single GPU memory, consider FSDP or DeepSpeed instead.
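As a rough way to check the first condition, you can estimate training memory from the parameter count. The sketch below assumes mixed-precision training with an Adam-style optimizer, where weights, gradients, and optimizer state commonly total about 16 bytes per parameter; the figures are a rule-of-thumb estimate, not an exact accounting.

```python
# Back-of-envelope memory estimate for DDP feasibility (illustrative).
# Assumption: mixed precision + Adam ~ 16 bytes of training state per
# parameter (fp16 weights and grads, fp32 master weights, Adam m and v).
params = 3e9                                  # e.g., a 3B-parameter model
state_gib = params * 16 / 2**30
print(f"~{state_gib:.0f} GiB of training state per GPU")
# ~45 GiB: exceeds an A10's 24 GiB but fits on an H100's 80 GiB, so plain
# DDP works on H100s while A10s would need FSDP, DeepSpeed, or a
# memory-efficient fine-tuning method such as LoRA.
```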

Training a simple multilayer perceptron (MLP) neural network using PyTorch DDP

The following notebook demonstrates distributed training of a simple multilayer perceptron (MLP) neural network using PyTorch's DDP module on Databricks with serverless GPU resources.

Notebook: PyTorch DDP
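For orientation, the following is a minimal sketch of the pattern the notebook implements, assuming one process per GPU launched by torchrun or a similar launcher; the toy data, layer sizes, and hyperparameters are illustrative.

```python
# Minimal DDP training sketch: one process per GPU, launched by torchrun.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group("nccl")                 # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])      # set by the launcher
    torch.cuda.set_device(local_rank)

    # Toy regression data; DistributedSampler gives each rank its own shard.
    dataset = TensorDataset(torch.randn(4096, 32), torch.randn(4096, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    model = DDP(mlp.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                    # reshuffle shards per epoch
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb.cuda(local_rank)), yb.cuda(local_rank))
            loss.backward()                         # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```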

Training the OpenAI GPT-OSS 20B model on 8xH100 using TRL and DDP

This notebook demonstrates how to use the Serverless GPU Python API to run supervised fine-tuning (SFT) on the GPT-OSS 20B model from Hugging Face using the Transformer Reinforcement Learning (TRL) library. The example uses DDP across all eight H100 GPUs on the node to scale the global batch size.

Notebook: TRL DDP
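A condensed sketch of the TRL portion is shown below; the dataset and hyperparameters are illustrative stand-ins, and the notebook's Serverless GPU launch code is omitted. When the script runs with one process per GPU, the underlying Transformers Trainer applies DDP automatically.

```python
# Hedged sketch of SFT with TRL; dataset and settings are illustrative,
# not the notebook's actual configuration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("trl-lib/Capybara", split="train")  # example dataset

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",          # Hugging Face model ID
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="gpt-oss-20b-sft",
        per_device_train_batch_size=1,   # global batch = 8 GPUs x 1 under DDP
        gradient_checkpointing=True,     # trade compute for memory headroom
        bf16=True,
    ),
)
trainer.train()
```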

Distributed fine-tuning of Llama 3.2 3B using Unsloth

This notebook demonstrates how to use the Serverless GPU Python API to fine-tune a Llama 3.2 3B model with the Unsloth library across 8 A10 GPUs. Unsloth provides memory-efficient training optimizations and uses DDP under the hood via Hugging Face Accelerate.

Notebook: Unsloth DDP
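The core Unsloth setup looks roughly like the sketch below; the model checkpoint name, sequence length, and LoRA settings are illustrative assumptions, and the distributed launch details are left to the notebook.

```python
# Minimal Unsloth fine-tuning sketch (illustrative settings).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # assumed checkpoint
    max_seq_length=2048,
    load_in_4bit=True,                           # memory-efficient loading
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                        # LoRA rank (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# Training then proceeds with a standard Hugging Face/TRL trainer; when
# launched with one process per GPU, Accelerate applies DDP automatically.
```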

Distributed training using Ray Train (computer vision)

This notebook demonstrates distributed training of a PyTorch ResNet model on the FashionMNIST dataset using Ray Train and Ray Data on Databricks Serverless GPU clusters. Ray Train provides high-level distributed training orchestration and uses DDP as the underlying parallelism strategy. This example covers setting up Unity Catalog storage, configuring Ray for multi-node GPU training, logging and registering models with MLflow, and evaluating model performance.

Notebook: Ray DDP
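At its core, the Ray Train setup follows the pattern below; the per-worker function is heavily condensed (a stand-in model and synthetic batch replace the ResNet and FashionMNIST pipeline), and the Unity Catalog and MLflow steps are left to the notebook.

```python
# Condensed Ray Train sketch; model, data, and loss are placeholders.
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    device = ray.train.torch.get_device()
    model = nn.Linear(28 * 28, 10)                 # stand-in for the ResNet
    model = ray.train.torch.prepare_model(model)   # moves to GPU, wraps in DDP
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(2):                             # placeholder training loop
        x = torch.randn(64, 28 * 28, device=device)
        loss = model(x).square().mean()            # placeholder loss
        optimizer.zero_grad()
        loss.backward()                            # gradients all-reduced by DDP
        optimizer.step()

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # one worker per GPU
)
result = trainer.fit()
```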

Training a two-tower recommender system using PyTorch Lightning

This notebook demonstrates how to train a two-tower recommendation model using PyTorch Lightning on serverless GPU compute. PyTorch Lightning provides a high-level interface that automatically handles DDP configuration for multi-GPU training. The example includes data preparation using Mosaic Streaming (MDS) format and distributed training across A10 or H100 GPUs.
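The DDP wiring in Lightning reduces to a few Trainer arguments, as in the sketch below; the two-tower module and data module names are hypothetical placeholders for the notebook's code.

```python
# Minimal sketch of Lightning's DDP configuration; TwoTowerModule and
# recsys_datamodule are hypothetical stand-ins for the notebook's code.
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,          # one DDP process per GPU
    strategy="ddp",     # Lightning wraps the model in DistributedDataParallel
    max_epochs=3,
)
# trainer.fit(TwoTowerModule(), datamodule=recsys_datamodule)
```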

See the Deep learning recommendation examples page for the complete notebooks, including:

  • Data preparation and MDS format conversion
  • Two-tower recommender training with PyTorch Lightning