Distributed training in notebooks

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

note

This page covers distributed training from Databricks notebooks with the serverless GPU Python API. To submit distributed training workloads from your local machine, use the AI Runtime CLI, which is in Public Preview. See AI Runtime CLI.

You can launch distributed workloads across multiple GPUs on a single node using the Serverless GPU Python API. The API provides a simple, unified interface that abstracts away the details of GPU provisioning, environment setup, and workload distribution. With minimal code changes, you can seamlessly move from single-GPU training to multi-GPU distributed execution from the same notebook.

note

Distributed training requires an 8xH100 accelerator, which provisions a single node with 8 GPUs. When using the @distributed decorator, set gpus=8. The gpu_type parameter is optional and automatically detected from the accelerator your notebook is connected to.

Supported frameworks

The @distributed API integrates with major distributed training libraries:

PyTorch Distributed Data Parallel (DDP): Standard multi-GPU data parallelism.
Fully Sharded Data Parallel (FSDP): Memory-efficient training for large models.
DeepSpeed: Microsoft's optimization library for large model training.

serverless_gpu API vs. TorchDistributor

The following table compares the serverless_gpu @distributed API with TorchDistributor:

Feature	`serverless_gpu` `@distributed` API	TorchDistributor
Infrastructure	Fully serverless, no cluster management	Requires a Spark cluster with GPU workers
Setup	Single decorator, minimal configuration	Requires Spark cluster and TorchDistributor setup
Framework support	PyTorch DDP, FSDP, DeepSpeed	Primarily PyTorch DDP
Data loading	Inside decorator, uses Unity Catalog Volumes (`UCVolumeDataset` for streaming file data)	Via Spark or filesystem

Feature	`serverless_gpu` `@distributed` API	TorchDistributor
Infrastructure	Fully serverless, no cluster management	Requires a Spark cluster with GPU workers
Setup	Single decorator, minimal configuration	Requires Spark cluster and TorchDistributor setup
Framework support	PyTorch DDP, FSDP, DeepSpeed	Primarily PyTorch DDP
Data loading	Inside decorator, uses Unity Catalog Volumes (`UCVolumeDataset` for streaming file data)	Via Spark or filesystem

The serverless_gpu API is the recommended approach for new deep learning workloads on Databricks. TorchDistributor remains available for workloads tightly coupled with Spark clusters.

Quick start

The serverless GPU API for distributed training is preinstalled when you are connected to a serverless GPU within Databricks notebooks and jobs. We recommend GPU environment 4 and above. To use it for distributed training, import and use the distributed decorator to distribute your training function.

Wrap the model training code in a function and decorate the function with the @distributed decorator. The decorated function becomes the entrypoint for distributed execution, so all training logic, data loading, and model initialization should be defined inside this function.

To launch distributed execution, call your decorated function using train_function.distributed(). Each call automatically creates an MLflow experiment run, or a nested child run if one is already active.

warning

If setting gpu_type in @distributed, make sure it matches the accelerator type your notebook is connected to ("H100" or "A10"). Specifying the wrong accelerator type will cause the workload to fail.

The code snippet below shows the basic usage of @distributed:

Python
from serverless_gpu import distributed

# Decorate your training function with @distributed and specify the number of GPUs.
# gpu_type='H100' is optional and will be auto-detected if not set.
@distributed(gpus=8, gpu_type='H100')
def run_train():
    ...

run_train.distributed()

Below is a full example that trains a multilayer perceptron (MLP) model on 8 H100 GPUs from a notebook:

Set up your model and define utility functions.

Python

# Define the model
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def setup():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("nccl")

def cleanup():
    dist.destroy_process_group()

class SimpleMLP(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=64, output_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

Import the serverless_gpu library and the distributed module.

Python
import serverless_gpu
from serverless_gpu import distributed

Wrap the model training code in a function and decorate the function with the @distributed decorator.

Python
@distributed(gpus=8, gpu_type='H100')
def run_train(num_epochs: int, batch_size: int) -> None:
    import mlflow
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # 1. Set up multi-GPU environment
    setup()
    device = torch.device(f"cuda:{int(os.environ['LOCAL_RANK'])}")

    # 2. Apply the Torch distributed data parallel (DDP) library for data-parellel training.
    model = SimpleMLP().to(device)
    model = DDP(model, device_ids=[device])

    # 3. Create and load dataset.
    x = torch.randn(5000, 10)
    y = torch.randn(5000, 1)

    dataset = TensorDataset(x, y)
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=batch_size)

    # 4. Define the training loop.
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)
        model.train()
        total_loss = 0.0
        for step, (xb, yb) in enumerate(dataloader):
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            # Log loss to MLflow metric
            mlflow.log_metric("loss", loss.item(), step=step)

            loss.backward()
            optimizer.step()
            total_loss += loss.item() * xb.size(0)

        mlflow.log_metric("total_loss", total_loss)
        print(f"Total loss for epoch {epoch}: {total_loss}")

    cleanup()

Execute the distributed training by calling the distributed function with user-defined arguments.
Python
```
run_train.distributed(num_epochs=3, batch_size=1)
```
When executed, an MLflow run link is generated in the notebook cell output. Click the MLflow run link or find it in the Experiment panel to see the run results. For details on customizing experiment names, tracking metrics, and resuming runs, see Experiment tracking and observability.

Distributed execution details

Serverless GPU API consists of several key components:

Compute manager: Handles resource allocation and management
Runtime environment: Manages Python environments and dependencies
Launcher: Orchestrates job execution and monitoring

When running in distributed mode:

The function is serialized and distributed across the specified number of GPUs
Each GPU runs a copy of the function with the same parameters
The environment is synchronized across all GPUs
Results are collected and returned from all GPUs
Lifecycle management: The distributed execution runs within the lifecycle of the notebook. When the notebook terminates, so does the execution. The @distributed decorator has a default timeout of 3 hours. To set a custom timeout, pass timeout in seconds, or timeout=None to disable it. Setting a timeout is available in GPU environment v5 and above.

The API supports popular parallel training libraries such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), DeepSpeed.

You can find more real distributed training scenarios using the various libraries in notebook examples.

FAQs

Where should the data loading code be placed?

When using the Serverless GPU API for distributed training, move data loading code inside the @distributed decorator. The dataset size can exceed the maximum size allowed by pickle, so it is recommended to generate the dataset inside the decorator, as shown below:

Python
from serverless_gpu import distributed

# this may cause pickle error
dataset = get_dataset(file_path)
@distributed(gpus=8, gpu_type='H100')
def run_train():
  # good practice
  dataset = get_dataset(file_path)
  ....

For file-based data stored in Unity Catalog volumes, use UCVolumeDataset from serverless_gpu.data, which streams files with local caching and partitions them across ranks and workers automatically. To checkpoint distributed training to a volume, use UCVolumeWriter and UCVolumeReader. See Load data on AI Runtime and Model checkpointing.

Learn more

For the API reference, refer to the Serverless GPU Python API documentation.

Supported frameworks​

serverless_gpu API vs. TorchDistributor​

Quick start​

Distributed execution details​

FAQs​

Where should the data loading code be placed?​

Learn more​