AI Runtime CLI examples
Beta
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
The following examples are complete, end-to-end workloads you submit from the air CLI
with air run -f train.yaml. Each shows a real distributed-training pattern on H100 GPUs,
including the workload YAML, launcher script, and training code. Start with the
quickstart if you haven't submitted a run before.
-
- Multi-node LLM fine-tuning with FSDP
- Supervised fine-tuning of Llama-3.1-8B across 16 H100 GPUs (2 nodes) using
torchrunand PyTorch Fully Sharded Data Parallel (FSDP). Logs to MLflow and checkpoints to a Unity Catalog volume.
-
- Distributed training with Ray Train
- Distributed data-parallel fine-tuning with Ray Train's
TorchTraineracross 8 H100 GPUs on a single node, with one worker per GPU.