Workload YAML reference
The AI Runtime CLI is in Beta.
This page is the reference for workload YAML configurations passed to air run --file.
The ground truth for YAML configuration is the in-CLI help. Run air -h config for the top-level view and air -h config.<section> (for example, air -h config.environment) for per-section detail.
Minimal configuration
experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
Submit with:
air run --file train.yaml -p profile
Core concepts
Core fields
Most training configurations include five components:
- experiment_name: Required. Creates or appends to an MLflow experiment.
- environment: Optional. Python dependencies and base environment.
- compute: Required. GPU resources (type and count).
- command: Required. The bash command or commands used to launch training.
- code_source: Optional. Path to your training code, made available remotely.
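The required/optional split above can be checked before submitting. This is a minimal sketch, not part of the air CLI: missing_required_fields is a hypothetical helper that takes an already-parsed configuration (a plain dict) and reports which required top-level fields are absent.

```python
# Required top-level fields, as listed above.
REQUIRED_FIELDS = ("experiment_name", "compute", "command")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required top-level fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


if __name__ == "__main__":
    config = {
        "experiment_name": "my-training",
        "compute": {"num_accelerators": 1, "accelerator_type": "GPU_1xA10"},
    }
    print(missing_required_fields(config))  # ['command']
```

The in-CLI help remains the ground truth for what air actually validates; a check like this only catches the obvious omissions early.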
Your first training job
experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
In this configuration:
- experiment_name creates an MLflow experiment named simple-training (or appends a new run if it already exists).
- environment installs dependencies from requirements.yaml.
- compute allocates one H100 node (8 H100 GPUs).
- code_source uploads the folder repo to the node, available at $CODE_SOURCE_PATH.
- command runs train.py via torchrun across the 8 H100 GPUs. The file lives at /home/username/repo/train.py locally.
Common use cases
Add environment variables
experiment_name: training-with-env
environment:
  dependencies: requirements.yaml
  env_variables:
    BATCH_SIZE: '32'
    LEARNING_RATE: '0.001'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
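The values under env_variables are quoted strings, so a training script has to convert them itself. This is a minimal sketch, assuming the variables reach the command as ordinary process environment variables (this page does not spell that out). The helper read_training_env and its fallback defaults are hypothetical.

```python
import os


def read_training_env(environ=os.environ) -> dict:
    """Parse BATCH_SIZE and LEARNING_RATE from string environment variables."""
    return {
        # Fall back to illustrative defaults when a variable is unset.
        "batch_size": int(environ.get("BATCH_SIZE", "32")),
        "learning_rate": float(environ.get("LEARNING_RATE", "0.001")),
    }


if __name__ == "__main__":
    print(read_training_env())
```

Passing the mapping as an argument keeps the function easy to test without touching the real environment.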
Use secrets (API keys, tokens)
experiment_name: training-with-secrets
environment:
  dependencies: requirements.yaml
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
    WANDB_API_KEY: 'my_scope/wandb'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
Secrets use the format scope/key and must be configured in Databricks Secrets. See Secret management for setup.
When sharing a YAML template, other users must create their own secrets or have access to the referenced secret.
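The scope/key format can be checked locally before sharing a template. This is a minimal sketch: split_secret_ref is a hypothetical helper, and the rule that a reference contains exactly one slash is an assumption drawn from the format described above, not a documented air check.

```python
def split_secret_ref(ref: str) -> tuple[str, str]:
    """Split a 'scope/key' secret reference into (scope, key)."""
    scope, sep, key = ref.partition("/")
    # Assumption: exactly one slash, with non-empty text on both sides.
    if not sep or not scope or not key or "/" in key:
        raise ValueError(f"expected 'scope/key', got {ref!r}")
    return scope, key


if __name__ == "__main__":
    print(split_secret_ref("my_scope/hf_token"))  # ('my_scope', 'hf_token')
```

This only validates the shape of the reference; whether the secret exists and is readable is decided by Databricks Secrets.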
Work with code sources
The code_source block uploads local code so the training job can run it.
- root_path is the local directory to snapshot. By default, air packages the working tree as-is (including any uncommitted changes) as a plain tarball.
- To snapshot a pinned git version instead, add a git: block with a branch or commit. This requires root_path to be a git repository and enables version-aware snapshotting (caching, git archive).
- For large repositories, include_paths lets you snapshot a subset.
Minimal example
experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: python $CODE_SOURCE_PATH/train.py
On the remote machine, the code is placed at /databricks/code_source/<directory_name>, where <directory_name> is the final path component of root_path. $CODE_SOURCE_PATH is set to that absolute path — use it in your command rather than hard-coding the location.
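The mapping from root_path to the remote location can be sketched as follows. The helper remote_code_path is hypothetical and assumes root_path has already been resolved to an absolute path (a value such as . would need resolving first). In a real job, read $CODE_SOURCE_PATH instead of computing it.

```python
import posixpath

# Base directory for uploaded code, as stated above.
REMOTE_BASE = "/databricks/code_source"


def remote_code_path(root_path: str) -> str:
    """Return where a snapshot of root_path lands on the remote machine."""
    # Strip a trailing slash so /home/username/repo/ still yields "repo".
    directory_name = posixpath.basename(root_path.rstrip("/"))
    return posixpath.join(REMOTE_BASE, directory_name)


if __name__ == "__main__":
    print(remote_code_path("/home/username/repo"))  # /databricks/code_source/repo
```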
Git repositories: pin by branch or commit
For git repositories, add a git: block to pin the code version by branch or by commit SHA. branch and commit are mutually exclusive — specify exactly one within the block.
Pin to a branch (uses the local HEAD of that branch):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main # Uses local HEAD of main (no remote fetch)
command: train.sh
Pin to a commit SHA (exact reproducibility):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      commit: abc1234567 # Pins specific commit
command: train.sh
Key fields:
- root_path (Required): Local path to the root of your git repository.
- git.branch (Optional): Branch name. Uses local HEAD; no remote fetch. Mutually exclusive with git.commit.
- git.commit (Optional): Specific commit SHA. Mutually exclusive with git.branch.
- git.remote (Optional): Use the branch's remote HEAD instead of the local one. Set to true to auto-detect the remote, or to a remote name (for example, upstream) to fetch from a specific remote. Only valid with git.branch.
If you omit the git: block, air packages the working tree as a plain tarball, including any uncommitted changes — no extra field is required.
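The constraints on the git: block can be expressed as a small check. This is a sketch under the rules stated above, not the validation air itself performs. The helper check_git_block is hypothetical and takes the parsed git: block as a dict.

```python
def check_git_block(git: dict) -> list[str]:
    """Return a list of constraint violations for a parsed git: block."""
    errors = []
    has_branch = "branch" in git
    has_commit = "commit" in git
    # Both present or both absent breaks the "exactly one" rule.
    if has_branch == has_commit:
        errors.append("specify exactly one of git.branch or git.commit")
    # remote may be True (auto-detect) or a remote name; False means unset.
    if git.get("remote") and not has_branch:
        errors.append("git.remote is only valid with git.branch")
    return errors


if __name__ == "__main__":
    print(check_git_block({"branch": "main", "remote": "upstream"}))  # []
```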
Non-git directories
You can snapshot directories that aren't git repositories. Omit the git: block — it requires root_path to be a git repository. Without it, there is no version caching; a fresh tarball is uploaded for every run.
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/my_project
command: python $CODE_SOURCE_PATH/train.py
Folder filtering with include_paths
For large monorepos, snapshot only specific folders to reduce upload and download time and snapshot size:
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    include_paths:
      - research/models
      - research/common
      - research/configs
command: python $CODE_SOURCE_PATH/research/models/launch_training.py
Key points:
- The field is optional. If omitted, the entire repository is included by default.
- Paths must be relative to the repository root (no leading /).
- .. is not allowed; you cannot reference parent directories.
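The path rules above can be checked per entry. This is a minimal sketch: check_include_path is a hypothetical helper that mirrors the two stated rules and does not reproduce whatever validation air applies internally.

```python
def check_include_path(path: str) -> list[str]:
    """Return rule violations for one include_paths entry."""
    errors = []
    if path.startswith("/"):
        errors.append("must be relative to the repository root (no leading /)")
    # Compare whole path components so a name like "a..b" is not rejected.
    if ".." in path.split("/"):
        errors.append("must not reference parent directories (..)")
    return errors


if __name__ == "__main__":
    for entry in ("research/models", "/research/models", "../other"):
        print(entry, check_include_path(entry))
```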
Advanced features
Custom hyperparameters
Pass structured configuration to your training script via HYPERPARAMETERS_PATH:
experiment_name: parameterized-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
    learning_rate: 0.0001
Read them in your script:
import os
import yaml  # PyYAML

# HYPERPARAMETERS_PATH points to the file that holds the parameters block.
with open(os.environ['HYPERPARAMETERS_PATH']) as f:
    params = yaml.safe_load(f)

learning_rate = params['training']['learning_rate']
model_name = params['model']['name']
Job reliability
experiment_name: reliable-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
max_retries: 2
timeout_minutes: 90
If the workload fails, it is retried up to two times, for at most three attempts in total. Each attempt has 90 minutes to complete, so the worst-case wall-clock budget is 90 × 3 = 270 minutes.
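The same arithmetic as a small function. worst_case_minutes is a hypothetical helper, and it ignores any queueing or startup time between attempts, which this page does not describe.

```python
def worst_case_minutes(max_retries: int, timeout_minutes: int) -> int:
    """Upper bound on run time: the first attempt plus every retry."""
    attempts = 1 + max_retries
    return attempts * timeout_minutes


if __name__ == "__main__":
    print(worst_case_minutes(max_retries=2, timeout_minutes=90))  # 270
```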
Cost attribution
Attach a workload to an existing budget policy via usage_policy_id. For setup, see Attribute usage with serverless usage policies.
experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
usage_policy_id: abcd123-25b8-3e87-9a2c-f86eb19d101c
Reference
Core fields
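| Field | Type | Description | Example |
|---|---|---|---|
| experiment_name | string | Experiment name for MLflow. | my-training |
| environment.dependencies | string | Path to requirements.yaml. | requirements.yaml |
| compute.num_accelerators | int | Number of GPUs. | 8 |
| compute.accelerator_type | string | GPU type. | GPU_8xH100 |
| code_source | dict | Code source configuration. | |
| command | string | Bash commands to launch training. | torchrun --nproc_per_node=8 train.py |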
Supported GPU types
| accelerator_type | GPUs per node | Notes |
|---|---|---|
| GPU_1xA10 | 1 | Single A10, good for development and small workloads. |
| | 1 | Single H100. |
| GPU_8xH100 | 8 | Full H100 node, typical for distributed training. |
For accelerator capabilities and recommended use cases, see Hardware options.
Optional fields
Environment configuration
environment:
  dependencies: requirements.yaml
  env_variables:
    BATCH_SIZE: '32'
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
For the dependencies file format, see requirements.yaml reference.
Code source configuration
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo  # Required: local path to repo or directory
    git:                            # Optional (git repos only): pin to a branch or commit
      branch: main                  # Branch name; uses local HEAD unless 'remote' is set
      # commit: abc1234567          # Mutually exclusive with 'branch'
      remote: false                 # Optional: true to auto-detect remote HEAD, or a remote name string
    include_paths:                  # Optional: filter included paths
      - src/
      - configs/
Field constraints:
- git.branch and git.commit are mutually exclusive; specify exactly one within the git: block.
- git.remote requires git.branch (it has no effect with git.commit).
- If you omit the git: block, the working tree is packaged as a plain tarball, including any uncommitted changes.
Custom parameters
Passed to the workload via HYPERPARAMETERS_PATH:
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
MLflow run name
mlflow_run_name: 'experiment-001-baseline'
Path resolution
Relative paths in the workload YAML are resolved against the directory that contains the YAML file. Absolute paths are used as-is.
Folder structure:
/home/username/my-project/
├── train.yaml
├── requirements.yaml
└── scripts/
    └── train.py
YAML configuration:
experiment_name: my-training
environment:
  dependencies: requirements.yaml # Relative to train.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: . # Relative to train.yaml
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/scripts/train.py
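The resolution rule above can be sketched as follows. resolve_config_path is a hypothetical helper that assumes "relative to the workload YAML" means relative to the directory containing it. It does not collapse .. segments or check that the target exists.

```python
from pathlib import PurePosixPath


def resolve_config_path(yaml_file: str, value: str) -> str:
    """Resolve a path from the workload YAML against the YAML file's directory."""
    path = PurePosixPath(value)
    if path.is_absolute():
        # Absolute paths are used unchanged.
        return str(path)
    return str(PurePosixPath(yaml_file).parent / path)


if __name__ == "__main__":
    yaml_file = "/home/username/my-project/train.yaml"
    print(resolve_config_path(yaml_file, "requirements.yaml"))
    # /home/username/my-project/requirements.yaml
    print(resolve_config_path(yaml_file, "."))
    # /home/username/my-project
```

With the folder structure shown, root_path: . therefore refers to /home/username/my-project, so scripts/train.py is reachable under $CODE_SOURCE_PATH on the remote machine.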