Workload YAML reference
The AI Runtime CLI is in Beta.
This page is the reference for workload YAML configurations passed to air run --file.
The ground truth for YAML configuration is the in-CLI help. Run air -h config for the top-level view and air -h config.<section> (for example, air -h config.environment) for per-section detail.
Minimal configuration
experiment_name: my-training
environment:
  dependencies:
    - mlflow
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
Submit with:
air run --file train.yaml -p profile
Core concepts
Core fields
Most training configurations include five components:
- experiment_name: Required. Creates or appends to an MLflow experiment.
- environment: Optional. Python dependencies and base environment.
- compute: Required. GPU resources (type and count).
- command: Required. The bash command or commands used to launch training.
- code_source: Optional. Path to your training code, made available remotely.
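As a quick illustration of the required/optional split above, the following Python sketch checks a configuration (written as a plain dict) for missing required fields. The function name missing_required_fields is hypothetical and this is not how the air CLI validates configurations; the CLI's own help remains the ground truth.

```python
# Illustrative only: the air CLI performs its own validation.
# Required top-level fields, as listed on this page.
REQUIRED_FIELDS = ("experiment_name", "compute", "command")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required top-level fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


# The minimal configuration from this page, written as a Python dict.
minimal = {
    "experiment_name": "my-training",
    "environment": {"dependencies": ["mlflow"]},
    "compute": {"num_accelerators": 1, "accelerator_type": "GPU_1xA10"},
    "command": 'echo "Hello World"',
}

print(missing_required_fields(minimal))
print(missing_required_fields({"command": "echo hi"}))
```

The first call reports nothing missing; the second reports experiment_name and compute.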
Your first training job
experiment_name: simple-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py
In this configuration:
- experiment_name creates an MLflow experiment named simple-training (or appends a new run if it already exists).
- environment installs the listed Python dependencies (here, torch and transformers).
- compute allocates one H100 node (8 H100 GPUs).
- code_source uploads the folder repo to the node, available at $CODE_SOURCE_PATH.
- command runs train.py via torchrun across the 8 H100 GPUs. The file lives at /home/username/repo/train.py locally.
Common use cases
Add environment variables
experiment_name: training-with-env
environment:
  dependencies:
    - torch
    - transformers
  env_variables:
    BATCH_SIZE: '32'
    LEARNING_RATE: '0.001'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
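The values under env_variables are quoted strings, so a training script has to convert them itself. The sketch below assumes the entries are exported into the process environment of the command (which the field name suggests but this page does not spell out). The helper name read_training_settings and the fallback defaults are hypothetical.

```python
import os
from typing import Mapping


def read_training_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read BATCH_SIZE and LEARNING_RATE and convert them from strings.

    The fallback values are hypothetical; they only keep the sketch
    runnable when the variables are not set.
    """
    return {
        "batch_size": int(env.get("BATCH_SIZE", "32")),
        "learning_rate": float(env.get("LEARNING_RATE", "0.001")),
    }


settings = read_training_settings()
print(settings)
```

Passing the mapping as a parameter keeps the conversion easy to test without touching the real environment.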
Use secrets (API keys, tokens)
experiment_name: training-with-secrets
environment:
  dependencies:
    - torch
    - transformers
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
    WANDB_API_KEY: 'my_scope/wandb'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
Secrets use the format scope/key and must be configured in Databricks Secrets. See Secret management for setup.
When sharing a YAML template, other users must create their own secrets or have access to the referenced secret.
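To catch malformed references before submitting, a secret value can be checked against the scope/key format locally. This is a minimal sketch: parse_secret_ref is a hypothetical helper, and rejecting a second slash is an assumption rather than a documented rule.

```python
def parse_secret_ref(ref: str) -> tuple[str, str]:
    """Split a 'scope/key' reference into (scope, key).

    Assumption: exactly one '/' separates a non-empty scope and key.
    """
    scope, sep, key = ref.partition("/")
    if not sep or not scope or not key or "/" in key:
        raise ValueError(f"expected 'scope/key', got {ref!r}")
    return scope, key


secrets = {
    "HF_TOKEN": "my_scope/hf_token",
    "WANDB_API_KEY": "my_scope/wandb",
}
for name, ref in secrets.items():
    scope, key = parse_secret_ref(ref)
    print(f"{name}: scope={scope}, key={key}")
```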
Python dependencies
List your workload's Python dependencies as an inline list under environment.dependencies:
environment:
  version: '4'
  dependencies:
    - torch
    - transformers
environment.version selects the serverless GPU environment version. It is optional and defaults to "4".
Dependency format
The dependency list follows the Databricks Base Environment Specification. Each entry is a pip-style package spec (for example, my-library==6.1). The list also accepts the following entries:
- Requirements files: a reference to an existing requirements.txt using -r, for example -r '/Workspace/Shared/requirements.txt'. Environment variables such as $HOME are expanded.
- Wheels: an absolute path to a .whl file, for example /Workspace/Shared/path/to/simplejson-3.19.3-py3-none-any.whl.
- Index URLs: an index URL, for example --index-url https://pypi.org/simple.
environment:
  version: '4'
  dependencies:
    - --index-url https://pypi.org/simple
    - -r '/Workspace/Shared/requirements.txt'
    - my-library==6.1
    - /Workspace/Shared/path/to/simplejson-3.19.3-py3-none-any.whl
Supported install flags
Dependencies are installed with uv. The following pip-style flags are supported as list entries:
- Applied to the whole install: --index-url, --extra-index-url, and --find-links (-f) set or extend the package indexes.
- Applied to the dependency that follows them: --no-deps, --no-build-isolation, --no-cache-dir, and --force-reinstall. Place the flag on its own line (or before the spec), followed by the dependency it applies to.
For example, to install flash-attn against the already-installed torch (no build isolation) and without resolving its own dependencies:
environment:
  version: '4'
  dependencies:
    - torch
    - --no-build-isolation
    - --no-deps
    - flash-attn
--trusted-host is not supported. Because uv configures trust per index URL, use --index-url or --extra-index-url instead.
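The flag scoping rules in this section can be modeled in a few lines of Python. This sketch only illustrates how the entries group together under those rules; group_dependencies is a hypothetical helper and not the installer's actual implementation.

```python
GLOBAL_FLAGS = ("--index-url", "--extra-index-url", "--find-links", "-f")
PER_DEPENDENCY_FLAGS = (
    "--no-deps",
    "--no-build-isolation",
    "--no-cache-dir",
    "--force-reinstall",
)


def group_dependencies(entries: list[str]) -> tuple[list[str], list[tuple[str, list[str]]]]:
    """Return (whole-install options, [(dependency, flags applied to it)])."""
    global_options: list[str] = []
    packages: list[tuple[str, list[str]]] = []
    pending: list[str] = []  # per-dependency flags waiting for the next dependency
    for entry in entries:
        tokens = entry.split()
        if not tokens:
            continue
        if tokens[0] in GLOBAL_FLAGS:
            global_options.append(entry)
            continue
        # Peel off flags written on their own line or before the spec.
        while tokens and tokens[0] in PER_DEPENDENCY_FLAGS:
            pending.append(tokens.pop(0))
        if tokens:
            packages.append((" ".join(tokens), pending))
            pending = []
    # Flags with no dependency after them are ignored in this sketch.
    return global_options, packages


options, packages = group_dependencies(
    ["--index-url https://pypi.org/simple", "torch", "--no-build-isolation", "--no-deps", "flash-attn"]
)
print(options)
print(packages)
```

Here torch receives no flags, while flash-attn is paired with both --no-build-isolation and --no-deps.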
Work with code sources
The code_source block uploads local code so the training job can run it.
- root_path is the local directory to snapshot. By default, air packages the working tree as-is (including any uncommitted changes) as a plain tarball.
- To snapshot a pinned git version instead, add a git: block with a branch or commit. This requires root_path to be a git repository and enables version-aware snapshotting (caching, git archive).
- For large repositories, include_paths lets you snapshot a subset.
Minimal example
experiment_name: simple-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: python $CODE_SOURCE_PATH/train.py
On the remote machine, the code is placed at /databricks/code_source/<directory_name>, where <directory_name> is the final path component of root_path. $CODE_SOURCE_PATH is set to that absolute path — use it in your command rather than hard-coding the location.
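The mapping from root_path to the remote location can be reproduced locally, for example to reason about paths in a command. This is a small sketch assuming root_path is already an absolute path; remote_code_path is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Remote parent directory, as stated on this page.
REMOTE_CODE_ROOT = PurePosixPath("/databricks/code_source")


def remote_code_path(root_path: str) -> str:
    """Return the remote directory for a given absolute local root_path."""
    directory_name = PurePosixPath(root_path).name  # final path component
    return str(REMOTE_CODE_ROOT / directory_name)


print(remote_code_path("/home/username/repo"))
```

This prints /databricks/code_source/repo, which is the value $CODE_SOURCE_PATH would hold for that root_path.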
Git repositories: pin by branch or commit
For git repositories, add a git: block to pin the code version by branch or by commit SHA. branch and commit are mutually exclusive — specify exactly one within the block.
Pin to a branch (uses the local HEAD of that branch):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main  # Uses local HEAD of main (no remote fetch)
command: train.sh
Pin to a commit SHA (exact reproducibility):
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      commit: abc1234567  # Pins specific commit
command: train.sh
Key fields:
- root_path (Required): Local path to the root of your git repository.
- git.branch (Optional): Branch name. Uses local HEAD; no remote fetch. Mutually exclusive with git.commit.
- git.commit (Optional): Specific commit SHA. Mutually exclusive with git.branch.
- git.remote (Optional): Use the branch's remote HEAD instead of the local one. Set to true to auto-detect the remote, or to a remote name (for example, upstream) to fetch from a specific remote. Only valid with git.branch.
If you omit the git: block, air packages the working tree as a plain tarball, including any uncommitted changes — no extra field is required.
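The constraints on the git: block can be expressed as a small local check. This is a sketch of the rules as stated in this section, not the CLI's validation code; git_block_errors is a hypothetical helper that takes the git: block as a dict (or None when the block is omitted).

```python
from typing import Optional


def git_block_errors(git: Optional[dict]) -> list[str]:
    """Return rule violations for a git: block, or an empty list if it is valid."""
    if git is None:
        return []  # No git: block, so the working tree is packaged as a plain tarball.
    errors = []
    has_branch = "branch" in git
    has_commit = "commit" in git
    if has_branch == has_commit:  # both present, or neither
        errors.append("specify exactly one of git.branch or git.commit")
    if git.get("remote") and not has_branch:
        errors.append("git.remote is only valid with git.branch")
    return errors


print(git_block_errors({"branch": "main", "remote": "upstream"}))
print(git_block_errors({"branch": "main", "commit": "abc1234567"}))
print(git_block_errors({"commit": "abc1234567", "remote": True}))
```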
Non-git directories
To snapshot a directory that isn't a git repository, omit the git: block, which requires root_path to be a git repository. Without a git: block there is no version caching, so a fresh tarball is uploaded for every run.
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/my_project
command: $CODE_SOURCE_PATH/train.py
Folder filtering with include_paths
For large monorepos, snapshot only specific folders to reduce upload and download time and snapshot size:
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    include_paths:
      - research/models
      - research/common
      - research/configs
command: python $CODE_SOURCE_PATH/research/models/launch_training.py
Key points:
- The field is optional. If omitted, the entire repository is included by default.
- Paths must be relative to the repository root (no leading /).
- .. is not allowed; you cannot reference parent directories.
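The key points above translate into a simple pre-submit check. This is an illustrative sketch of those rules only; include_path_errors is a hypothetical helper, and the CLI may report problems differently.

```python
from pathlib import PurePosixPath


def include_path_errors(paths: list[str]) -> list[str]:
    """Return one message per include_paths entry that breaks the rules above."""
    errors = []
    for path in paths:
        if path.startswith("/"):
            errors.append(f"{path}: must be relative to the repository root")
        elif ".." in PurePosixPath(path).parts:
            errors.append(f"{path}: '..' is not allowed")
    return errors


print(include_path_errors(["research/models", "research/common", "research/configs"]))
print(include_path_errors(["/research/models", "../other-repo"]))
```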
Advanced features
Custom hyperparameters
Pass structured configuration to your training script via HYPERPARAMETERS_PATH:
experiment_name: parameterized-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
    learning_rate: 0.0001
Read them in your script:
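import os
import yaml  # provided by the PyYAML package

# HYPERPARAMETERS_PATH points to a file containing the 'parameters' block.
with open(os.environ['HYPERPARAMETERS_PATH']) as f:
    params = yaml.safe_load(f)

# Nested keys mirror the structure under 'parameters'.
learning_rate = params['training']['learning_rate']
model_name = params['model']['name']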
Job reliability
experiment_name: reliable-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
max_retries: 2
timeout_minutes: 90
If the workload fails, it is retried up to two times. Each attempt has 90 minutes to complete, so the worst-case wall-clock budget is 90 × 3 = 270 minutes (one initial attempt plus two retries).
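The budget arithmetic generalizes to any combination of the two fields. A minimal sketch, with worst_case_minutes as a hypothetical helper name:

```python
def worst_case_minutes(max_retries: int, timeout_minutes: int) -> int:
    """Total time if every attempt runs until its timeout."""
    attempts = 1 + max_retries  # the initial attempt plus each retry
    return attempts * timeout_minutes


print(worst_case_minutes(max_retries=2, timeout_minutes=90))  # 270
```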
Cost attribution
Attach a workload to an existing budget policy via usage_policy_id. For setup, see Attribute usage with serverless usage policies.
experiment_name: my-training
environment:
  dependencies:
    - mlflow
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
usage_policy_id: abcd123-25b8-3e87-9a2c-f86eb19d101c
Reference
Core fields
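| Field | Type | Description | Example |
|---|---|---|---|
| experiment_name | string | Experiment name for MLflow. | my-training |
| environment.dependencies | list | Inline list of pip dependency specs. | - torch |
| environment.version | string | Serverless GPU environment version. Optional. Defaults to "4". | '4' |
| compute.num_accelerators | int | Number of GPUs. | 8 |
| compute.accelerator_type | string | GPU type. | GPU_8xH100 |
| code_source | dict | Code source configuration. | |
| command | string | Bash commands to launch training. | echo "Hello World" |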
Supported GPU types
| accelerator_type | GPUs per node | Notes |
|---|---|---|
| GPU_1xA10 | 1 | Single A10, good for development and small workloads. |
| | 1 | Single H100. |
| GPU_8xH100 | 8 | Full H100 node, typical for distributed training. |
For accelerator capabilities and recommended use cases, see Hardware options.
Optional fields
Environment configuration
environment:
  version: '4'
  dependencies:
    - torch
    - transformers
  env_variables:
    BATCH_SIZE: '32'
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
For the dependency format, supported install flags, and environment.version, see Python dependencies.
Code source configuration
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo  # REQUIRED: local path to repo or directory
    git:                            # Optional (git repos only): pin to a branch or commit
      branch: main                  # Branch name; uses local HEAD unless 'remote' is set
      # commit: abc1234567          # Mutually exclusive with 'branch'
      remote: false                 # Optional: true to auto-detect remote HEAD, or a remote name string
    include_paths:                  # Optional: filter included paths
      - src/
      - configs/
Field constraints:
- git.branch and git.commit are mutually exclusive; specify exactly one within the git: block.
- git.remote requires git.branch (it has no effect with git.commit).
- If you omit the git: block, the working tree is packaged as a plain tarball, including any uncommitted changes.
Custom parameters
Passed to the workload via HYPERPARAMETERS_PATH:
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
MLflow run name
mlflow_run_name: 'experiment-001-baseline'
Path resolution
Relative paths in the workload YAML are resolved against the directory that contains the YAML file. Absolute paths are used as-is.
Folder structure:
/home/username/my-project/
├── train.yaml
└── scripts/
    └── train.py
YAML configuration:
experiment_name: my-training
environment:
  dependencies:
    - torch
    - transformers
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: .  # Relative to train.yaml
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/scripts/train.py
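The resolution rule for this example can be sketched in Python. It assumes POSIX-style paths and that relative values are joined to the directory holding the YAML file, as the example implies; resolve_config_path is a hypothetical helper, not part of the CLI.

```python
from pathlib import PurePosixPath


def resolve_config_path(yaml_file: str, value: str) -> str:
    """Resolve a path from the workload YAML against the YAML file's directory."""
    candidate = PurePosixPath(value)
    if candidate.is_absolute():
        return str(candidate)  # absolute paths are used as-is
    return str(PurePosixPath(yaml_file).parent / candidate)


yaml_file = "/home/username/my-project/train.yaml"
print(resolve_config_path(yaml_file, "."))
print(resolve_config_path(yaml_file, "/home/username/repo"))
```

With the folder structure above, root_path: . resolves to /home/username/my-project, while an absolute root_path is left unchanged.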