Workload YAML reference

Beta

The AI Runtime CLI is in Beta.

This page is the reference for workload YAML configurations passed to air run --file.

note

The ground truth for YAML configuration is the in-CLI help. Run air -h config for the top-level view and air -h config.<section> (for example, air -h config.environment) for per-section detail.

Minimal configuration

YAML
experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"

Submit with:

Bash
air run --file train.yaml -p profile

Core concepts

Core fields

Most training configurations include five components:

  1. experiment_name: Required. Creates or appends to an MLflow experiment.
  2. environment: Optional. Python dependencies and base environment.
  3. compute: Required. GPU resources (type and count).
  4. command: Required. The bash command or commands used to launch training.
  5. code_source: Optional. Path to your training code, made available remotely.
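
As a quick illustration of the required/optional split above, the following sketch checks a configuration that has already been loaded into a Python dict. The helper name `missing_required_fields` is hypothetical and is not part of the air CLI; it only mirrors the list above and says nothing about how the CLI validates input.

```python
# Required top-level keys, taken from the list above. Optional keys are not checked.
REQUIRED_FIELDS = ("experiment_name", "compute", "command")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required top-level keys that are absent or empty."""
    return [key for key in REQUIRED_FIELDS if not config.get(key)]


# Example: a config that forgot to set `command`.
config = {
    "experiment_name": "my-training",
    "compute": {"num_accelerators": 1, "accelerator_type": "GPU_1xA10"},
}
print(missing_required_fields(config))  # ['command']
```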

Your first training job

YAML
experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/train.py

In this configuration:

  • experiment_name creates an MLflow experiment named simple-training (or appends a new run if it already exists).
  • environment installs dependencies from requirements.yaml.
  • compute allocates one H100 node (8 H100 GPUs).
  • code_source uploads the folder repo to the node, available at $CODE_SOURCE_PATH.
  • command runs train.py via torchrun across the 8 H100 GPUs. The file lives at /home/username/repo/train.py locally.

Common use cases

Add environment variables

YAML
experiment_name: training-with-env
environment:
  dependencies: requirements.yaml
  env_variables:
    BATCH_SIZE: '32'
    LEARNING_RATE: '0.001'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
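
The values under env_variables are quoted strings, so a training script would typically convert them before use. The sketch below assumes that each entry reaches the training process as an ordinary environment variable, which the field name suggests but this page does not state. The function name and the fallback defaults are illustrative only.

```python
import os


def read_training_env(environ=os.environ) -> tuple[int, float]:
    """Parse BATCH_SIZE and LEARNING_RATE from string-valued environment variables."""
    # Fallbacks are illustrative defaults, not values defined by the CLI.
    batch_size = int(environ.get("BATCH_SIZE", "32"))
    learning_rate = float(environ.get("LEARNING_RATE", "0.001"))
    return batch_size, learning_rate


# A plain dict stands in for os.environ so the example runs anywhere.
batch_size, learning_rate = read_training_env(
    {"BATCH_SIZE": "32", "LEARNING_RATE": "0.001"}
)
print(batch_size, learning_rate)  # 32 0.001
```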

Use secrets (API keys, tokens)

YAML
experiment_name: training-with-secrets
environment:
  dependencies: requirements.yaml
  secrets:
    HF_TOKEN: 'my_scope/hf_token'
    WANDB_API_KEY: 'my_scope/wandb'
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py

Secrets use the format scope/key and must be configured in Databricks Secrets. See Secret management for setup.

When sharing a YAML template, other users must create their own secrets or have access to the referenced secret.
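
To catch a malformed reference before submitting, a template author could check the scope/key shape locally. This is a minimal sketch: `split_secret_reference` is a hypothetical helper, it splits on the first slash only, and it does not check that the secret exists in Databricks Secrets.

```python
def split_secret_reference(reference: str) -> tuple[str, str]:
    """Split a 'scope/key' secret reference into (scope, key)."""
    scope, separator, key = reference.partition("/")
    # Both parts must be non-empty and a slash must be present.
    if not (separator and scope and key):
        raise ValueError(f"expected 'scope/key', got {reference!r}")
    return scope, key


print(split_secret_reference("my_scope/hf_token"))  # ('my_scope', 'hf_token')
```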

Work with code sources

The code_source block uploads local code so the training job can run it.

  • root_path is the local directory to snapshot. By default, air packages the working tree as-is (including any uncommitted changes) as a plain tarball.
  • To snapshot a pinned git version instead, add a git: block with a branch or commit. This requires root_path to be a git repository and enables version-aware snapshotting (caching, git archive).
  • For large repositories, include_paths lets you snapshot a subset.

Minimal example

YAML
experiment_name: simple-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
command: python $CODE_SOURCE_PATH/train.py

On the remote machine, the code is placed at /databricks/code_source/<directory_name>, where <directory_name> is the final path component of root_path. $CODE_SOURCE_PATH is set to that absolute path — use it in your command rather than hard-coding the location.
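
The mapping from root_path to the remote location can be sketched in a few lines. The helper `remote_code_path` is hypothetical and assumes root_path has already been resolved to an absolute path. In a real job, read $CODE_SOURCE_PATH rather than recomputing it.

```python
from pathlib import PurePosixPath

# Remote parent directory, as described in the paragraph above.
REMOTE_CODE_ROOT = PurePosixPath("/databricks/code_source")


def remote_code_path(root_path: str) -> str:
    """Return the remote location for a given absolute local root_path."""
    directory_name = PurePosixPath(root_path).name  # final path component
    return str(REMOTE_CODE_ROOT / directory_name)


print(remote_code_path("/home/username/repo"))  # /databricks/code_source/repo
```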

Git repositories: pin by branch or commit

For git repositories, add a git: block to pin the code version by branch or by commit SHA. branch and commit are mutually exclusive — specify exactly one within the block.

Pin to a branch (uses the local HEAD of that branch):

YAML
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main # Uses local HEAD of main (no remote fetch)
command: train.sh

Pin to a commit SHA (exact reproducibility):

YAML
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      commit: abc1234567 # Pins specific commit
command: train.sh

Key fields:

  • root_path (Required) — Local path to the root of your git repository.
  • git.branch (Optional) — Branch name. Uses local HEAD; no remote fetch. Mutually exclusive with git.commit.
  • git.commit (Optional) — Specific commit SHA. Mutually exclusive with git.branch.
  • git.remote (Optional) — Use the branch's remote HEAD instead of the local one. Set to true to auto-detect the remote, or to a remote name (for example, upstream) to fetch from a specific remote. Only valid with git.branch.

If you omit the git: block, air packages the working tree as a plain tarball, including any uncommitted changes — no extra field is required.
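
The constraints on the git: block can be expressed as a small check over a dict holding that block's keys. This is an illustrative sketch: `check_git_block` is a hypothetical helper, and the messages are not the CLI's own error text.

```python
def check_git_block(git: dict) -> list[str]:
    """Return constraint violations for a git: block (empty list if valid)."""
    problems = []
    has_branch = "branch" in git
    has_commit = "commit" in git
    # Exactly one of branch or commit must be present.
    if has_branch == has_commit:
        problems.append("specify exactly one of git.branch or git.commit")
    # A truthy remote (true or a remote name) only makes sense with a branch.
    if git.get("remote") and not has_branch:
        problems.append("git.remote is only valid with git.branch")
    return problems


print(check_git_block({"branch": "main", "remote": "upstream"}))  # []
print(check_git_block({"commit": "abc1234567", "remote": True}))
# ['git.remote is only valid with git.branch']
```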

Non-git directories

You can snapshot directories that aren't git repositories. Omit the git: block — it requires root_path to be a git repository. Without it, there is no version caching; a fresh tarball is uploaded for every run.

YAML
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/my_project
command: $CODE_SOURCE_PATH/train.py

Folder filtering with include_paths

For large monorepos, snapshot only specific folders to reduce upload and download time and snapshot size:

YAML
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    include_paths:
      - research/models
      - research/common
      - research/configs
command: python $CODE_SOURCE_PATH/research/models/launch_training.py

Key points:

  • The field is optional. If omitted, the entire repository is included by default.
  • Paths must be relative to the repository root (no leading /).
  • .. is not allowed; you cannot reference parent directories.
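
The two path rules above are easy to check before submitting. The following is a sketch under the assumption that entries use forward slashes. `check_include_path` is a hypothetical helper and does not reproduce the CLI's own validation.

```python
from pathlib import PurePosixPath
from typing import Optional


def check_include_path(path: str) -> Optional[str]:
    """Return an error message if the path breaks the rules above, else None."""
    if path.startswith("/"):
        return "must be relative to the repository root (no leading /)"
    # Reject '..' anywhere in the path, not only at the start.
    if ".." in PurePosixPath(path).parts:
        return "'..' is not allowed"
    return None


for candidate in ["research/models", "/research/models", "../shared"]:
    print(candidate, "->", check_include_path(candidate) or "ok")
```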

Advanced features

Custom hyperparameters

Pass structured configuration to your training script via HYPERPARAMETERS_PATH:

YAML
experiment_name: parameterized-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32
    learning_rate: 0.0001

Read them in your script:

Python
import os
import yaml

with open(os.environ['HYPERPARAMETERS_PATH']) as f:
    params = yaml.safe_load(f)

learning_rate = params['training']['learning_rate']
model_name = params['model']['name']

Job reliability

YAML
experiment_name: reliable-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo
    git:
      branch: main
command: torchrun --nproc_per_node=8 train.py
max_retries: 2
timeout_minutes: 90

If the workload fails, it is retried up to two times, for a maximum of three attempts. Each attempt has 90 minutes to complete, so the total wall-clock budget is 90 × 3 = 270 minutes.
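
The budget arithmetic generalizes to any pair of settings: one initial attempt plus max_retries retries, each capped at timeout_minutes. The helper below is a hypothetical sketch and ignores any time spent between attempts, which this page does not describe.

```python
def worst_case_minutes(timeout_minutes: int, max_retries: int) -> int:
    """Upper bound on run time: one initial attempt plus max_retries retries."""
    attempts = 1 + max_retries
    return timeout_minutes * attempts


print(worst_case_minutes(timeout_minutes=90, max_retries=2))  # 270
```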

Cost attribution

Attach a workload to an existing budget policy via usage_policy_id. For setup, see Attribute usage with serverless usage policies.

YAML
experiment_name: my-training
environment:
  dependencies: requirements.yaml
compute:
  num_accelerators: 1
  accelerator_type: GPU_1xA10
command: echo "Hello World"
usage_policy_id: abcd123-25b8-3e87-9a2c-f86eb19d101c

Reference

Core fields

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| experiment_name | string | Experiment name for MLflow. | "my-training-job" |
| environment.dependencies | string | Path to requirements.yaml. | "requirements.yaml" |
| compute.num_accelerators | int | Number of GPUs. | 1, 4, 8 |
| compute.accelerator_type | string | GPU type. | "GPU_1xA10", "GPU_8xH100" |
| code_source | dict | Code source configuration. | See Work with code sources. |
| command | string | Bash commands to launch training. | torchrun --nproc_per_node=8 train.py |

Supported GPU types

| accelerator_type | GPUs per node | Notes |
| --- | --- | --- |
| GPU_1xA10 | 1 | Single A10; good for development and small workloads. |
| GPU_1xH100 | 1 | Single H100. |
| GPU_8xH100 | 8 | Full H100 node; typical for distributed training. |

For accelerator capabilities and recommended use cases, see Hardware options.

Optional fields

Environment configuration

YAML
environment:
  dependencies: requirements.yaml
  env_variables:
    BATCH_SIZE: '32'
  secrets:
    HF_TOKEN: 'my_scope/hf_token'

For the dependencies file format, see requirements.yaml reference.

Code source configuration

YAML
code_source:
  type: snapshot
  snapshot:
    root_path: /home/username/repo # Required: local path to repo or directory
    git: # Optional (git repos only): pin to a branch or commit
      branch: main # Branch name; uses local HEAD unless 'remote' is set
      # commit: abc1234567 # Mutually exclusive with 'branch'
      remote: false # Optional: true to auto-detect remote HEAD, or a remote name string
    include_paths: # Optional: filter included paths
      - src/
      - configs/

Field constraints:

  • git.branch and git.commit are mutually exclusive — specify exactly one within the git: block.
  • git.remote requires git.branch (it has no effect with git.commit).
  • If you omit the git: block, the working tree is packaged as a plain tarball, including any uncommitted changes.

Custom parameters

Passed to the workload via HYPERPARAMETERS_PATH:

YAML
parameters:
  model:
    name: 'gpt2'
    hidden_size: 768
  training:
    batch_size: 32

MLflow run name

YAML
mlflow_run_name: 'experiment-001-baseline'

Path resolution

Relative paths in the workload YAML are resolved against the directory that contains the YAML file. Absolute paths are used as-is.

Folder structure:

/home/username/my-project/
├── train.yaml
├── requirements.yaml
└── scripts/
    └── train.py

YAML configuration:

YAML
experiment_name: my-training
environment:
  dependencies: requirements.yaml # Relative to train.yaml
compute:
  num_accelerators: 8
  accelerator_type: GPU_8xH100
code_source:
  type: snapshot
  snapshot:
    root_path: . # Relative to train.yaml
    git:
      branch: main
command: torchrun --nproc_per_node=8 $CODE_SOURCE_PATH/scripts/train.py
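
The resolution rule for this folder structure can be sketched with pathlib. `resolve_config_path` is a hypothetical helper that only illustrates the rule stated in this section; it is not the CLI's implementation.

```python
from pathlib import PurePosixPath


def resolve_config_path(yaml_file: str, value: str) -> PurePosixPath:
    """Resolve a path from the workload YAML against the YAML file's directory."""
    candidate = PurePosixPath(value)
    if candidate.is_absolute():
        return candidate  # absolute paths are used unchanged
    return PurePosixPath(yaml_file).parent / candidate


yaml_file = "/home/username/my-project/train.yaml"
print(resolve_config_path(yaml_file, "requirements.yaml"))
# /home/username/my-project/requirements.yaml
print(resolve_config_path(yaml_file, "."))
# /home/username/my-project
```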