Guides for Serverless GPU Compute
Migrating classic workloads to serverless
If you are moving an existing deep learning workload from a classic Databricks cluster (with Databricks Runtime ML) to Serverless GPU Compute, follow these steps:
- Replace cluster-dependent code. Remove any references to Spark-based distributed training (for example, `TorchDistributor`) and replace them with the `@distributed` decorator from `serverless_gpu`.
- Update data loading. Replace direct DBFS paths with Unity Catalog volumes paths (`/Volumes/...`). Replace local Spark DataFrame operations with Spark Connect.
- Reinstall dependencies. Do not rely on Databricks Runtime ML pre-installed libraries. Add explicit `%pip install` commands for all required packages.
- Update checkpoint paths. Move checkpoints from DBFS or local storage to Unity Catalog volumes (`/Volumes/<catalog>/<schema>/<volume>/...`).
- Update MLflow configuration. Ensure experiment names use absolute paths and configure run names for repeatability.
- Test interactively first. Validate your workload in an interactive notebook before scheduling it as a job.
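The first migration step can be sketched as follows. This is a minimal illustration, not a definitive implementation: the decorator name comes from the `serverless_gpu` package mentioned above, but the parameter names shown (`gpus`, `remote`) are assumptions, so check the Serverless GPU API reference for the exact signature. The stub fallback only exists so the sketch is readable outside Databricks.

```python
# Sketch: replacing Spark's TorchDistributor with the serverless_gpu
# @distributed decorator. Parameter names are illustrative assumptions.
try:
    from serverless_gpu import distributed  # available on Serverless GPU compute
except ImportError:
    # Fallback stub so this sketch can run outside a Databricks environment
    def distributed(**kwargs):
        def wrap(fn):
            return fn
        return wrap

@distributed(gpus=8, remote=True)  # illustrative arguments; verify against the API docs
def train():
    # Your existing single-node training loop goes here, unchanged;
    # the decorator handles distribution across serverless GPUs.
    return "training complete"
```

Because the training function itself stays a plain Python callable, you can validate it interactively in a notebook before scheduling it as a job, as recommended above.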
Example notebooks
The following categories of example notebooks are available to help you get started:
- Fine-tuning large language models, including parameter-efficient methods (LoRA, QLoRA)
- Object detection, image classification, and other computer vision tasks
- Building recommendation systems using modern deep learning approaches like two-tower models
- Traditional ML tasks, including XGBoost model training and time series forecasting
- Scaling training across multiple GPUs using the Serverless GPU API
For the full list, see Serverless GPU compute example notebooks.
Troubleshooting
Genie Code can help diagnose and suggest fixes for library installation errors. See Use Genie Code to debug compute environment errors.
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
This error typically arises when a dependent package was compiled against one NumPy version but a different NumPy version is installed in the runtime environment. The incompatibility is caused by changes in NumPy's C API and is particularly common across the NumPy 1.x to 2.x boundary. It usually means that a package installed in the notebook changed the NumPy version.
Recommended solution:
Check the NumPy version in the runtime and ensure it is compatible with your packages. See the Serverless GPU Compute release notes for environment 4 and environment 3 for information on preinstalled Python libraries. If you have a dependency on a different version of NumPy, add that dependency to your compute environment.
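A quick way to check which side of the 1.x/2.x boundary your environment is on, using only the standard library, is a minimal sketch like the following (the helper name is ours, not part of any Databricks API):

```python
from importlib import metadata

def installed_numpy_major():
    """Return NumPy's installed major version, or None if NumPy is absent."""
    try:
        return int(metadata.version("numpy").split(".")[0])
    except metadata.PackageNotFoundError:
        return None

# Packages compiled against NumPy 1.x can break under 2.x; if a dependency
# requires 1.x, pin it explicitly, e.g. %pip install "numpy<2"
print(installed_numpy_major())
```

Comparing this value against the version your dependent packages were built for tells you whether you need to pin NumPy in your compute environment.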
PyTorch cannot find libcudnn when installing torch
When you install a different version of torch, you might see the error: ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory. This happens because torch searches for the cuDNN shared library only in its local package path, and a partial install can leave that library missing or mismatched.
Recommended solution:
Reinstall the dependencies by adding --force-reinstall when installing torch:
```
%pip install torch --force-reinstall
```
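After reinstalling, you can confirm that torch can locate cuDNN with a quick check. This sketch assumes a GPU-enabled torch build; `torch.backends.cudnn.version()` returns None when the library cannot be loaded.

```python
import importlib.util

# Verify cuDNN is loadable after reinstalling torch
if importlib.util.find_spec("torch") is not None:
    import torch
    print(torch.backends.cudnn.version())  # None means cuDNN was not found
else:
    print("torch is not installed in this environment")
```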