Guides for Serverless GPU Compute
Migrating classic workloads to serverless
If you are moving an existing deep learning workload from a classic Databricks cluster (with Databricks Runtime ML) to Serverless GPU Compute, follow these steps:
- Replace cluster-dependent code. Remove any references to Spark-based distributed training (for example, `TorchDistributor`) and replace them with the `@distributed` decorator from `serverless_gpu`.
- Update data loading. Replace direct DBFS paths with Unity Catalog volumes paths (`/Volumes/...`). Replace local Spark DataFrame operations with Spark Connect.
- Reinstall dependencies. Do not rely on Databricks Runtime ML pre-installed libraries. Add explicit `%pip install` commands for all required packages.
- Update checkpoint paths. Move checkpoints from DBFS or local storage to Unity Catalog volumes (`/Volumes/<catalog>/<schema>/<volume>/...`).
- Update MLflow configuration. Ensure experiment names use absolute paths and configure run names for repeatability.
- Test interactively first. Validate your workload in an interactive notebook before scheduling it as a job.
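The first migration step can be sketched as follows. This is a minimal illustration, not a definitive implementation: the decorator name comes from the `serverless_gpu` package mentioned above, but the parameter names shown (`gpus`, `remote`) are assumptions, so check the Serverless GPU API reference for the exact signature. The stub fallback only exists so the sketch is readable outside Databricks.

```python
# Sketch: replacing Spark's TorchDistributor with the serverless_gpu
# @distributed decorator. Parameter names are illustrative assumptions.
try:
    from serverless_gpu import distributed  # available on Serverless GPU compute
except ImportError:
    # Fallback stub so this sketch can run outside a Databricks environment
    def distributed(**kwargs):
        def wrap(fn):
            return fn
        return wrap

@distributed(gpus=8, remote=True)  # illustrative arguments; verify against the API docs
def train():
    # Your existing single-node training loop goes here, unchanged;
    # the decorator handles distribution across serverless GPUs.
    return "training complete"
```

Because the training function itself stays a plain Python callable, you can validate it interactively in a notebook before scheduling it as a job, as recommended above.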
Example notebooks
The following categories of example notebooks are available to help you get started:
- Fine-tuning large language models, including parameter-efficient methods (LoRA, QLoRA)
- Object detection, image classification, and other computer vision tasks
- Building recommendation systems using modern deep learning approaches like two-tower models
- Traditional ML tasks, including XGBoost model training and time series forecasting
- Scaling training across multiple GPUs using the Serverless GPU API
For the full list, see Serverless GPU compute example notebooks.
Troubleshooting
Genie Code can help diagnose and suggest fixes for library installation errors. See Use Genie Code to debug compute environment errors.
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
This error typically arises when a dependent package was compiled against one NumPy version but a different NumPy version is installed in the runtime environment. The incompatibility is caused by changes in NumPy's C API and is particularly common across the NumPy 1.x to 2.x boundary. It usually means that a package installed in the notebook changed the NumPy version.
Recommended solution:
Check the NumPy version in the runtime and ensure it is compatible with your packages. See the Serverless GPU Compute release notes for environment 4 and environment 3 for information on preinstalled Python libraries. If you have a dependency on a different version of NumPy, add that dependency to your compute environment.
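A quick way to check which side of the 1.x/2.x boundary your environment is on, using only the standard library, is a minimal sketch like the following (the helper name is ours, not part of any Databricks API):

```python
from importlib import metadata

def installed_numpy_major():
    """Return NumPy's installed major version, or None if NumPy is absent."""
    try:
        return int(metadata.version("numpy").split(".")[0])
    except metadata.PackageNotFoundError:
        return None

# Packages compiled against NumPy 1.x can break under 2.x; if a dependency
# requires 1.x, pin it explicitly, e.g. %pip install "numpy<2"
print(installed_numpy_major())
```

Comparing this value against the version your dependent packages were built for tells you whether you need to pin NumPy in your compute environment.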
PyTorch cannot find libcudnn when installing torch
When you install a different version of torch, you might see the error: ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory. This happens because torch searches for the cuDNN shared library only in its local package path, and a partial install can leave that library missing or mismatched.
Recommended solution:
Reinstall the dependencies by adding --force-reinstall when installing torch:
```
%pip install torch --force-reinstall
```
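After reinstalling, you can confirm that torch can locate cuDNN with a quick check. This sketch assumes a GPU-enabled torch build; `torch.backends.cudnn.version()` returns None when the library cannot be loaded.

```python
import importlib.util

# Verify cuDNN is loadable after reinstalling torch
if importlib.util.find_spec("torch") is not None:
    import torch
    print(torch.backends.cudnn.version())  # None means cuDNN was not found
else:
    print("torch is not installed in this environment")
```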