Best practices for Serverless GPU compute
This article presents best practice recommendations for using serverless GPU compute in your notebooks and jobs.
Following these recommendations improves the productivity, cost efficiency, and reliability of your workloads on Databricks.
Move data loading code inside the @distributed decorator
A dataset created outside the decorated function is pickled and sent to the remote workers, and its size can exceed the maximum size allowed by pickle. Therefore, it is recommended to load or generate the dataset inside the decorated function, like so:
```python
from serverless_gpu import distributed

# Bad practice: the dataset is created outside the decorated function,
# so it must be pickled and sent to the remote workers, which may fail
# for large datasets.
dataset = get_dataset(file_path)

@distributed(gpus=8, remote=True)
def run_train():
    # Good practice: load the dataset inside the decorated function.
    dataset = get_dataset(file_path)
    ...
```
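Once the function is defined, it is launched with the `.distributed()` call used later in this article. This is a minimal sketch and assumes `get_dataset` and `file_path` are already defined in your notebook.

```python
# Launch the remote distributed run. Only the function and small closure
# variables such as file_path are serialized; the dataset itself is
# loaded on the remote workers.
run_train.distributed()
```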
Use the right compute
- Use Serverless GPU compute. This option comes with torch, CUDA, and torchvision versions that are optimized for compatibility. Exact package versions depend on the environment version.
- Select your accelerator in the environment side panel.
- For remote distributed training workloads, use an A10 GPU for the notebook. The A10 acts as the client that submits the job to the remote H100 GPUs (see the sketch after this list).
- For running large interactive jobs in the notebook itself, you can attach your notebook to an H100, which takes up one node (8 H100 GPUs).
- To avoid taking up GPUs, you can attach your notebook to a CPU cluster for operations such as git clone and MDS conversion.
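As a sketch of the remote pattern described above, the notebook environment runs on an A10 while training is dispatched to H100 GPUs. The `gpus`, `gpu_type`, and `remote` arguments mirror the decorator used later in this article; the `'H100'` string value is an assumption, so check the accepted values for your workspace.

```python
from serverless_gpu import distributed

# The notebook itself is attached to an A10 (or CPU) environment; the
# decorated function executes remotely on one node of H100 GPUs.
@distributed(gpus=8, gpu_type='H100', remote=True)
def run_train():
    ...  # your training code

run_train.distributed()  # submit the remote distributed run from the A10 client
```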
MLflow recommendations
For an optimal ML development cycle, use MLflow 3 on Databricks. Follow these tips:
- Upgrade your environment's MLflow to version 3.0 or newer and follow the MLflow deep learning flow in MLflow 3 deep learning workflow.
- Set the `step` parameter in `MLFlowLogger` to a reasonable number of batches. MLflow has a limit of 10 million metric steps that can be logged. See Resource limits.
- Enable `mlflow.pytorch.autolog()` if PyTorch Lightning is used as the trainer. A sketch covering this and the step-count recommendation appears after this list.
- Customize your MLflow run name by encapsulating your model training code within the `mlflow.start_run()` API scope. This gives you control over the run name and enables you to restart from a previous run. You can customize the run name using the `run_name` parameter in `mlflow.start_run(run_name="your-custom-name")` or in third-party libraries that support MLflow (for example, Hugging Face Transformers). Otherwise, the default run name is `jobTaskRun-xxxxx`.

  ```python
  from transformers import TrainingArguments

  args = TrainingArguments(
      report_to="mlflow",
      run_name="llama7b-sft-lr3e5",  # <-- MLflow run name
      logging_steps=50,
  )
  ```
- The serverless GPU API launches an MLflow experiment to log system metrics. By default, it uses the name `/Users/{WORKSPACE_USER}/{get_notebook_name()}` unless the user overrides it with the environment variable `MLFLOW_EXPERIMENT_NAME`.
  - When setting the `MLFLOW_EXPERIMENT_NAME` environment variable, use an absolute path. For example, `/Users/<username>/my-experiment`.
  - The experiment name must not match the name of an existing folder. For example, if `my-experiment` is an existing folder, the example above errors out.

  ```python
  import os
  from serverless_gpu import distributed

  os.environ['MLFLOW_EXPERIMENT_NAME'] = '/Users/{WORKSPACE_USER}/my_experiment'

  @distributed(gpus=num_gpus, gpu_type=gpu_type, remote=True)
  def run_train():
      # my training code
      ...
  ```
- To resume training from a previous run, specify the `MLFLOW_RUN_ID` from the previous run as follows.

  ```python
  import os

  os.environ['MLFLOW_RUN_ID'] = '<previous_run_id>'
  run_train.distributed()
  ```
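For the step-count and `mlflow.pytorch.autolog()` recommendations above, here is a minimal sketch with PyTorch Lightning. It uses the Lightning `Trainer`'s `log_every_n_steps` setting (rather than an `MLFlowLogger` argument) to keep the number of logged metric steps reasonable, and the `model` and `datamodule` objects are placeholders for your own `LightningModule` and `LightningDataModule`.

```python
import mlflow
import lightning as L  # or pytorch_lightning, depending on your environment version

# Automatically log Lightning metrics, parameters, and checkpoints to MLflow.
mlflow.pytorch.autolog()

# Log metrics every 50 batches so the total number of logged steps stays
# well below the 10 million metric-step limit.
trainer = L.Trainer(max_epochs=3, log_every_n_steps=50)
trainer.fit(model, datamodule=datamodule)  # placeholders for your LightningModule and DataModule
```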
Multi-user collaboration
- To ensure all users can access shared code (for example, helper modules or environment.yaml), create Git folders in `/Workspace/Repos` or `/Workspace/Shared` instead of user-specific folders like `/Workspace/Users/<your_email>/`. A sketch of importing shared code appears after this list.
- For code that is in active development, use Git folders in user-specific folders `/Workspace/Users/<your_email>/` and push to remote Git repos. This allows multiple users to have a user-specific clone (and branch) while still using a remote Git repo for version control. See best practices for using Git on Databricks.
- Collaborators can share and comment on notebooks.
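As an illustration of the shared-code recommendation above, a notebook can import helper modules from a shared Git folder by adding it to `sys.path`; the repository path and module names below are hypothetical.

```python
import sys

# Hypothetical shared Git folder; replace with the path to your repository.
sys.path.append("/Workspace/Repos/shared/ml-helpers")

# Hypothetical helper module committed to the shared repository.
from helpers import data_utils
```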
Global limits in Databricks
See Resource limits.