Best practices for Serverless GPU compute
This article presents best practice recommendations for using serverless GPU compute in your notebooks and jobs.
Following these recommendations improves the productivity, cost efficiency, and reliability of your workloads on Databricks.
Move data loading code inside the @distributed decorator
A dataset created outside the decorated function is pickled and sent to the remote workers, and its size can exceed the maximum size allowed by pickle. Therefore, it is recommended to load or generate the dataset inside the decorated function, like so:
```python
from serverless_gpu import distributed

# Bad practice: the dataset is created outside the decorated function,
# so it must be pickled and sent to the remote workers, which may fail
# for large datasets.
dataset = get_dataset(file_path)

@distributed(gpus=8, remote=True)
def run_train():
    # Good practice: load the dataset inside the decorated function.
    dataset = get_dataset(file_path)
    ...
```
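Once the function is defined, it is launched with the `.distributed()` call used later in this article. This is a minimal sketch and assumes `get_dataset` and `file_path` are already defined in your notebook.

```python
# Launch the remote distributed run. Only the function and small closure
# variables such as file_path are serialized; the dataset itself is
# loaded on the remote workers.
run_train.distributed()
```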
Use the right compute
- Use Serverless GPU compute. This option comes with torch, CUDA, and torchvision versions that are optimized for compatibility. Exact package versions depend on the environment version.
- Select your accelerator in the environment side panel.
- For remote distributed training workloads, use an A10 GPU for the notebook. The A10 acts as the client that submits the job to the remote H100 GPUs (see the sketch after this list).
- For running large interactive jobs in the notebook itself, you can attach your notebook to an H100, which takes up one node (8 H100 GPUs).
- To avoid taking up GPUs, you can attach your notebook to a CPU cluster for operations such as git clone and MDS conversion.
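As a sketch of the remote pattern described above, the notebook environment runs on an A10 while training is dispatched to H100 GPUs. The `gpus`, `gpu_type`, and `remote` arguments mirror the decorator used later in this article; the `'H100'` string value is an assumption, so check the accepted values for your workspace.

```python
from serverless_gpu import distributed

# The notebook itself is attached to an A10 (or CPU) environment; the
# decorated function executes remotely on one node of H100 GPUs.
@distributed(gpus=8, gpu_type='H100', remote=True)
def run_train():
    ...  # your training code

run_train.distributed()  # submit the remote distributed run from the A10 client
```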
MLflow recommendations
For an optimal ML development cycle, use MLflow 3 on Databricks. Follow these tips:
- Upgrade your environment's MLflow to version 3.0 or newer and follow the MLflow deep learning flow in MLflow 3 deep learning workflow.
- Set the `step` parameter in `MLFlowLogger` to a reasonable number of batches. MLflow has a limit of 10 million metric steps that can be logged. See Resource limits.
- Enable `mlflow.pytorch.autolog()` if PyTorch Lightning is used as the trainer. A sketch covering this and the step-count recommendation appears after this list.
- Customize your MLflow run name by encapsulating your model training code within the `mlflow.start_run()` API scope. This gives you control over the run name and enables you to restart from a previous run. You can customize the run name using the `run_name` parameter in `mlflow.start_run(run_name="your-custom-name")` or in third-party libraries that support MLflow (for example, Hugging Face Transformers). Otherwise, the default run name is `jobTaskRun-xxxxx`.

  ```python
  from transformers import TrainingArguments

  args = TrainingArguments(
      report_to="mlflow",
      run_name="llama7b-sft-lr3e5",  # <-- MLflow run name
      logging_steps=50,
  )
  ```
- The serverless GPU API launches an MLflow experiment to log system metrics. By default, it uses the name `/Users/{WORKSPACE_USER}/{get_notebook_name()}` unless the user overrides it with the environment variable `MLFLOW_EXPERIMENT_NAME`.
  - When setting the `MLFLOW_EXPERIMENT_NAME` environment variable, use an absolute path. For example, `/Users/<username>/my-experiment`.
  - The experiment name must not match the name of an existing folder. For example, if `my-experiment` is an existing folder, the example above errors out.

  ```python
  import os
  from serverless_gpu import distributed

  os.environ['MLFLOW_EXPERIMENT_NAME'] = '/Users/{WORKSPACE_USER}/my_experiment'

  @distributed(gpus=num_gpus, gpu_type=gpu_type, remote=True)
  def run_train():
      # my training code
      ...
  ```
- To resume training from a previous run, specify the `MLFLOW_RUN_ID` from the previous run as follows.

  ```python
  import os

  os.environ['MLFLOW_RUN_ID'] = '<previous_run_id>'
  run_train.distributed()
  ```
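For the step-count and `mlflow.pytorch.autolog()` recommendations above, here is a minimal sketch with PyTorch Lightning. It uses the Lightning `Trainer`'s `log_every_n_steps` setting (rather than an `MLFlowLogger` argument) to keep the number of logged metric steps reasonable, and the `model` and `datamodule` objects are placeholders for your own `LightningModule` and `LightningDataModule`.

```python
import mlflow
import lightning as L  # or pytorch_lightning, depending on your environment version

# Automatically log Lightning metrics, parameters, and checkpoints to MLflow.
mlflow.pytorch.autolog()

# Log metrics every 50 batches so the total number of logged steps stays
# well below the 10 million metric-step limit.
trainer = L.Trainer(max_epochs=3, log_every_n_steps=50)
trainer.fit(model, datamodule=datamodule)  # placeholders for your LightningModule and DataModule
```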
Multi-user collaboration
- To ensure all users can access shared code (for example, helper modules or environment.yaml), create Git folders in `/Workspace/Repos` or `/Workspace/Shared` instead of user-specific folders like `/Workspace/Users/<your_email>/`. A sketch of importing shared code appears after this list.
- For code that is in active development, use Git folders in user-specific folders `/Workspace/Users/<your_email>/` and push to remote Git repos. This allows multiple users to have a user-specific clone (and branch) while still using a remote Git repo for version control. See best practices for using Git on Databricks.
- Collaborators can share and comment on notebooks.
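As an illustration of the shared-code recommendation above, a notebook can import helper modules from a shared Git folder by adding it to `sys.path`; the repository path and module names below are hypothetical.

```python
import sys

# Hypothetical shared Git folder; replace with the path to your repository.
sys.path.append("/Workspace/Repos/shared/ml-helpers")

# Hypothetical helper module committed to the shared repository.
from helpers import data_utils
```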
Global limits in Databricks
See Resource limits.