GPU-enabled clusters
Note
Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver and worker types during cluster creation.
Overview
Databricks supports clusters accelerated with graphics processing units (GPUs). This article describes how to create clusters with GPU-enabled instances and describes the GPU drivers and libraries installed on those instances.
To learn more about deep learning on GPU-enabled clusters, see Deep learning.
Create a GPU cluster
Creating a GPU cluster is similar to creating any Spark cluster. You should keep in mind the following:
The Databricks Runtime Version must be a GPU-enabled version, such as Runtime 13.3 LTS ML (GPU, Scala 2.12.15, Spark 3.4.1).
The Worker Type and Driver Type must be GPU instance types.
For single-machine workflows without Spark, you can set the number of workers to zero.
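As an illustrative sketch, you can also create a GPU cluster through the Clusters API. The runtime version string and instance type below are examples only; check the drop-down lists in the UI or the API for the exact values available in your workspace:

```json
{
  "cluster_name": "gpu-example",
  "spark_version": "13.3.x-gpu-ml-scala2.12",
  "node_type_id": "g5.xlarge",
  "driver_node_type_id": "g5.xlarge",
  "num_workers": 2
}
```

For a single-machine workflow without Spark workers, set num_workers to 0 as described above.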
Supported instance types
Databricks supports the following GPU-accelerated instance types:
[Deprecated] P2 instance type series: p2.xlarge, p2.8xlarge, and p2.16xlarge
P2 instances are available only in select AWS regions. For information, see Amazon EC2 Pricing. Your Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
P2 instances require EBS volumes for storage.
Warning
After August 31, 2023, Databricks no longer supports spinning up clusters using Amazon EC2 P2 instances.
P3 instance type series: p3.2xlarge, p3.8xlarge, and p3.16xlarge.
P3 instances are available only in select AWS regions. For information, see Amazon EC2 Pricing. Your Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
P4d instance type series: p4d.24xlarge, p4de.24xlarge.
P5 instance type series: p5.48xlarge.
G4 instance type series, which are optimized for deploying machine learning models in production.
G5 instance type series, which can be used for a wide range of graphics-intensive and machine learning use cases.
G5 instances require Databricks Runtime 9.1 LTS ML or above.
Considerations
For all GPU-accelerated instance types, keep the following in mind:
Because Amazon spot instance prices for GPUs surge frequently, GPU spot instances are difficult to acquire and retain. Use on-demand instances if needed.
You might need to request a limit increase in order to create GPU-enabled clusters.
See Supported Instance Types for a list of supported GPU instance types and their attributes.
GPU scheduling
Databricks Runtime supports GPU-aware scheduling, available in Apache Spark 3.0 and above. Databricks preconfigures it on GPU clusters.
GPU scheduling is not enabled on Single Node clusters.
spark.task.resource.gpu.amount is the only Spark configuration related to GPU-aware scheduling that you might need to change.
The default configuration uses one GPU per task, which is ideal for distributed inference and for distributed training if you use all GPU nodes.
To run distributed training on a subset of nodes, which reduces communication overhead, Databricks recommends setting spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration.
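For example, on workers that each have four GPUs (an illustrative count; use the actual GPU count of your worker type), the cluster Spark configuration would include:

```
spark.task.resource.gpu.amount 4
```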
For PySpark tasks, Databricks automatically remaps the assigned GPU(s) to indices 0, 1, …. Under the default configuration of one GPU per task, your code can simply use the default GPU without checking which physical GPU is assigned to the task. If you set multiple GPUs per task, for example 4, your code can assume that the indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs, you can get them from the CUDA_VISIBLE_DEVICES environment variable.
If you use Scala, you can get the indices of the GPUs assigned to a task from TaskContext.resources().get("gpu").
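As a minimal sketch of reading the physical indices, assuming (as on these clusters) that CUDA_VISIBLE_DEVICES holds comma-separated integer indices rather than GPU UUIDs, a task could parse the variable like this; the helper name is illustrative:

```python
import os

def physical_gpu_indices(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of physical GPU indices.

    Returns an empty list when the variable is unset, which is also
    what a task sees when no GPUs are assigned to it.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in visible.split(",") if i.strip()]

# Inside a PySpark task, the remapped logical indices are simply
# 0 .. n-1, where n is the number of GPUs assigned to the task.
```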
For Databricks Runtime releases below 7.0, to avoid conflicts among multiple Spark tasks trying to use the same GPU, Databricks automatically configures GPU clusters so that there is at most one running task per node. That way the task can use all GPUs on the node without running into conflicts with other tasks.
NVIDIA GPU driver, CUDA, and cuDNN
Databricks installs the NVIDIA driver and libraries required to use GPUs on Spark driver and worker instances:
CUDA Toolkit, installed under /usr/local/cuda.
cuDNN: NVIDIA CUDA Deep Neural Network Library.
NCCL: NVIDIA Collective Communications Library.
The version of the NVIDIA driver included is 470.57.02, which supports CUDA 11.0.
For the versions of the libraries included, see the release notes for the specific Databricks Runtime version you are using.
Note
This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Databricks includes code from CUDA Samples.
NVIDIA End User License Agreement (EULA)
When you select a GPU-enabled “Databricks Runtime Version” in Databricks, you implicitly agree to the terms and conditions outlined in the NVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries, and the NVIDIA End User License Agreement (with NCCL Supplement) for the NCCL library.
Databricks Container Services on GPU clusters
Preview
This feature is in Public Preview.
You can use Databricks Container Services on clusters with GPUs to create portable deep learning environments with customized libraries. See Customize containers with Databricks Container Services for instructions.
To create custom images for GPU clusters, you must select a standard runtime version instead of Databricks Runtime ML for GPU. When you select Use your own Docker container, you can choose GPU clusters with a standard runtime version. The custom images for GPU clusters are based on the official CUDA containers, which is different from Databricks Runtime ML for GPU.
When you create custom images for GPU clusters, you cannot change the NVIDIA driver version, because it must match the driver version on the host machine.
The databricksruntime account on Docker Hub contains example base images with GPU capability. The Dockerfiles used to generate these images are located in the example containers GitHub repository, which also has details on what the example images provide and how to customize them.
Error messages
The following error indicates that the AWS cloud provider does not have enough capacity for the requested compute resource.
Error: Cluster terminated. Reason: AWS Insufficient Instance Capacity Failure
To resolve, you can try to create a cluster in a different availability zone. The availability zone is in the cluster configuration, under Advanced options. You can also review AWS reserved instances pricing to purchase additional quota.
If your cluster uses P4d or G5 instance types and Databricks Runtime 7.3 LTS ML, the CUDA package version in 7.3 is incompatible with newer GPU instances. In those cases, ML packages such as TensorFlow Keras and PyTorch will produce errors such as:
TensorFlow Keras:
InternalError: CUDA runtime implicit initialization on GPU:x failed. Status: device kernel image is invalid
PyTorch:
UserWarning: NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
You can resolve these errors by upgrading to Databricks Runtime 10.4 LTS ML or above.