GPU-enabled clusters
Note
Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver and worker types during cluster creation.
Overview
Databricks supports clusters accelerated with graphics processing units (GPUs). This article describes how to create clusters with GPU-enabled instances and describes the GPU drivers and libraries installed on those instances.
To learn more about deep learning on GPU-enabled clusters, see Deep learning.
Create a GPU cluster
Creating a GPU cluster is similar to creating any Spark cluster. You should keep in mind the following:
The Databricks Runtime Version must be a GPU-enabled version, such as Runtime 13.3 LTS ML (GPU, Scala 2.12.15, Spark 3.4.1).
The Worker Type and Driver Type must be GPU instance types.
For single-machine workflows without Spark, you can set the number of workers to zero.
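As an illustrative sketch, you can also create a GPU cluster through the Clusters API. The runtime version string and instance type below are examples only; check the drop-down lists in the UI or the API for the exact values available in your workspace:

```json
{
  "cluster_name": "gpu-example",
  "spark_version": "13.3.x-gpu-ml-scala2.12",
  "node_type_id": "g5.xlarge",
  "driver_node_type_id": "g5.xlarge",
  "num_workers": 2
}
```

For a single-machine workflow without Spark workers, set num_workers to 0 as described above.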
Supported instance types
Databricks supports the following GPU-accelerated instance types:
[Deprecated] P2 instance type series: p2.xlarge, p2.8xlarge, and p2.16xlarge
P2 instances are available only in select AWS regions. For information, see Amazon EC2 Pricing. Your Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
P2 instances require EBS volumes for storage.
Warning
After August 31, 2023, Databricks no longer supports spinning up clusters using Amazon EC2 P2 instances.
P3 instance type series: p3.2xlarge, p3.8xlarge, and p3.16xlarge.
P3 instances are available only in select AWS regions. For information, see Amazon EC2 Pricing. Your Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
P4d instance type series: p4d.24xlarge, p4de.24xlarge.
P5 instance type series: p5.48xlarge.
G4 instance type series, which are optimized for deploying machine learning models in production.
G5 instance type series, which can be used for a wide range of graphics-intensive and machine learning use cases.
G5 instances require Databricks Runtime 9.1 LTS ML or above.
Considerations
For all GPU-accelerated instance types, keep the following in mind:
Because Amazon spot instance prices for GPUs surge frequently, GPU spot instances are difficult to acquire and retain. Use on-demand instances if needed.
You might need to request a limit increase in order to create GPU-enabled clusters.
See Supported Instance Types for a list of supported GPU instance types and their attributes.
GPU scheduling
Databricks Runtime supports GPU-aware scheduling, available in Apache Spark 3.0 and above. Databricks preconfigures it on GPU clusters.
GPU scheduling is not enabled on Single Node clusters.
spark.task.resource.gpu.amount is the only Spark configuration related to GPU-aware scheduling that you might need to change.
The default configuration uses one GPU per task, which is ideal for distributed inference and for distributed training if you use all GPU nodes.
To run distributed training on a subset of nodes, which reduces communication overhead, Databricks recommends setting spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration.
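For example, on workers that each have four GPUs (an illustrative count; use the actual GPU count of your worker type), the cluster Spark configuration would include:

```
spark.task.resource.gpu.amount 4
```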
For PySpark tasks, Databricks automatically remaps the assigned GPU(s) to indices 0, 1, …. Under the default configuration of one GPU per task, your code can simply use the default GPU without checking which physical GPU is assigned to the task. If you set multiple GPUs per task, for example 4, your code can assume that the indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs, you can get them from the CUDA_VISIBLE_DEVICES environment variable.
If you use Scala, you can get the indices of the GPUs assigned to a task from TaskContext.resources().get("gpu").
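As a minimal sketch of reading the physical indices, assuming (as on these clusters) that CUDA_VISIBLE_DEVICES holds comma-separated integer indices rather than GPU UUIDs, a task could parse the variable like this; the helper name is illustrative:

```python
import os

def physical_gpu_indices(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of physical GPU indices.

    Returns an empty list when the variable is unset, which is also
    what a task sees when no GPUs are assigned to it.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in visible.split(",") if i.strip()]

# Inside a PySpark task, the remapped logical indices are simply
# 0 .. n-1, where n is the number of GPUs assigned to the task.
```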
For Databricks Runtime releases below 7.0, to avoid conflicts among multiple Spark tasks trying to use the same GPU, Databricks automatically configures GPU clusters so that there is at most one running task per node. That way the task can use all GPUs on the node without running into conflicts with other tasks.
NVIDIA GPU driver, CUDA, and cuDNN
Databricks installs the NVIDIA driver and libraries required to use GPUs on Spark driver and worker instances:
CUDA Toolkit, installed under /usr/local/cuda.
cuDNN: NVIDIA CUDA Deep Neural Network Library.
NCCL: NVIDIA Collective Communications Library.
The version of the NVIDIA driver included is 470.57.02, which supports CUDA 11.0.
For the versions of the libraries included, see the release notes for the specific Databricks Runtime version you are using.
Note
This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Databricks includes code from CUDA Samples.
NVIDIA End User License Agreement (EULA)
When you select a GPU-enabled “Databricks Runtime Version” in Databricks, you implicitly agree to the terms and conditions outlined in the NVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries, and the NVIDIA End User License Agreement (with NCCL Supplement) for the NCCL library.
Databricks Container Services on GPU clusters
Preview
This feature is in Public Preview.
You can use Databricks Container Services on clusters with GPUs to create portable deep learning environments with customized libraries. See Customize containers with Databricks Container Services for instructions.
To create custom images for GPU clusters, you must select a standard runtime version instead of Databricks Runtime ML for GPU. When you select Use your own Docker container, you can choose GPU clusters with a standard runtime version. The custom images for GPU clusters are based on the official CUDA containers, which is different from Databricks Runtime ML for GPU.
When you create custom images for GPU clusters, you cannot change the NVIDIA driver version, because it must match the driver version on the host machine.
The databricksruntime account on Docker Hub contains example base images with GPU capability. The Dockerfiles used to generate these images are located in the example containers GitHub repository, which also has details on what the example images provide and how to customize them.
Error messages
The following error indicates that the AWS cloud provider does not have enough capacity for the requested compute resource.
Error: Cluster terminated. Reason: AWS Insufficient Instance Capacity Failure
To resolve, you can try to create a cluster in a different availability zone. The availability zone is in the cluster configuration, under Advanced options. You can also review AWS reserved instances pricing to purchase additional quota.
If your cluster uses P4d or G5 instance types and Databricks Runtime 7.3 LTS ML, the CUDA package version in 7.3 is incompatible with newer GPU instances. In those cases, ML packages such as TensorFlow Keras and PyTorch will produce errors such as:
TensorFlow Keras:
InternalError: CUDA runtime implicit initialization on GPU:x failed. Status: device kernel image is invalid
PyTorch:
UserWarning: NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
You can resolve these errors by upgrading to Databricks Runtime 10.4 LTS ML or above.