The PyTorch project is a Python package that provides GPU-accelerated tensor computation and high-level functionality for building deep learning networks. For licensing details, see the PyTorch license doc on GitHub.
To monitor and debug your PyTorch models, consider using TensorBoard.
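As a minimal sketch of that pattern (the log directory below is an arbitrary example, not a required path), you can log metrics with PyTorch's SummaryWriter and then point TensorBoard at the same directory:

from torch.utils.tensorboard import SummaryWriter

# Write scalar metrics that TensorBoard can plot; requires the tensorboard package.
writer = SummaryWriter(log_dir="/tmp/tensorboard")
for step in range(100):
    writer.add_scalar("loss", 1.0 / (step + 1), step)
writer.close()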
PyTorch is included in Databricks Runtime for Machine Learning. If you are using Databricks Runtime, see Install PyTorch for instructions on installing PyTorch.
This is not a comprehensive guide to PyTorch. For more information, see the PyTorch website.
To test and migrate single-machine workflows, use a Single Node cluster.
For distributed training options for deep learning, see Distributed training.
Databricks Runtime for Machine Learning includes PyTorch, so you can create a cluster and start using PyTorch right away. For the version of PyTorch installed in the Databricks Runtime ML version you are using, see the release notes.
Databricks recommends that you use the PyTorch included in Databricks Runtime for Machine Learning. However, if you must use the standard Databricks Runtime, PyTorch can be installed as a Databricks PyPI library. The following example shows how to install PyTorch 1.5.0:
On GPU clusters, install pytorch and torchvision by specifying the following:
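For example, specify these package pins as a PyPI library (torchvision 0.6.0 is the release paired with PyTorch 1.5.0; treat the exact pin as an assumption to verify against the PyTorch compatibility matrix):

torch==1.5.0
torchvision==0.6.0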
On CPU clusters, install pytorch and torchvision by using the following Python wheel files:
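The URLs below follow the pattern published on download.pytorch.org; the cp37 tags assume Python 3.7 on Linux x86_64, so adjust them to match your cluster's Python version:

https://download.pytorch.org/whl/cpu/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
https://download.pytorch.org/whl/cpu/torchvision-0.6.0%2Bcpu-cp37-cp37m-linux_x86_64.whl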
The following sections describe common error messages and troubleshooting guidance for the PyTorch DataParallel and PyTorch DistributedDataParallel classes. Most of these errors can likely be resolved with TorchDistributor, which is available on Databricks Runtime ML 13.0 and above. However, if TorchDistributor is not a viable solution, recommended solutions are also provided within each section.
The following is an example of how to use TorchDistributor:
from pyspark.ml.torch.distributor import TorchDistributor

distributor = TorchDistributor(num_processes=2, local_mode=True)
distributor.run(train_fn, 1e-3)  # runs train_fn(1e-3) on each process
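The train_fn referenced above is your own training function; a minimal placeholder (the body here is an assumption, not a required structure) might look like this:

def train_fn(learning_rate):
    # Import inside the function so each worker process resolves its own modules.
    import torch
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # ... run your training loop here and optionally return a result ...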
This error occurs when using notebooks, regardless of environment: Databricks, local machine, etc. To avoid this error, use torch.multiprocessing.start_processes with start_method=fork instead of torch.multiprocessing.spawn. For example:
import torch

def train_fn(rank, learning_rate):
    # required setup, e.g. setup(rank)
    ...

num_processes = 2
torch.multiprocessing.start_processes(train_fn, args=(1e-3,), nprocs=num_processes, start_method="fork")
This error appears when you restart distributed training after interrupting the cell while training is in progress. To resolve it, restart the cluster. If that does not solve the problem, there may be an error in the training function code.
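One way to isolate an error in the training function itself (a general debugging pattern, not a Databricks-specific feature) is to call it once, single-process, on the driver so that any Python exception surfaces directly:

# Invoke the function directly with rank 0 to surface exceptions before distributing it.
train_fn(0, 1e-3)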