Distributed Deep Learning

Distributed deep learning (DL) involves training a deep neural network in parallel across multiple machines. When possible, Databricks recommends that you train neural networks on a single machine: distributed training code is more complex than single-machine code, and it is slower because of communication overhead between machines. However, you should consider distributed training if your model or your data is too large to fit in memory on a single machine.

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
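
The idea behind this kind of data-parallel training is that each worker computes gradients on its own shard of a batch, and the gradients are averaged across workers (an allreduce) before every worker applies the same update. The following is a conceptual sketch in plain Python, not Horovod code. The toy one-parameter model and the helper names local_gradient, allreduce_average, and shard are illustrative assumptions. It shows that averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient.

```python
# Conceptual simulation of data-parallel training; no Horovod calls are used.
# Model: y = w * x, loss = mean((w * x - y)^2), so dloss/dw = mean(2 * x * (w * x - y)).

def local_gradient(w, xs, ys):
    """Gradient of the mean squared error over one worker's shard."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def allreduce_average(values):
    """Stand-in for an allreduce: every worker ends up with the mean."""
    return sum(values) / len(values)

def shard(items, num_workers):
    """Split items into num_workers equal, contiguous shards."""
    per_worker = len(items) // num_workers
    return [items[i * per_worker:(i + 1) * per_worker] for i in range(num_workers)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by the true weight w = 2
w = 0.0
num_workers = 2

# Gradient computed on one machine over the whole batch.
full_batch_grad = local_gradient(w, xs, ys)

# Gradient computed per worker on its shard, then averaged.
worker_grads = [
    local_gradient(w, x_shard, y_shard)
    for x_shard, y_shard in zip(shard(xs, num_workers), shard(ys, num_workers))
]
averaged_grad = allreduce_average(worker_grads)
```

The equality holds here because the shards are the same size. With unequal shards, a plain average of per-worker means would weight samples unevenly.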

Databricks supports two methods for migrating to distributed training: HorovodRunner and HorovodEstimator. HorovodRunner is appropriate when you are migrating single-machine TensorFlow, Keras, or PyTorch workloads to multi-GPU contexts. HorovodEstimator is appropriate when you are migrating from Spark ML pipelines.

The following sections introduce HorovodRunner and our recommended workflow for distributed training.

Distributed training with HorovodRunner

HorovodRunner provides the ability to launch Horovod training jobs as Spark jobs. The HorovodRunner API supports the following methods:

__init__(self, np)
Create an instance of HorovodRunner. The np argument is the number of parallel processes to use for the Horovod job.
run(self, main, **kwargs)
Run a Horovod training job that invokes main(**kwargs). Both the main function and the keyword arguments are serialized using cloudpickle and distributed to cluster workers.

For details, see the HorovodRunner API documentation.
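
To make the run(self, main, **kwargs) contract more concrete, here is a small local stand-in. It is not the real HorovodRunner: the class name LocalRunnerSketch is hypothetical, it calls main sequentially in the current process instead of on Spark workers, and it uses the standard-library pickle module on the keyword arguments as a stand-in for cloudpickle. Returning a list of results is a choice of this sketch, not documented HorovodRunner behavior. It illustrates two points: main is invoked once per process with the same keyword arguments, and those arguments must be serializable.

```python
import pickle

class LocalRunnerSketch:
    """Hypothetical stand-in that mimics the shape of the HorovodRunner API."""

    def __init__(self, np):
        self.np = np  # number of simulated processes

    def run(self, main, **kwargs):
        # Round-trip kwargs through pickle to mimic shipping them to workers.
        # The real HorovodRunner uses cloudpickle and also serializes main.
        payload = pickle.dumps(kwargs)
        results = []
        for _ in range(self.np):
            worker_kwargs = pickle.loads(payload)  # each process gets its own copy
            results.append(main(**worker_kwargs))
        return results

def train(learning_rate, epochs):
    # Placeholder for a real training function.
    return f"lr={learning_rate}, epochs={epochs}"

runner = LocalRunnerSketch(np=2)
outputs = runner.run(train, learning_rate=0.1, epochs=3)
```

Passing a value that cannot be pickled (an open file handle, for example) would fail at the pickle.dumps call, which is a reasonable mental model for why training arguments should be plain, serializable values.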

The general approach to developing a distributed training program using HorovodRunner is:

  1. Create a HorovodRunner instance initialized with the number of parallel processes to run (np).
  2. Define a Horovod training function that follows the steps described in Horovod usage.
  3. Pass the training function to the run method of the HorovodRunner instance.

For example:

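from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)  # np: number of parallel processes

def train():
  # Import inside the function so the module is resolved on each worker.
  # horovod.tensorflow is shown; horovod.keras and horovod.torch are alternatives.
  import horovod.tensorflow as hvd
  hvd.init()

hr.run(train)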
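
The training function in this example only initializes Horovod. In practice, each process typically uses its rank and the total number of processes to select its share of the data, and the Horovod usage guide recommends scaling the learning rate by the number of workers. The sketch below shows both ideas in plain Python. Here rank and size are passed in as ordinary arguments (in a Horovod program they would come from hvd.rank() and hvd.size()), and the helper names shard_for_rank and scaled_learning_rate are illustrative, not part of any library.

```python
def shard_for_rank(samples, rank, size):
    """Return every size-th sample starting at rank, so shards do not overlap."""
    return samples[rank::size]

def scaled_learning_rate(base_lr, size):
    """Scale the single-process learning rate by the number of processes."""
    return base_lr * size

samples = list(range(10))
size = 2  # matches np=2 in the HorovodRunner example

# One shard per process; together the shards cover every sample exactly once.
shards = [shard_for_rank(samples, rank, size) for rank in range(size)]
lr = scaled_learning_rate(0.01, size)
```

Strided sharding keeps the shards balanced to within one sample regardless of the dataset length, which is why it is a common default.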

Distributed deep learning workflow

A typical workflow for distributed deep learning involves two main phases: data preparation and training. See the following topics for recommended approaches, options, and example notebooks.

Data preparation

Data preparation involves two steps: allocating shared storage for data loading and model checkpointing, and preparing data for your selected training algorithms. The following topics discuss each step:

Distributed training

The following topics contain in-depth discussions of the two approaches that Databricks supports for distributed deep learning training, HorovodRunner and HorovodEstimator, and example notebooks demonstrating each approach: