Distributed deep learning (DL) involves training a deep neural network, and running inference with it, in parallel across multiple machines. When possible, Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data is too large to fit in memory on a single machine.
Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
Databricks supports two methods for migrating to distributed training: HorovodRunner and HorovodEstimator. HorovodRunner is appropriate when you are migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU contexts. HorovodEstimator is appropriate when you are migrating from Spark ML pipelines.
The following sections introduce HorovodRunner and our recommended workflow for distributed training.
HorovodRunner provides the ability to launch Horovod training jobs as Spark jobs. The HorovodRunner API supports the following methods:

- `init(self, np)`: Create an instance of HorovodRunner, where `np` is the number of parallel processes to use.
- `run(self, main, **kwargs)`: Run a Horovod training job invoking `main(**kwargs)`. Both the main function and the keyword arguments are serialized using cloudpickle and distributed to cluster workers.

For details, see the HorovodRunner API documentation.
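Because `run` serializes the training function with cloudpickle, the function can close over local state and still be shipped to workers that never imported your code. A minimal illustration of that serialization step, assuming the `cloudpickle` package is installed (it ships with PySpark); the function names here are hypothetical:

```python
import cloudpickle

def make_train(lr):
    # train closes over lr; cloudpickle serializes the closure by value,
    # so the bytes can be sent to a worker process and rebuilt there
    def train(epochs):
        return f"training for {epochs} epochs at lr={lr}"
    return train

payload = cloudpickle.dumps(make_train(0.01))  # bytes, safe to send over the wire
restored = cloudpickle.loads(payload)          # rebuild the function on the "worker"
print(restored(epochs=3))                      # training for 3 epochs at lr=0.01
```

This is also why keyword arguments passed to `run` reach the workers: they travel through the same serialized payload as the function itself.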
The general approach to developing a distributed training program using HorovodRunner is:

1. Create a `HorovodRunner` instance initialized with the number of nodes.
2. Define a Horovod training method according to the methods described in Horovod usage.
3. Pass the training method to the `HorovodRunner` instance.

```python
hr = HorovodRunner(np=2)

def train():
    hvd.init()
    # ... define and fit the model here ...

hr.run(train)
```
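Inside the training function, a Horovod program typically follows a fixed pattern: initialize Horovod, pin each process to one GPU, scale the learning rate by the number of workers, wrap the optimizer so gradients are averaged across workers, and broadcast rank 0's initial weights. A sketch of that pattern using TensorFlow/Keras (`build_model` and `train_dataset` are hypothetical helpers standing in for your model and data code); this runs only on a cluster with Horovod installed:

```python
def train():
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one Horovod process per GPU
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        # pin this process to its own GPU
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = build_model()  # hypothetical: returns a compiled-ready Keras model
    # scale the learning rate by the worker count, then wrap the optimizer
    # so gradients are averaged across workers on every step
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    # broadcast rank 0's initial weights so all workers start identical
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(train_dataset(), callbacks=callbacks, epochs=5)  # hypothetical data helper

hr = HorovodRunner(np=2)
hr.run(train)
```

The same steps apply to PyTorch via `horovod.torch`; only the optimizer wrapping and broadcast calls change.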
A typical workflow for distributed deep learning training involves two main phases: data preparation and training. The following topics contain recommended approaches, options, and example notebooks.
Data preparation involves allocating shared storage for data loading and model checkpointing and preparing data for your selected training algorithms. These topics discuss each step:
These topics contain in-depth discussions of the two approaches that Databricks supports for distributed DL training, along with example notebooks demonstrating each approach:
After training is completed, trained networks are deployed for inference. Databricks recommends loading data into a Spark DataFrame, applying the deep learning model in Pandas UDFs, and writing predictions out using Spark. The following topics provide an introduction to doing distributed DL model inference on Databricks.
The first section gives a high-level overview of the workflow to do model inference. The next section provides detailed examples of this workflow using TensorFlow, Keras, and PyTorch. The final section provides some tips for debugging and performance tuning model inference.
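The heart of the recommended inference pattern is the function body that a pandas UDF wraps: Spark feeds it one batch of rows at a time as a pandas Series, and it returns a Series of predictions. Stripped of Spark, that per-batch logic can be sketched as follows; `model_predict` is a stand-in for calling a loaded TensorFlow, Keras, or PyTorch model:

```python
import pandas as pd

def model_predict(batch):
    # stand-in "model": in practice, load the trained model once per worker
    # and call it on the whole batch so inference stays vectorized
    return batch * 2.0

def predict_batch(features: pd.Series) -> pd.Series:
    # this is the body you would register as a pandas UDF; Spark passes
    # one Arrow-backed pandas Series per batch
    preds = model_predict(features.to_numpy())
    return pd.Series(preds)

batch = pd.Series([1.0, 2.0, 3.0])
print(predict_batch(batch).tolist())  # [2.0, 4.0, 6.0]
```

On Databricks, you would wrap `predict_batch` with `pyspark.sql.functions.pandas_udf`, apply it to a column of the Spark DataFrame, and write the resulting predictions out with Spark, as described above.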