Distributed Training

When possible, Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data are too large to fit in memory on a single machine.

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
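The following is a minimal sketch of the typical Horovod changes to a single-GPU PyTorch training loop; the model, dataset, and hyperparameters shown are placeholders, not a specific Databricks example.

```python
import torch
import horovod.torch as hvd

def train():
    hvd.init()                                   # initialize Horovod
    torch.cuda.set_device(hvd.local_rank())      # pin each process to one GPU

    model = torch.nn.Linear(10, 1).cuda()        # placeholder model
    # Scale the learning rate by the number of workers (a common Horovod convention).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across all workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start every worker from the same initial model and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    for _ in range(100):                         # placeholder training loop
        inputs = torch.randn(32, 10).cuda()
        targets = torch.randn(32, 1).cuda()
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
```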

Databricks supports distributed DL training via the HorovodRunner tool. HorovodRunner simplifies the process of migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU machines and multi-node clusters.
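As a sketch, a training function like the `train` function above can be launched with HorovodRunner from a Databricks notebook; the value `np=2` is an arbitrary example for two parallel tasks.

```python
from sparkdl import HorovodRunner

# np > 0 runs tasks on Spark executors; a negative np runs that many local processes,
# which is useful for testing on the driver before scaling out.
hr = HorovodRunner(np=2)
hr.run(train)
```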

Note

HorovodEstimator has been deprecated as of Databricks Runtime 6.2 ML and is scheduled to be removed in Databricks Runtime 7.0 ML. HorovodEstimator is similar to HorovodRunner in providing Horovod support, but it constrains the user to TensorFlow Estimators and Spark ML Pipeline APIs.

These articles contain in-depth discussions of HorovodRunner and HorovodEstimator, and example notebooks demonstrating each approach: