Distributed training

When possible, Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower because of communication overhead. However, you should consider distributed training and inference if your model or your data is too large to fit in memory on a single machine.

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
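The core Horovod pattern is to initialize Horovod inside the training function, pin each process to one GPU, scale the learning rate by the number of workers, wrap the optimizer with Horovod's distributed optimizer, and broadcast the initial weights from rank 0. The following is a minimal sketch of that pattern for Keras; the small MNIST model and hyperparameters are illustrative assumptions, not part of any Databricks example.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

def train():
    # Initialize Horovod and pin this process to a single GPU.
    hvd.init()
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate by the number of workers, then wrap the
    # optimizer so gradients are averaged across all processes.
    opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)

    model.compile(optimizer=opt,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Broadcast initial variables from rank 0 so every worker starts
    # from the same weights.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

    # Only rank 0 prints training progress to keep the logs readable.
    model.fit(x_train, y_train, batch_size=128, epochs=1,
              callbacks=callbacks,
              verbose=2 if hvd.rank() == 0 else 0)
```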

Databricks supports distributed deep learning training using HorovodRunner, a tool that simplifies the process of migrating single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU machines and multi-node clusters.
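As a sketch of how this migration looks in practice, a training function such as the one above can be handed to HorovodRunner, which launches it as Horovod processes on the cluster. This assumes a Databricks Runtime ML cluster where the sparkdl package is available; the choice of two processes is arbitrary and shown only for illustration.

```python
from sparkdl import HorovodRunner

# Launch the training function on 2 Horovod processes across the cluster.
# Using np=-1 instead runs it locally on the driver, which is handy for
# debugging before scaling out.
hr = HorovodRunner(np=2)
hr.run(train)
```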

These articles contain in-depth discussions of Databricks distributed DL training tools and example notebooks demonstrating each approach:

Note

HorovodEstimator is deprecated as of Databricks Runtime 6.2 ML and is scheduled to be removed in Databricks Runtime 7.0 ML. HorovodEstimator is similar to HorovodRunner in providing Horovod support, but it constrains you to the TensorFlow Estimator and Apache Spark ML Pipeline APIs.