Distributed Training

When possible, Databricks recommends that you train neural networks on a single machine; code for distributed training and inference is more complex than single-machine code and slower because of communication overhead. However, you should consider distributed training and inference if your model or your data is too large to fit in memory on a single machine.

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. Horovod makes it easy to take a single-GPU training program and scale it to run on many GPUs.
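The typical change is small: initialize Horovod, pin each process to one GPU, wrap the optimizer so gradients are averaged across workers, and broadcast the initial state from rank 0. The sketch below illustrates that pattern for a Keras script; the model, dataset, and hyperparameters are placeholders chosen for illustration, not part of any Databricks example.

```python
# Minimal sketch of adapting a single-GPU Keras script for Horovod.
import horovod.tensorflow.keras as hvd
import tensorflow as tf

# Initialize Horovod and pin this process to a single GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder dataset and model.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers via ring-allreduce.
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Broadcast initial variables from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(x_train, y_train, batch_size=64, epochs=1,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```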

Databricks supports two methods for migrating to distributed training: HorovodRunner and HorovodEstimator. HorovodRunner is appropriate when you are migrating single-machine TensorFlow, Keras, or PyTorch workloads to multi-GPU contexts. HorovodEstimator is appropriate when you are migrating from Spark ML pipelines.
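To give a feel for the HorovodRunner side, the sketch below shows how a Horovod-enabled training function might be handed to HorovodRunner on a Databricks cluster. The `train` function, its `learning_rate` argument, and the worker count `np=2` are illustrative assumptions, not prescribed values.

```python
# Minimal sketch of launching distributed training with HorovodRunner.
from sparkdl import HorovodRunner

def train(learning_rate=0.001):
    # Placeholder training function: it would contain the Horovod-enabled
    # training loop (hvd.init(), DistributedOptimizer, model.fit, ...)
    # shown in the sketch above.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    # ... build the model and train ...

# np=2 requests two worker processes on the cluster; a negative np runs
# the job locally on the driver, which can be useful for debugging.
hr = HorovodRunner(np=2)
hr.run(train, learning_rate=0.001)
```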

These topics contain in-depth discussions of HorovodRunner and HorovodEstimator, and example notebooks demonstrating each approach: