Distributed training

When possible, Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data are too large to fit in memory on a single machine.

Horovod

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
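Adapting a single-GPU Keras script to Horovod typically means initializing Horovod, pinning each process to one GPU, scaling the learning rate by the number of workers, and wrapping the optimizer. The following is a minimal sketch of those changes, using a toy model and synthetic data as stand-ins for a real training job; it is an illustration, not a complete Horovod setup.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one Horovod process per GPU

# Pin this process to its local GPU, if GPUs are present
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Toy model and synthetic data stand in for your real training job
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
x = np.random.rand(1024, 10).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

# Scale the learning rate by the number of workers and wrap the optimizer
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

# Broadcast initial variables from rank 0 so all workers start in sync
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=2 if hvd.rank() == 0 else 0)
```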

Databricks supports distributed DL training using HorovodRunner, a tool that simplifies the process of migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU machines and multi-node clusters.
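As a minimal sketch, a Horovod training function like the one above can be launched on a cluster by passing it to HorovodRunner; the function then runs once per Horovod process. The `np=2` value below is an arbitrary example of the number of parallel processes, and the training body is elided.

```python
from sparkdl import HorovodRunner

def train():
    # Runs on each Horovod process (one per GPU/slot).
    # Imports go inside the function so they are available on the workers.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    # ... single-GPU training code adapted for Horovod, as in the sketch above ...

# np is the number of parallel processes to use; 2 here is just an example.
hr = HorovodRunner(np=2)
hr.run(train)
```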

spark-tensorflow-distributor

spark-tensorflow-distributor is an open-source native package in TensorFlow that helps users do distributed training with TensorFlow on their Spark clusters. It is built on top of tf.distribute.Strategy, one of the major features in TensorFlow 2.
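A minimal sketch of the package's entry point, MirroredStrategyRunner: you pass it a training function, and it runs that function on the cluster's GPU slots under a TensorFlow distribution strategy. The `num_slots=2` value and the toy training body below are illustrative assumptions, not a recommended configuration.

```python
from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
    # Executed on each slot; the runner configures the TensorFlow
    # distribution strategy before calling this function.
    import numpy as np
    import tensorflow as tf

    # Toy model and synthetic data stand in for your real training job
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mse", optimizer="adam")
    x = np.random.rand(1024, 10).astype("float32")
    y = np.random.rand(1024, 1).astype("float32")
    model.fit(x, y, batch_size=64, epochs=2, verbose=2)

# num_slots is the total number of GPU slots to train on; 2 is just an example.
MirroredStrategyRunner(num_slots=2).run(train)
```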

These articles contain in-depth discussions of Databricks distributed DL training tools and example notebooks demonstrating each approach:

Note

HorovodEstimator was removed in Databricks Runtime 7.0 ML. HorovodEstimator is similar to HorovodRunner in providing Horovod support, but it constrains you to TensorFlow Estimators and Apache Spark ML Pipeline APIs.