Distributed training

When possible, Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower because of communication overhead. However, consider distributed training and inference if your model or your data is too large to fit in memory on a single machine. For these workloads, Databricks Runtime ML includes the Horovod and spark-tensorflow-distributor packages.


Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Databricks supports distributed deep learning training using HorovodRunner and the horovod.spark package. For Spark ML pipeline applications using Keras or PyTorch, you can use the horovod.spark estimator API.
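To illustrate the pattern, a HorovodRunner job wraps a single-node training function that each worker executes. The sketch below is a hedged example for PyTorch: the model, learning rate, and np value are placeholders, not part of this article, and the training loop is elided.

```python
def train_hvd(learning_rate=0.01):
    # Each Horovod worker runs this function; import inside so the
    # imports happen on the worker processes.
    import torch
    import horovod.torch as hvd

    hvd.init()  # one process per slot (GPU or CPU)

    # Placeholder model; substitute your own torch.nn.Module.
    model = torch.nn.Linear(10, 1)

    # Common Horovod idiom: scale the learning rate by the worker count.
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=learning_rate * hvd.size())

    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Broadcast initial parameters so every worker starts identically.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    # ... run the usual training loop here ...

# On Databricks, launch the function across the cluster with HorovodRunner
# (np is the number of parallel processes; np=2 is illustrative):
# from sparkdl import HorovodRunner
# hr = HorovodRunner(np=2)
# hr.run(train_hvd, learning_rate=0.01)
```

The same function runs unchanged on a single machine, which makes it practical to debug locally before scaling out.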



Use Horovod

The following articles provide general information about distributed deep learning with Horovod and example notebooks illustrating how to use HorovodRunner and the horovod.spark package.

Troubleshoot Horovod installation

Problem: Importing horovod.{torch|tensorflow} raises ImportError: Extension horovod.{torch|tensorflow} has not been built

Solution: Horovod comes preinstalled on Databricks Runtime ML, so this error typically occurs when an environment update goes wrong. It indicates that Horovod was installed before a required framework (PyTorch or TensorFlow). Because Horovod is compiled during installation, the horovod.{torch|tensorflow} extension is built only if the corresponding framework is present at install time. To fix the issue, follow these steps:

  1. Verify that you are on a Databricks Runtime ML cluster.
  2. Ensure that the PyTorch or TensorFlow package is already installed.
  3. Uninstall Horovod (%pip uninstall -y horovod).
  4. Install cmake (%pip install cmake).
  5. Reinstall Horovod (%pip install horovod).
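Combined into a single notebook cell, the repair sequence looks like this (the --no-cache-dir flag is a suggested precaution so pip rebuilds Horovod instead of reusing a previously built wheel):

```
%pip uninstall -y horovod
%pip install cmake
%pip install --no-cache-dir horovod
```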


Databricks Runtime ML 6.6 and below support HorovodEstimator, which is similar to HorovodRunner but constrains you to TensorFlow Estimators and Apache Spark ML Pipeline APIs.


spark-tensorflow-distributor is an open-source package, native to TensorFlow, for distributed TensorFlow training on Spark clusters.
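As a minimal sketch of how the package is used: you write a single-node Keras training function and hand it to MirroredStrategyRunner, which runs it under tf.distribute on the cluster. The model and num_slots value below are placeholders, and the fit call is elided.

```python
def train():
    # Runs on each Spark task under tf.distribute.MirroredStrategy;
    # import inside the function so it executes on the workers.
    import tensorflow as tf

    # Placeholder model; substitute your own Keras model and data.
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")
    # model.fit(dataset, epochs=...)  # train as usual

# On a Spark cluster (num_slots is the number of GPU/CPU slots to use;
# 2 is illustrative):
# from spark_tensorflow_distributor import MirroredStrategyRunner
# MirroredStrategyRunner(num_slots=2).run(train)
```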