Distributed training

When possible, Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data are too large to fit in memory on a single machine. For these workloads, Databricks Runtime ML includes the TorchDistributor, Horovod and spark-tensorflow-distributor packages.
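
As a rough way to judge whether a model fits on one machine, the sketch below estimates the memory needed for training state alone. It assumes fp32 weights, fp32 gradients, and an Adam-style optimizer that keeps two extra fp32 values per parameter (16 bytes per parameter in total). It ignores activations and framework overhead, so treat the result as a lower bound. The function name and the 7-billion-parameter example are illustrative and do not come from Databricks documentation.

```python
def estimate_training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Lower-bound estimate of memory (GiB) for weights, gradients, and optimizer state.

    Assumption: 16 bytes per parameter = 4 (fp32 weights) + 4 (fp32 gradients)
    + 8 (two fp32 Adam moment estimates). Activations are not included.
    """
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model.
    needed = estimate_training_state_gib(7_000_000_000)
    print(f"Training state alone needs roughly {needed:.0f} GiB")
```

If the estimate exceeds the memory of the largest single machine or GPU available to you, that is one signal that distributed training may be worth its added complexity.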

Databricks also offers distributed training for Spark ML models with the pyspark.ml.connect module. See Train Spark ML models on Databricks Connect with pyspark.ml.connect.

DeepSpeed distributor

The DeepSpeed distributor is built on top of TorchDistributor and is a recommended solution for customers whose models need more compute power but are limited by memory constraints. DeepSpeed is an open-source library developed by Microsoft that offers optimized memory usage, reduced communication overhead, and advanced pipeline parallelism. Learn more about Distributed training with DeepSpeed distributor.
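
A minimal sketch of launching a job through the DeepSpeed distributor might look like the following. It assumes PySpark 3.5 or later, where the class is exposed as pyspark.ml.deepspeed.deepspeed_distributor.DeepspeedTorchDistributor, plus a GPU cluster with DeepSpeed installed. The helper name launch_deepspeed, the cluster shape, and the configuration values are illustrative assumptions, not recommended settings.

```python
# Illustrative DeepSpeed settings. The keys are standard DeepSpeed config
# options, but the values are assumptions and should be tuned for a real model.
DEEPSPEED_CONFIG = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
    "fp16": {"enabled": True},
}


def launch_deepspeed(train_fn, num_gpus_per_node: int = 1, num_nodes: int = 2):
    """Run a user-supplied DeepSpeed training function across the cluster.

    train_fn is a hypothetical function that builds the model and calls
    deepspeed.initialize itself; it is not defined in this sketch.
    """
    # Imported lazily so the module can be loaded without a Spark installation.
    from pyspark.ml.deepspeed.deepspeed_distributor import DeepspeedTorchDistributor

    distributor = DeepspeedTorchDistributor(
        numGpus=num_gpus_per_node,
        nnodes=num_nodes,
        localMode=False,  # run on the worker nodes rather than the driver
        useGpu=True,
        deepspeedConfig=DEEPSPEED_CONFIG,
    )
    return distributor.run(train_fn)
```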


TorchDistributor

TorchDistributor is an open-source module in PySpark that lets you launch distributed PyTorch training jobs as Spark jobs on your Spark clusters. Under the hood, it initializes the environment and the communication channels between the workers and uses the CLI command torch.distributed.run to run distributed training across the worker nodes. Learn more about Distributed training with TorchDistributor.
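
The flow described above can be sketched as follows, assuming PySpark 3.4 or later and an active Spark session. The names train_fn and launch, the linear model, the random data, and the gloo backend are all illustrative choices. A real job would load its own data and typically use the nccl backend on GPUs.

```python
def train_fn(learning_rate: float) -> float:
    """Hypothetical training function; runs once in each worker process."""
    # Imports live inside the function so they resolve on the workers.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # torch.distributed.run sets the environment variables (RANK, WORLD_SIZE,
    # and so on) that init_process_group reads by default.
    dist.init_process_group(backend="gloo")  # assumption: CPU training
    model = DDP(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    loss_fn = torch.nn.MSELoss()

    for _ in range(5):
        inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # DDP averages gradients across processes here
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()


def launch(num_processes: int = 2, use_gpu: bool = False) -> float:
    """Run train_fn as a Spark job and return the result from rank 0."""
    # Imported lazily so the module can be loaded without a Spark installation.
    from pyspark.ml.torch.distributor import TorchDistributor

    distributor = TorchDistributor(
        num_processes=num_processes, local_mode=False, use_gpu=use_gpu
    )
    return distributor.run(train_fn, 1e-3)
```

Extra positional arguments passed to run, such as the learning rate here, are forwarded to the training function.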


spark-tensorflow-distributor

spark-tensorflow-distributor is an open-source native package in TensorFlow for distributed training with TensorFlow on Spark clusters. Learn more about Distributed training with TensorFlow 2.
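
As a minimal sketch, the package's MirroredStrategyRunner can run a Keras training function across Spark task slots. The example assumes spark-tensorflow-distributor and TensorFlow 2 are installed on the cluster. The names train and launch_tf, the tiny model, and the random data are hypothetical placeholders.

```python
def train():
    """Hypothetical training function executed on each worker."""
    # Imports live inside the function so they resolve on the workers.
    import numpy as np
    import tensorflow as tf

    # By default, MirroredStrategyRunner calls this function inside a
    # tf.distribute strategy scope, so the model is built as usual.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

    # Random placeholder data; a real job would read its own dataset.
    features = np.random.rand(256, 10).astype("float32")
    labels = np.random.rand(256, 1).astype("float32")
    model.fit(features, labels, batch_size=32, epochs=1, verbose=0)


def launch_tf(num_slots: int = 2, use_gpu: bool = False):
    """Run train across num_slots Spark task slots."""
    # Imported lazily so the module can be loaded without the package installed.
    from spark_tensorflow_distributor import MirroredStrategyRunner

    runner = MirroredStrategyRunner(num_slots=num_slots, use_gpu=use_gpu)
    return runner.run(train)
```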


Ray

Ray is an open-source framework that specializes in parallel compute processing for scaling ML workflows and AI applications. See What is Ray on Databricks?
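
To give a feel for Ray's programming model, the sketch below fans a small function out as parallel Ray tasks. It assumes the ray package is installed. Run as-is, it starts or connects to a local Ray instance. On Databricks you would normally start a Ray cluster on Spark first (for example with ray.util.spark.setup_ray_cluster, whose arguments vary by Ray version). The function name parallel_squares is illustrative.

```python
def parallel_squares(values):
    """Square each value in a separate Ray task and collect the results."""
    # Imported lazily so the module can be loaded without Ray installed.
    import ray

    # Start Ray, or reuse the instance if one is already initialized.
    ray.init(ignore_reinit_error=True)

    @ray.remote
    def square(x):
        return x * x

    # .remote() schedules each task and returns a future immediately.
    futures = [square.remote(v) for v in values]
    # ray.get blocks until every task has finished and preserves input order.
    return ray.get(futures)
```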

Horovod (Deprecated)


Horovod and HorovodRunner are now deprecated and will not be pre-installed in Databricks Runtime 16.0 ML and above. For distributed deep learning, Databricks recommends using TorchDistributor for distributed training with PyTorch or the tf.distribute.Strategy API for distributed training with TensorFlow.

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Databricks supports distributed deep learning training using HorovodRunner and the horovod.spark package. For Spark ML pipeline applications using Keras or PyTorch, you can use the horovod.spark estimator API. See Horovod.