horovod.spark: distributed deep learning with Horovod

Learn how to use the horovod.spark package to perform distributed training of machine learning models.

horovod.spark on Databricks

Databricks supports the horovod.spark package, which provides an estimator API that you can use in ML pipelines with Keras and PyTorch. For details, see Horovod on Spark, which includes a section on Horovod on Databricks.
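As a rough illustration of the estimator API mentioned above, the sketch below configures a Keras estimator. It assumes the KerasEstimator and Store classes described in the open-source Horovod on Spark documentation. The helper names estimator_settings and build_keras_estimator, the batch size, and the epoch count are hypothetical placeholders, not values taken from this page.

```python
def estimator_settings(feature_cols, label_cols, num_proc=2):
    """Collect the keyword arguments used by the estimator in this sketch."""
    return {
        "num_proc": num_proc,                # number of parallel Horovod processes
        "feature_cols": list(feature_cols),  # DataFrame columns fed to the model
        "label_cols": list(label_cols),      # DataFrame columns holding the labels
        "batch_size": 32,                    # illustrative value (assumption)
        "epochs": 5,                         # illustrative value (assumption)
    }


def build_keras_estimator(model, optimizer, loss, store_path, settings):
    """Create a KerasEstimator; requires Horovod with Spark support."""
    # Imported lazily so this module can be loaded without Horovod installed.
    import horovod.spark.keras as hvd_keras
    from horovod.spark.common.store import Store

    # The store holds intermediate training data and checkpoints.
    store = Store.create(store_path)
    return hvd_keras.KerasEstimator(
        store=store, model=model, optimizer=optimizer, loss=loss, **settings
    )
```

On a cluster, calling fit(train_df) on the returned estimator would produce a Spark ML model whose transform method adds predictions to a DataFrame. Here train_df stands for a Spark DataFrame that you supply.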

Note

  • Databricks installs the horovod package together with its dependencies. If you upgrade or downgrade these dependencies, you might run into compatibility issues.

  • When using horovod.spark with custom callbacks in Keras, you must save models in the TensorFlow SavedModel format.

    • With TensorFlow 2.x, use the .tf suffix in the file name.

    • With TensorFlow 1.x, set the option save_weights_only=True.
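The two version-specific rules above can be sketched as a small helper that picks checkpoint settings from the TensorFlow major version. This assumes the model is saved through tf.keras.callbacks.ModelCheckpoint, which accepts the filepath and save_weights_only arguments. The helper name checkpoint_kwargs and the file name checkpoint are hypothetical.

```python
import os


def checkpoint_kwargs(tf_version, ckpt_dir):
    """Return ModelCheckpoint keyword arguments for the given TensorFlow version."""
    major = int(tf_version.split(".")[0])
    if major >= 2:
        # TensorFlow 2.x: use the .tf suffix in the file name.
        return {"filepath": os.path.join(ckpt_dir, "checkpoint.tf")}
    # TensorFlow 1.x: set save_weights_only=True.
    return {
        "filepath": os.path.join(ckpt_dir, "checkpoint"),
        "save_weights_only": True,
    }
```

With TensorFlow installed, the result could be passed on as tf.keras.callbacks.ModelCheckpoint(**checkpoint_kwargs(tf.__version__, ckpt_dir)), where ckpt_dir is a directory of your choice.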

Requirements

Databricks Runtime ML 7.4 or above.

Example: Distributed training function

The following basic example runs a distributed training function with horovod.spark:

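import horovod.spark

def train():
    # Import Horovod inside the function so it is loaded in each worker process.
    import horovod.tensorflow as hvd
    # Initialize Horovod in this worker process.
    hvd.init()

# Run train() as 2 parallel Horovod processes on the Spark cluster.
horovod.spark.run(train, num_proc=2)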
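Building on the example above, the following sketch returns a value from each worker and collects the results on the driver. It relies on horovod.spark.run returning a list with one entry per rank, as described in the Horovod on Spark documentation. The names train_with_result, describe, and launch are hypothetical.

```python
def train_with_result():
    # Import inside the function so Horovod is loaded in each worker process.
    import horovod.tensorflow as hvd

    hvd.init()
    # Report this worker's position in the job instead of training a model.
    return hvd.rank(), hvd.size()


def describe(results):
    """Format the per-rank (rank, size) tuples collected on the driver."""
    return [f"worker {rank} of {size}" for rank, size in sorted(results)]


def launch(num_proc=2):
    """Start the Horovod job; requires a Spark cluster with Horovod installed."""
    import horovod.spark

    # run() executes the function in num_proc processes and returns their results.
    results = horovod.spark.run(train_with_result, num_proc=num_proc)
    return describe(results)
```

Calling launch() in a notebook attached to a suitable cluster would start the job. The describe helper can be checked on its own, without a cluster, by passing it a list of (rank, size) tuples.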

Example notebooks: Horovod Spark estimators using Keras and PyTorch

The following notebooks demonstrate how to use the Horovod Spark Estimator API with Keras and PyTorch.

Horovod Spark Estimator Keras notebook

Horovod Spark Estimator PyTorch notebook
