Distributed Deep Learning

Distributed deep learning involves training a deep neural network in parallel across multiple machines. A typical workflow has three components that run concurrently: model training, model evaluation (on a held-out validation set), and monitoring.

When possible, we recommend training neural networks on a single machine; distributed training code is more complex than single-machine code, and communication overhead between machines can slow training down. However, consider distributed training if your model or your data is too large to fit in memory on a single machine.

For more information about distributed training, see the guides below, which explain how to run TensorFlow- and Keras-backed distributed deep learning workflows on Databricks using the following frameworks:

  • Horovod: Supports single- and multi-machine TensorFlow and Keras workflows.
  • TensorFlowOnSpark: Supports multi-machine TensorFlow workloads.
  • dist-keras: Supports multi-machine Keras workloads.

We recommend Horovod for both TensorFlow- and Keras-backed workloads because of its ease of use in both single-machine multi-GPU and multi-machine multi-GPU settings.
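
To give a feel for that ease of use, the sketch below shows roughly what a Horovod-wrapped Keras training script might look like. The model, dataset, and hyperparameters are illustrative placeholders, and the script assumes Horovod and TensorFlow are installed and that it is launched with a command such as `horovodrun -np 4 python train.py`; consult the Horovod guide for the details of running it on Databricks.

```python
# A minimal Horovod + Keras training sketch (placeholder model and data).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # initialize Horovod across all participating processes

# Pin each process to a single GPU (one process per GPU), if GPUs exist.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder dataset: MNIST, flattened and normalized.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# Placeholder model: a small fully connected classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged across workers at each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=2,
    callbacks=callbacks,
    verbose=1 if hvd.rank() == 0 else 0,  # log from one worker only
)
```

The same script runs unmodified on one GPU or many: Horovod handles process coordination and gradient averaging, which is why only the optimizer wrapping, learning-rate scaling, and weight broadcast differ from ordinary single-machine Keras code.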