Distributed Deep Learning¶
Distributed deep learning involves training a deep neural network in parallel across multiple machines. A typical workflow has three components that run concurrently: model training, model evaluation (on a held-out validation set), and monitoring.
When possible, we recommend training neural networks on a single machine; distributed training code is more complex than single-machine code and slower due to communication overhead. However, consider distributed training if your model or your data is too large to fit in memory on a single machine.
For more information about distributed training, see the guides below, which explain how to run TensorFlow- and Keras-backed distributed deep learning workflows on Databricks using the following frameworks:
- Horovod: Supports single and multi-machine TensorFlow and Keras workflows.
- TensorFlowOnSpark: Supports multi-machine TensorFlow workloads.
- dist-keras: Supports multi-machine Keras workloads.
We recommend Horovod for both TensorFlow- and Keras-backed workloads because it is easy to use in both single-machine multi-GPU and multi-machine multi-GPU contexts.
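Under the hood, Horovod data-parallelizes training: each worker computes gradients on its own shard of the data, the gradients are averaged across workers with a ring-allreduce, and every worker applies the identical update so model replicas stay in sync. The following plain-Python sketch (no Horovod required; the function names and the toy least-squares model are illustrative, not Horovod's API) shows that averaging step:

```python
# Illustrative sketch of the gradient averaging that Horovod's
# allreduce performs during data-parallel training. The toy model
# (least-squares fit of y = w * x) and all names are hypothetical.

def local_gradient(w, shard):
    """Gradient of mean squared error on this worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for ring-allreduce: every worker receives the mean."""
    return sum(values) / len(values)

def train_step(w, shards, lr=0.01):
    # 1. Each worker computes a gradient on its own shard.
    grads = [local_gradient(w, shard) for shard in shards]
    # 2. Gradients are averaged across workers (Horovod: allreduce).
    g = allreduce_mean(grads)
    # 3. Every worker applies the identical update, keeping replicas in sync.
    return w - lr * g

# Data for y = 3x, split across two "workers".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
print(round(w, 3))  # converges toward 3.0
```

Because every worker sees the same averaged gradient, the result is mathematically equivalent to a single-machine step over the full batch, which is why this scheme scales training without changing model semantics.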
- Distributed Deep Learning with Horovod
- Horovod Example Notebooks
- Distributed Deep Learning with TensorFlowOnSpark
- TensorFlowOnSpark Example Notebooks
- Distributed Deep Learning with dist-keras