Distributed Deep Learning with dist-keras

dist-keras is an open-source framework for distributed training of Keras models (deep neural networks). It leverages Apache Spark to distribute and coordinate the training computation, and runs training directly on data in Spark DataFrames. dist-keras provides a built-in set of optimization strategies, such as Downpour and Dynamic SGD. To learn more about the available optimization strategies, see the dist-keras README.

For single-machine training, see the Keras guide. For inference, we recommend that you use Deep Learning Pipelines, which leverages Spark to efficiently perform large-scale batch inference for Keras and TensorFlow models.

Install dist-keras on CPU clusters

On CPU-only clusters, attach dist-keras and tensorflow to your cluster as PyPI libraries.
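After the libraries are attached, a quick check from a notebook cell can confirm that they are importable on the driver. The following is a minimal sketch, not part of dist-keras: the helper name find_missing_modules is hypothetical, and it assumes that dist-keras installs under the module name distkeras.

```python
import importlib


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The module is not installed (or one of its own imports is missing).
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: dist-keras is importable as "distkeras".
    print(find_missing_modules(["distkeras", "tensorflow"]))
```

An empty list means both libraries were found. The check only covers the Python process that runs it, so it does not prove that every worker node has the libraries installed.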

Install dist-keras on GPU clusters

When you use dist-keras on GPU-enabled clusters, use the tensorflow-gpu Python library to take advantage of GPU acceleration. However, dist-keras depends on the CPU-only build of TensorFlow by default. We therefore recommend that you build dist-keras as an egg modified to depend on tensorflow-gpu, and attach that egg to your cluster as a library:

  1. Clone dist-keras to your local machine (git clone https://github.com/cerndb/dist-keras).
  2. Open setup.py (a Python file in the project root directory).
  3. Modify the install_requires keyword argument (a list of Python dependencies): replace 'tensorflow' with 'tensorflow-gpu'.
  4. From the root project directory, run python setup.py bdist_egg to build dist-keras as an egg.
  5. Upload the egg file (the only file in ./dist; for example, ./dist/dist_keras-0.2.1-py2.7.egg) to Databricks and attach it to your cluster.
  6. Install the following additional Python dependencies as PyPI libraries: tensorflow-gpu, keras, and h5py.
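The dependency change in step 3 is a one-word edit, but it can help to see the intended result on a concrete list. The following is a minimal sketch under stated assumptions: the helper use_gpu_tensorflow is hypothetical, and the example list is illustrative rather than the actual contents of install_requires in the dist-keras setup.py.

```python
import re


def use_gpu_tensorflow(requirements):
    """Return a copy of requirements with 'tensorflow' replaced by 'tensorflow-gpu'.

    Version specifiers (for example 'tensorflow>=1.0') are preserved, and
    entries that only start with the same prefix, such as 'tensorflow-gpu',
    are left unchanged.
    """
    # Match 'tensorflow' only when it is the whole package name, that is,
    # when it is not followed by another name character.
    pattern = re.compile(r"^tensorflow(?![\w.-])")
    return [pattern.sub("tensorflow-gpu", req.strip()) for req in requirements]


if __name__ == "__main__":
    # Illustrative dependency list, not copied from dist-keras.
    example_requires = ["keras", "tensorflow", "h5py"]
    print(use_gpu_tensorflow(example_requires))
```

In practice you make the same substitution by hand in setup.py before you run python setup.py bdist_egg. The helper only illustrates which entry changes and which entries stay the same.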

The following example notebook describes how the library works and demonstrates various training workflows. The notebook has been tested on GPU-enabled clusters using an egg built from commit 04cf7767e636cf614ea1fdb98753fe79647f81db of dist-keras.