Distributed Deep Learning with dist-keras

dist-keras is an open-source framework for distributed training of Keras models (deep neural networks). It leverages Apache Spark to distribute and coordinate the training computation, and runs training directly on data in Spark DataFrames. dist-keras provides a built-in set of optimization strategies, such as Downpour and Dynamic SGD. To learn more about the available optimization strategies, see the dist-keras README.
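To make the Downpour idea concrete, the following is a toy, stdlib-only illustration of asynchronous data-parallel SGD: several workers share one parameter and push gradient updates to it independently. This is not dist-keras code or its API; it is only a sketch of the training pattern the library implements on top of Spark (with a lock added here for thread safety, whereas real Downpour-style training tolerates lock-free, stale updates).

```python
import threading

params = {"w": 0.0}        # stands in for the shared "parameter server" state
lock = threading.Lock()    # real Downpour tolerates lock-free, stale updates

def worker(shard, lr=0.05, steps=500):
    # Each worker repeatedly pulls the shared parameter, computes a
    # gradient on its own data shard, and pushes the update back.
    for i in range(steps):
        x = shard[i % len(shard)]
        with lock:
            w = params["w"]                # pull current parameters
            grad = 2.0 * (w - x)           # gradient of the loss (w - x)^2
            params["w"] = w - lr * grad    # push the update asynchronously

# Each worker trains on its own shard of the data.
shards = [[1.0, 5.0], [2.0, 4.0]]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
# params["w"] ends up near 3.0, the minimizer over all shards combined.
```

In dist-keras, Spark plays the role of the worker pool and the data shards are partitions of a Spark DataFrame; the optimization strategies differ mainly in how and when workers exchange parameter updates.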

For single-machine training, see the Keras guide. For inference, we recommend that you use Deep Learning Pipelines, which leverages Spark to efficiently perform large-scale batch inference for Keras and TensorFlow models.

Installing dist-keras

Installing dist-keras is a two-step process:

  1. Specify network configuration on the driver and Spark workers.
  2. Install the dist-keras library.

Specifying network configuration

For both CPU- and GPU-enabled clusters, dist-keras requires that you specify additional networking configurations on the driver and Spark workers before you install the library itself. We recommend that you set this configuration through an init script. The notebook below demonstrates how:

Installing on CPU clusters

On CPU-only clusters, attach dist-keras to your cluster as a PyPI library.

Installing on GPU clusters

When you use dist-keras on GPU-enabled clusters, use the tensorflow-gpu Python library to take advantage of GPU acceleration. However, dist-keras depends on the CPU-only build of TensorFlow by default. We therefore recommend that you build dist-keras as an egg modified to depend on tensorflow-gpu and attach it to your cluster as a library:

  1. Clone dist-keras to your local machine (git clone https://github.com/cerndb/dist-keras.git).
  2. Open setup.py (a Python file in the project's root directory).
  3. In the install_requires keyword argument (a list of Python dependencies), replace 'tensorflow' with 'tensorflow-gpu'.
  4. From the root project directory, run python setup.py bdist_egg to build dist-keras as an egg.
  5. Upload the egg file (the only file in ./dist; for example, ./dist/dist_keras-0.2.1-py2.7.egg) to Databricks and attach it to your cluster.
  6. Install the following additional Python dependencies as PyPI libraries: tensorflow-gpu, keras, and h5py.
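The edit in step 3 can also be scripted. The helper below is hypothetical (use_gpu_tensorflow is not part of dist-keras), and it is demonstrated here against a minimal stand-in for the real setup.py rather than the actual file:

```python
import tempfile
from pathlib import Path

def use_gpu_tensorflow(setup_path):
    """Swap the 'tensorflow' dependency for 'tensorflow-gpu' in a setup.py.

    Hypothetical helper for step 3 above; editing the file by hand
    works just as well.
    """
    text = Path(setup_path).read_text()
    # install_requires lists plain 'tensorflow' by default.
    patched = text.replace("'tensorflow'", "'tensorflow-gpu'")
    Path(setup_path).write_text(patched)
    return patched

# Demo on a minimal stand-in for dist-keras's setup.py:
sample = (
    "from setuptools import setup\n"
    "setup(name='dist-keras', install_requires=['tensorflow', 'keras'])\n"
)
path = Path(tempfile.mkdtemp()) / "setup.py"
path.write_text(sample)
patched = use_gpu_tensorflow(path)
```

After the patched setup.py is in place, step 4 (python setup.py bdist_egg) produces the egg that you upload in step 5.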

The example notebook below has been tested on GPU-enabled clusters using an egg built from commit 04cf7767e636cf614ea1fdb98753fe79647f81db of dist-keras.

Example notebook

The example notebook below (adapted from dist-keras) describes how the library works and demonstrates various training workflows: