Deep Learning Pipelines

Deep Learning Pipelines is a high-level deep learning framework that supports common deep learning workflows through the Apache Spark MLlib Pipelines API and uses Spark to scale deep learning out to big data. It is an open source project released under the Apache 2.0 License. For details about the library, refer to the Deep Learning Pipelines GitHub page.

Deep Learning Pipelines calls into lower-level deep learning libraries. It currently supports TensorFlow and Keras with the TensorFlow backend.

Note

The Deep Learning Pipelines library is included in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing Deep Learning Pipelines using the instructions in the “Cluster setup” section of the notebook below, you can simply create a cluster using Databricks Runtime ML. See Databricks Runtime for Machine Learning.

Migration guide to Databricks Runtime 7.0 ML and above

Important

Parts of the Deep Learning Pipelines library sparkdl are removed in Databricks Runtime 7.0 ML. Specifically, the Transformers and Estimators used in Apache Spark ML pipelines are removed. See the following sections for migration tips and workarounds.

Reading images

Deep Learning Pipelines includes an image reader sparkdl.image.imageIO, which is removed in Databricks Runtime 7.0 ML.

Instead, use the image data source or binary file data source from Apache Spark. Many of the example notebooks in Deep learning show use cases of these two data sources.
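As a rough sketch of this replacement, the helper below loads one directory through both data sources. The function name `load_image_dataframes`, the `*.jpg` filter, and the directory argument are illustrative assumptions; `spark` is assumed to be an existing SparkSession on Spark 3.0 or above.

```python
def load_image_dataframes(spark, image_dir):
    """Load image_dir twice: once decoded, once as raw bytes.

    `spark` is an existing SparkSession; `image_dir` is a hypothetical path.
    """
    # Image data source: decodes each file into an `image` struct column
    # (origin, height, width, nChannels, mode, data).
    images_df = (
        spark.read.format("image")
        .option("dropInvalid", True)  # skip files that cannot be decoded
        .load(image_dir)
    )

    # Binary file data source: keeps the undecoded bytes in a `content`
    # column, next to `path`, `modificationTime`, and `length`.
    binary_df = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.jpg")        # assumed file extension
        .option("recursiveFileLookup", "true")    # descend into subdirectories
        .load(image_dir)
    )
    return images_df, binary_df
```

The binary file variant is usually the more convenient input when decoding and resizing happen later inside a pandas UDF, because the raw bytes can be handed directly to an image library.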

Transfer learning

Deep Learning Pipelines includes a Spark ML Transformer sparkdl.DeepImageFeaturizer for facilitating transfer learning with deep learning models. DeepImageFeaturizer is removed in Databricks Runtime 7.0 ML.

Instead, use pandas UDFs to perform featurization with deep learning models. pandas UDFs and their newer variant, Scalar Iterator pandas UDFs, offer more flexible APIs, support more deep learning libraries, and deliver better performance.

Refer to Featurization for examples of transfer learning with pandas UDFs.
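The pattern can be sketched roughly as follows, assuming Spark 3.0 type-hinted pandas UDFs. The names `featurize_batches` and `make_featurize_udf` are hypothetical, as are the caller-supplied `load_model` (returns an object with a Keras-style `predict` method) and `preprocess` (turns raw image bytes into a fixed-shape NumPy array). This is not the API of `DeepImageFeaturizer`, only one way to replace it.

```python
from typing import Callable, Iterator


def featurize_batches(batches, model, preprocess: Callable):
    """Yield one pandas Series of flat feature vectors per batch of raw inputs."""
    import numpy as np
    import pandas as pd

    for content in batches:
        # Stack the preprocessed items into a single (batch, ...) array.
        inputs = np.stack([preprocess(raw) for raw in content])
        features = model.predict(inputs)
        # Flatten each output so it fits an array<float> column.
        yield pd.Series(
            [np.asarray(f, dtype="float32").ravel().tolist() for f in features]
        )


def make_featurize_udf(load_model: Callable, preprocess: Callable):
    """Wrap featurize_batches in a Scalar Iterator pandas UDF."""
    # Imported lazily so this module can be defined without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("array<float>")
    def featurize(content_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
        # The iterator form loads the model once per task, not once per batch.
        model = load_model()
        yield from featurize_batches(content_iter, model, preprocess)

    return featurize
```

With a DataFrame read through the binary file data source, usage would look something like `df.select(make_featurize_udf(load_model, preprocess)("content").alias("features"))`. The resulting `features` column can then feed a Spark ML classifier such as logistic regression to complete the transfer learning step.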

Distributed hyperparameter tuning

Deep Learning Pipelines includes a Spark ML Estimator sparkdl.KerasImageFileEstimator for tuning hyperparameters using Spark ML tuning utilities. KerasImageFileEstimator is removed in Databricks Runtime 7.0 ML.

Instead, use Hyperopt to distribute hyperparameter tuning for deep learning models.
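A minimal sketch of such a tuning run is shown below. The function name `tune`, the caller-supplied `train_and_evaluate` callable (assumed to train a model with the given hyperparameters and return a validation loss), and the search ranges are all illustrative assumptions rather than recommended settings.

```python
import math


def tune(train_and_evaluate, max_evals=16, parallelism=4, trials=None):
    """Search learning rate and batch size with Hyperopt's TPE algorithm."""
    # Imported lazily so the function can be defined without Hyperopt installed.
    from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

    # Hypothetical search space; adjust names and ranges to the model at hand.
    search_space = {
        "learning_rate": hp.loguniform(
            "learning_rate", math.log(1e-4), math.log(1e-1)
        ),
        "batch_size": hp.choice("batch_size", [32, 64, 128]),
    }

    def objective(params):
        # Hyperopt minimizes the value stored under "loss".
        loss = train_and_evaluate(**params)
        return {"loss": loss, "status": STATUS_OK}

    if trials is None:
        # SparkTrials runs each trial as a Spark task on the cluster.
        trials = SparkTrials(parallelism=parallelism)

    # Note: for hp.choice parameters, fmin returns the index of the choice.
    return fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=trials,
    )
```

Each trial trains one model on a single worker, so this approach suits models that fit on one machine. Passing `trials=hyperopt.Trials()` runs the same search locally, which can be handy for a quick check before using the cluster.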

Distributed inference

Deep Learning Pipelines includes several Spark ML Transformers for distributing inference, all of which are removed in Databricks Runtime 7.0 ML:

  • DeepImagePredictor
  • TFImageTransformer
  • KerasImageFileTransformer
  • TFTransformer
  • KerasTransformer

Instead, use pandas UDFs to run inference on Spark DataFrames, following the examples in Model inference.

Deploying models as SQL UDFs

Deep Learning Pipelines includes a utility sparkdl.udf.keras_image_model.registerKerasImageUDF for deploying a deep learning model as a UDF callable from Spark SQL. registerKerasImageUDF is removed in Databricks Runtime 7.0 ML.

Instead, use MLflow to export the model as a UDF, following the example in Model inference.
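One way this could look is sketched below, using `mlflow.pyfunc.spark_udf` and registering the result for Spark SQL. The helper name `register_model_udf`, the default UDF name, and the model URI are hypothetical; the model is assumed to have been logged with MLflow beforehand.

```python
def register_model_udf(spark, model_uri, udf_name="predict_model",
                       result_type="double"):
    """Load an MLflow model as a Spark UDF and expose it to Spark SQL.

    `spark` is an existing SparkSession; `model_uri` and `udf_name` are
    placeholders chosen for this example.
    """
    # Imported lazily so the helper can be defined without MLflow installed.
    import mlflow.pyfunc

    # Wrap the logged model as a UDF that scores DataFrame columns.
    predict = mlflow.pyfunc.spark_udf(
        spark, model_uri=model_uri, result_type=result_type
    )

    # Register under a SQL-visible name so queries can call it directly.
    spark.udf.register(udf_name, predict)
    return predict
```

After a call such as `register_model_udf(spark, "models:/my_model/1")`, where the URI is only an example, a query like `SELECT predict_model(feature_a, feature_b) FROM my_table` would score rows in SQL. The column and table names here are placeholders.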