Deep Learning Pipelines is a high-level deep learning framework that supports common deep learning workflows through the Apache Spark MLlib Pipelines API and scales out deep learning on big data using Spark. It is an open source project released under the Apache 2.0 License. For details about the library, refer to the Deep Learning Pipelines GitHub page.
Deep Learning Pipelines calls into lower-level deep learning libraries. It currently supports TensorFlow and Keras with the TensorFlow backend.
The Deep Learning Pipelines library is included in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing Deep Learning Pipelines using the instructions in the “Cluster setup” section of the notebook below, you can simply create a cluster using Databricks Runtime ML. See Databricks Runtime for Machine Learning.
Parts of the Deep Learning Pipelines library, sparkdl, are deprecated. Specifically, the Transformers and Estimators used in Apache Spark ML pipelines are deprecated in Databricks Runtime 6.2 ML and are scheduled to be removed in Databricks Runtime 7.0 ML. See the following sections for migration tips and workarounds.
Deep Learning Pipelines includes an image reader, sparkdl.image.imageIO, which is deprecated in Databricks Runtime 6.2 ML.
Deep Learning Pipelines includes a Spark ML Transformer, sparkdl.DeepImageFeaturizer, for facilitating transfer learning with deep learning models.
DeepImageFeaturizer is deprecated in Databricks Runtime 6.2 ML.
Instead, use pandas UDFs to perform featurization with deep learning models. pandas UDFs, and their newer variant Scalar Iterator pandas UDFs, offer more flexible APIs, support more deep learning libraries, and give higher performance.
Refer to Featurization for examples of transfer learning with pandas UDFs.
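As a rough illustration of the Scalar Iterator pattern mentioned above, the sketch below loads a model once and then reuses it for every batch. The names load_model, preprocess, featurize_batches, and make_featurize_udf are hypothetical and not part of sparkdl or Spark; the model is assumed to expose a Keras-style predict method that returns one feature vector per input row. PandasUDFType.SCALAR_ITER is assumed to be available (Databricks Runtime 5.5 and above, or Apache Spark 3.0 and above).

```python
def featurize_batches(batches, load_model, preprocess):
    """Yield one batch of feature vectors per input batch.

    load_model and preprocess are hypothetical callables supplied by the
    caller; the model only needs a predict(batch) method.
    """
    model = load_model()  # expensive step, done once per iterator, not per batch
    for batch in batches:
        yield model.predict(preprocess(batch))


def make_featurize_udf(load_model, preprocess):
    """Wrap featurize_batches in a Scalar Iterator pandas UDF."""
    # Imported lazily so the helper above can be exercised without Spark.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("array<float>", PandasUDFType.SCALAR_ITER)
    def featurize_udf(content_iter):
        # content_iter yields pandas Series, one per Arrow batch.
        for features in featurize_batches(content_iter, load_model, preprocess):
            # Each row's feature vector becomes one array<float> cell.
            yield pd.Series(list(features))

    return featurize_udf
```

The returned UDF could then be applied with something like df.withColumn("features", featurize_udf("content")), where the column names are placeholders.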
Deep Learning Pipelines includes a Spark ML Estimator, sparkdl.KerasImageFileEstimator, for tuning hyperparameters using Spark ML tuning utilities.
KerasImageFileEstimator is deprecated in Databricks Runtime 6.2 ML.
Instead, use Hyperopt to distribute hyperparameter tuning for deep learning models.
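A minimal sketch of distributed tuning with Hyperopt might look like the following. The train_and_evaluate callable, the two hyperparameters, and their ranges are assumptions made for illustration; the function is expected to train a model with the given values and return a validation accuracy. SparkTrials runs each trial as a Spark task.

```python
def make_objective(train_and_evaluate):
    """Build a Hyperopt objective from a hypothetical training function."""
    def objective(params):
        accuracy = train_and_evaluate(**params)
        # Hyperopt minimizes the loss, so negate the accuracy.
        # "ok" is the value of hyperopt.STATUS_OK.
        return {"loss": -accuracy, "status": "ok"}
    return objective


def tune(train_and_evaluate, max_evals=16, parallelism=4):
    """Search an illustrative space, running trials on Spark workers."""
    # Imported lazily so make_objective can be used without Hyperopt installed.
    from hyperopt import SparkTrials, fmin, hp, tpe

    # Assumed search space; adjust names and ranges to the model being tuned.
    space = {
        # Samples exp(u) with u uniform in [-10, -2], roughly 4.5e-5 to 0.135.
        "learning_rate": hp.loguniform("learning_rate", -10, -2),
        "batch_size": hp.choice("batch_size", [16, 32, 64]),
    }
    return fmin(
        fn=make_objective(train_and_evaluate),
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=SparkTrials(parallelism=parallelism),
    )
```

Note that fmin reports the index of the selected option for hp.choice parameters rather than the option itself; hyperopt.space_eval can map the result back to actual values.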
Deep Learning Pipelines includes several Spark ML Transformers for distributing inference, all of which are deprecated in Databricks Runtime 6.2 ML.
Instead, use pandas UDFs to run inference on Spark DataFrames, following the examples in Model Inference.
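One possible shape for such an inference UDF is sketched below, using a plain scalar pandas UDF that returns a predicted class index per row. The load_model callable and the helper names are hypothetical; the model is assumed to accept a list of feature rows and return one list of class scores per row (a Keras model would typically need the rows stacked into an array first).

```python
def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def predict_classes(model, rows):
    """Run the model on a batch of rows and pick the top class for each."""
    return [argmax(scores) for scores in model.predict(rows)]


def make_predict_udf(load_model):
    """Wrap predict_classes in a scalar pandas UDF returning a long column."""
    # Imported lazily so the helpers above can be exercised without Spark.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # The dict is pickled (empty) along with the UDF, so each deserialized
    # copy loads the model on its first batch and reuses it afterwards.
    cache = {}

    @pandas_udf("long", PandasUDFType.SCALAR)
    def predict_udf(features):
        if "model" not in cache:
            cache["model"] = load_model()
        return pd.Series(predict_classes(cache["model"], features.tolist()))

    return predict_udf
```

The UDF could then be applied with something like df.withColumn("prediction", predict_udf("features")), where the column names are placeholders.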
Deep Learning Pipelines includes a utility, sparkdl.udf.keras_image_model.registerKerasImageUDF, for deploying a deep learning model as a UDF callable from Spark SQL.
registerKerasImageUDF is deprecated in Databricks Runtime 6.2 ML.
Instead, use MLflow to export the model as a UDF, following the example in _.
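As a hedged sketch of this approach, the code below builds a Spark UDF from a model logged to MLflow and registers it for use in Spark SQL. It assumes the model was logged under the artifact path "model" in a known run; the helper names, the default UDF name predict, and the double result type are illustrative choices, not requirements.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build an MLflow runs:/ URI for a model logged in the given run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_model_udf(spark, run_id, udf_name="predict", result_type="double"):
    """Load a logged model as a Spark UDF and register it for Spark SQL."""
    # Imported lazily so model_uri_for_run can be used without MLflow installed.
    import mlflow.pyfunc

    predict = mlflow.pyfunc.spark_udf(
        spark,
        model_uri=model_uri_for_run(run_id),
        result_type=result_type,
    )
    # Registration makes the UDF callable by name from SQL queries.
    spark.udf.register(udf_name, predict)
    return predict
```

After registration, a query along the lines of spark.sql("SELECT predict(feature_1, feature_2) FROM my_table") would invoke the model; the table and column names here are placeholders.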