Exporting Apache Spark ML Models and Pipelines

This section discusses the export part of an ML Model Export workflow; see Importing Models into Your Application for the import and scoring part of the workflow.

With Databricks ML Model Export, you can easily export your trained Apache Spark ML models and pipelines.

import com.databricks.ml.local.ModelExport
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
val model = lr.fit(trainingData)
// Export the trained model into the provided directory
ModelExport.exportModel(model, "<storage-location>")

For detailed code examples, see the example notebooks.

Model Export format

MLlib models are exported as JSON files whose layout matches the Spark ML persistence format. The key changes from MLlib’s format are:

  • Using JSON instead of Parquet
  • Adding extra metadata

The Model Export format has several benefits:

  • JSON: a simple, human-readable format that can be checked into version control systems
  • Matching MLlib format: Model Export stays in sync with MLlib standards and APIs
  • Scoring: the extra metadata from Databricks allows scoring outside of Spark
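Because the exported files are plain JSON, they can be inspected from any language without a Spark runtime. As a sketch, here is how the metadata file for a logistic regression model could be read with Python's standard json module; the snippet inlines an abridged copy of the metadata shown below instead of reading from disk, and the real files contain pure JSON without the // annotations used in this document:

```python
import json

# Abridged metadata for an exported LogisticRegressionModel
# (values are illustrative; a real workflow would read the file from
# the export directory instead of an inline string).
metadata_json = """
{"class":"org.apache.spark.ml.classification.LogisticRegressionModel",
 "paramMap":{
   "featuresCol":"features",
   "predictionCol":"prediction",
   "threshold":0.5
 },
 "sparkVersion":"2.1.0",
 "uid":"logreg_a99aee74cfef"}
"""

metadata = json.loads(metadata_json)

# Any tool can discover what the model expects:
model_class = metadata["class"].rsplit(".", 1)[-1]      # short class name
features_col = metadata["paramMap"]["featuresCol"]      # expected input field
prediction_col = metadata["paramMap"]["predictionCol"]  # expected output field

print(model_class, features_col, prediction_col)
```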

For example, exporting a logistic regression model produces a directory containing the following JSON files:

  • metadata, which contains the type of the model and how it was configured for training. This file matches MLlib’s metadata file.

    {"class":"org.apache.spark.ml.classification.LogisticRegressionModel",
     "paramMap":{
       "featuresCol":"features", // expected input field name
       "predictionCol":"prediction", // expected output filed name
       "aggregationDepth":2,
       "elasticNetParam":0.0,
       "family":"auto",
       "fitIntercept":true,
       "maxIter":100,
       "regParam":0.0,
       "standardization":true,
       "threshold":0.5,
       "tol":1.0E-6
     },
     "sparkVersion":"2.1.0",
     "timestamp":1488858051051,
     "uid":"logreg_a99aee74cfef"}
    
  • data, which contains the trained model parameters. This file matches MLlib’s data file.

    {"numClasses":2,
     "numFeatures":13,
     "interceptVector": {
       "type":1, // MLlib vector format: 0 for sparse vector, 1 for dense vector
       "values":[-8.44645260967139]
     },
     "coefficientMatrix":{
       "type":1,
       "numRows":1,
       "numCols":4,
       "values":[-0.01747691176982096,1.542111173068903,0.700895509427004,0.025215711086829903],
       "isTransposed":true
     },
     "isMultinomial":false}

  • dbmlMetadata, which contains extra information specific to Databricks ML Model Export.

    {"dbmlExportLibraryVersion":"0.1.1"}

Supported models

You can programmatically retrieve a list of supported models by calling ModelExport.supportedModels. As of Databricks Runtime 4.0, the following models are supported:

Note

Probabilistic classifiers (decision tree classifiers, logistic regression, random forest classifiers, and so on) can output an additional probability field containing a vector of class probabilities.
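For binary logistic regression, this scoring reduces to a dot product followed by the sigmoid. The sketch below, a minimal illustration rather than the actual Databricks scoring library, uses the coefficients and intercept from the example data file above; the two-element probability vector corresponds to the optional probability field, and the decision threshold comes from the metadata paramMap:

```python
import math

# Parameters copied from the example data and metadata files above
coefficients = [-0.01747691176982096, 1.542111173068903,
                0.700895509427004, 0.025215711086829903]
intercept = -8.44645260967139
threshold = 0.5  # "threshold" entry in the metadata paramMap

def score(features):
    """Score one feature vector: returns (probability vector, prediction)."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    p1 = 1.0 / (1.0 + math.exp(-margin))  # sigmoid -> P(class 1)
    probability = [1.0 - p1, p1]          # the optional "probability" field
    prediction = 1.0 if p1 > threshold else 0.0
    return probability, prediction

probability, prediction = score([1.0, 2.0, 3.0, 4.0])
```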

Exporting models from Databricks

The following notebooks demonstrate how to export ML models from Databricks.

Export model in Scala

This notebook is too large to display inline. Get notebook link.

Export model in Python

This notebook is too large to display inline. Get notebook link.