mleap-model-export-demo-python (Python)

Model export with MLeap

MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Apache Spark, scikit-learn, and TensorFlow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Apache Spark, scikit-learn, TensorFlow graphs, or an MLeap pipeline. This notebook demonstrates how to use MLeap to export models trained with MLlib. For an overview of the package and more examples, see the MLeap documentation.

Requirements

To use MLeap, you must create a cluster running Databricks Runtime 13.3 LTS ML or below. These versions of Databricks Runtime ML have a custom version of MLeap preinstalled.

Note: Databricks Runtime ML does not support open source MLeap.

In this notebook

This notebook demonstrates how to use MLeap to export a DecisionTreeClassifier from MLlib and how to load the saved PipelineModel to make predictions.

The basic workflow is as follows:

  • Model export
    • Fit a PipelineModel using MLlib.
    • Use MLeap to serialize the PipelineModel to a zip file or to a directory.
  • Move the PipelineModel files to your deployment project or data store.
  • In your project
    • Use MLeap to deserialize the saved PipelineModel.
    • Make predictions.

Train the model with MLlib

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Load the 20 Newsgroups training data, keeping only the text and topic columns.
df = spark.read.parquet("/databricks-datasets/news20.binary/data-001/training").select("text", "topic")
df.cache()
display(df)
df.printSchema()

Define ML pipeline

# Index topic labels, tokenize the text, hash tokens into feature vectors, and classify with a decision tree.
labelIndexer = StringIndexer(inputCol="topic", outputCol="label", handleInvalid="keep")
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
dt = DecisionTreeClassifier()
pipeline = Pipeline(stages=[labelIndexer, tokenizer, hashingTF, dt])

Tune ML pipeline

# Cross-validate over the hashing dimensionality and keep the best fitted pipeline.
paramGrid = ParamGridBuilder().addGrid(hashingTF.numFeatures, [1000, 2000]).build()
cv = CrossValidator(estimator=pipeline, evaluator=MulticlassClassificationEvaluator(), estimatorParamMaps=paramGrid)
cvModel = cv.fit(df)
model = cvModel.bestModel
sparkTransformed = model.transform(df)
display(sparkTransformed)

Use MLeap to export the trained model

MLeap supports serializing the model to a single zip file. To serialize to a zip file, make sure the URI begins with jar:file: and ends with .zip.

%sh 
rm -rf /tmp/mleap_python_model_export
mkdir /tmp/mleap_python_model_export
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

# Importing mleap.pyspark attaches serializeToBundle() to PySpark models.
model.serializeToBundle("jar:file:/tmp/mleap_python_model_export/20news_pipeline-json.zip", sparkTransformed)
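
The workflow above also mentions serializing to a directory. As a sketch, assuming MLeap's convention that a plain file: URI selects the directory bundle format (the output path here is hypothetical):

# Hypothetical alternative: a plain file: URI writes the bundle as a directory instead of a zip.
model.serializeToBundle("file:/tmp/mleap_python_model_export/20news_pipeline-json-dir", sparkTransformed)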

Download model files

In this example, you download the model files through the browser. In general, you may want to programmatically move the model to a persistent storage layer.

dbutils.fs.cp("file:/tmp/mleap_python_model_export/20news_pipeline-json.zip", "dbfs:/FileStore/example/20news_pipeline-json.zip")
display(dbutils.fs.ls("dbfs:/FileStore/example"))

Files under dbfs:/FileStore are served by the workspace, so you can download the zip at: https://<databricks-instance>/files/example/20news_pipeline-json.zip.

Use MLeap to import the trained model

This section shows how to use MLeap to load a trained model for use in your application. To use existing ML models and pipelines to make predictions for new data, you can deserialize the model from the file you saved.

Import model to PySpark

This section shows how to load an MLeap bundle and make predictions on a Spark DataFrame. This can be useful if you want to use the same persistence format (bundle) for loading into Spark and non-Spark applications. If your goal is to make predictions only in Spark, use MLlib's native ML persistence.
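
For that Spark-only case, here is a minimal sketch of MLlib's native persistence (the dbfs:/example/20news_pipeline_native path is just an illustration):

from pyspark.ml import PipelineModel

# Save the best model with MLlib's native ML persistence...
model.write().overwrite().save("dbfs:/example/20news_pipeline_native")

# ...and load it back for scoring in another Spark application.
nativeModel = PipelineModel.load("dbfs:/example/20news_pipeline_native")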

from pyspark.ml import PipelineModel

# deserializeFromBundle() is available on PipelineModel because mleap.pyspark was imported above.
deserializedPipeline = PipelineModel.deserializeFromBundle("jar:file:/tmp/mleap_python_model_export/20news_pipeline-json.zip")

Use the loaded model to make predictions.

test_df = spark.read.parquet("/databricks-datasets/news20.binary/data-001/test").select("text", "topic")
test_df.cache()
display(test_df)
exampleResults = deserializedPipeline.transform(test_df)
display(exampleResults)

Import to MLeap

The primary use of MLeap is to score models in applications that do not have Spark available. Because the MLeap runtime is a Scala/Java library, these applications should be implemented in Scala or Java.
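
As a minimal sketch of the Scala side, assuming the MLeap runtime (ml.combust.mleap:mleap-runtime) and the scala-arm resource library are on the classpath; the input row values are hypothetical:

import ml.combust.bundle.BundleFile
import ml.combust.mleap.core.types._
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import resource._

// Load the bundle written by this notebook; no SparkContext is required.
val pipeline = (for (bf <- managed(BundleFile("jar:file:/tmp/mleap_python_model_export/20news_pipeline-json.zip")))
  yield bf.loadMleapBundle().get.root).tried.get

// Build a LeapFrame that matches the input schema the pipeline was trained on.
val schema = StructType(
  StructField("text", ScalarType.String),
  StructField("topic", ScalarType.String)).get
val frame = DefaultLeapFrame(schema, Seq(Row("example document text", "sci.space")))

// Transform entirely in the MLeap runtime and read back the predictions.
val predictions = pipeline.transform(frame).get.select("prediction").get.dataset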