mleap-model-export-demo-python (Python)

Model export with MLeap

MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Apache Spark, scikit-learn, and TensorFlow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Apache Spark, scikit-learn, TensorFlow graphs, or an MLeap pipeline. This notebook demonstrates how to use MLeap to export models trained with MLlib. For an overview of the package and more examples, see the MLeap documentation.

Requirements

To use MLeap, you must create a cluster running Databricks Runtime 13.3 LTS ML or below. These versions of Databricks Runtime ML have a custom version of MLeap preinstalled.

Note: Databricks Runtime ML does not support open source MLeap.

In this notebook

This notebook demonstrates how to use MLeap to export a DecisionTreeClassifier from MLlib and how to load the saved PipelineModel to make predictions.

The basic workflow is as follows:

  • Model export
    • Fit a PipelineModel using MLlib.
    • Use MLeap to serialize the PipelineModel to a zip file or to a directory.
  • Move the PipelineModel files to your deployment project or data store.
  • In your project
    • Use MLeap to deserialize the saved PipelineModel.
    • Make predictions.

Train the model with MLlib

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Load the 20 Newsgroups training data, keeping only the text and topic columns.
df = spark.read.parquet("/databricks-datasets/news20.binary/data-001/training").select("text", "topic")
df.cache()
display(df)
df.printSchema()

Define ML pipeline

# Index topic labels, tokenize the text, hash tokens into feature vectors, and classify with a decision tree.
labelIndexer = StringIndexer(inputCol="topic", outputCol="label", handleInvalid="keep")
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
dt = DecisionTreeClassifier()
pipeline = Pipeline(stages=[labelIndexer, tokenizer, hashingTF, dt])

Tune ML pipeline

# Cross-validate over the hashing dimensionality and keep the best fitted pipeline.
paramGrid = ParamGridBuilder().addGrid(hashingTF.numFeatures, [1000, 2000]).build()
cv = CrossValidator(estimator=pipeline, evaluator=MulticlassClassificationEvaluator(), estimatorParamMaps=paramGrid)
cvModel = cv.fit(df)
model = cvModel.bestModel
sparkTransformed = model.transform(df)
display(sparkTransformed)

Use MLeap to export the trained model

MLeap supports serializing the model to a single zip file. To serialize to a zip file, make sure the URI begins with jar:file: and ends with .zip.

%sh 
rm -rf /tmp/mleap_python_model_export
mkdir /tmp/mleap_python_model_export
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

# Importing mleap.pyspark attaches serializeToBundle() to PySpark models.
model.serializeToBundle("jar:file:/tmp/mleap_python_model_export/20news_pipeline-json.zip", sparkTransformed)
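
The workflow above also mentions serializing to a directory. As a sketch, assuming MLeap's convention that a plain file: URI selects the directory bundle format (the output path here is hypothetical):

# Hypothetical alternative: a plain file: URI writes the bundle as a directory instead of a zip.
model.serializeToBundle("file:/tmp/mleap_python_model_export/20news_pipeline-json-dir", sparkTransformed)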

Download model files

In this example, you download the model files through the browser. In general, you may want to programmatically move the model to a persistent storage layer.

dbutils.fs.cp("file:/tmp/mleap_python_model_export/20news_pipeline-json.zip", "dbfs:/FileStore/example/20news_pipeline-json.zip")
display(dbutils.fs.ls("dbfs:/FileStore/example"))

Files under dbfs:/FileStore are served by the workspace, so you can download the zip at: https://<databricks-instance>/files/example/20news_pipeline-json.zip.

Use MLeap to import the trained model

This section shows how to use MLeap to load a trained model for use in your application. To use existing ML models and pipelines to make predictions for new data, you can deserialize the model from the file you saved.

Import model to PySpark

This section shows how to load an MLeap bundle and make predictions on a Spark DataFrame. This can be useful if you want to use the same persistence format (bundle) for loading into Spark and non-Spark applications. If your goal is to make predictions only in Spark, use MLlib's native ML persistence.
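
For that Spark-only case, here is a minimal sketch of MLlib's native persistence (the dbfs:/example/20news_pipeline_native path is just an illustration):

from pyspark.ml import PipelineModel

# Save the best model with MLlib's native ML persistence...
model.write().overwrite().save("dbfs:/example/20news_pipeline_native")

# ...and load it back for scoring in another Spark application.
nativeModel = PipelineModel.load("dbfs:/example/20news_pipeline_native")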

from pyspark.ml import PipelineModel

# deserializeFromBundle() is available on PipelineModel because mleap.pyspark was imported above.
deserializedPipeline = PipelineModel.deserializeFromBundle("jar:file:/tmp/mleap_python_model_export/20news_pipeline-json.zip")

Use the loaded model to make predictions.

test_df = spark.read.parquet("/databricks-datasets/news20.binary/data-001/test").select("text", "topic")
test_df.cache()
display(test_df)
exampleResults = deserializedPipeline.transform(test_df)
display(exampleResults)

Import to MLeap

The primary use of MLeap is to score models in applications that do not have Spark available. Because the MLeap runtime is a Scala/Java library, these applications should be implemented in Scala or Java.
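
As a minimal sketch of the Scala side, assuming the MLeap runtime (ml.combust.mleap:mleap-runtime) and the scala-arm resource library are on the classpath; the input row values are hypothetical:

import ml.combust.bundle.BundleFile
import ml.combust.mleap.core.types._
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import resource._

// Load the bundle written by this notebook; no SparkContext is required.
val pipeline = (for (bf <- managed(BundleFile("jar:file:/tmp/mleap_python_model_export/20news_pipeline-json.zip")))
  yield bf.loadMleapBundle().get.root).tried.get

// Build a LeapFrame that matches the input schema the pipeline was trained on.
val schema = StructType(
  StructField("text", ScalarType.String),
  StructField("topic", ScalarType.String)).get
val frame = DefaultLeapFrame(schema, Seq(Row("example document text", "sci.space")))

// Transform entirely in the MLeap runtime and read back the predictions.
val predictions = pipeline.transform(frame).get.select("prediction").get.dataset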