Model Export with MLeap

MLeap is a common serialization format and execution engine for machine learning pipelines. Pipelines trained with Apache Spark, scikit-learn, or TensorFlow can be exported to an MLeap Bundle. A serialized pipeline (bundle) can then be deserialized back into Apache Spark, scikit-learn, or TensorFlow graphs, or loaded as an MLeap pipeline. This notebook covers one case only: exporting a pipeline trained with Spark MLlib and loading it back. For an overview of the package and more examples, see the MLeap documentation.

import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning._

val df = spark.read.parquet("/databricks-datasets/news20.binary/data-001/training")
  .select("text", "topic")
df.cache()
display(df)

df.printSchema()

val labelIndexer = new StringIndexer()
  .setInputCol("topic")
  .setOutputCol("label")
  .setHandleInvalid("keep")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

val dt = new DecisionTreeClassifier()
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, hashingTF, dt))

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 2000))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)

val cvModel = cv.fit(df)

val model = cvModel.bestModel.asInstanceOf[PipelineModel]
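
Before relying on the best model, it can help to see how each setting in the parameter grid scored. The following is a minimal sketch, assuming the `cvModel` fitted in the cell above; the name `gridResults` is illustrative, and the metric is whatever `MulticlassClassificationEvaluator` computes with its default settings (F1, unless configured otherwise).

```scala
// Pair each parameter map in the grid with its average cross-validation metric.
// Assumes `cvModel` is the CrossValidatorModel fitted above.
val gridResults = cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics)

gridResults.foreach { case (params, metric) =>
  println(s"metric=$metric for:\n$params")
}
```

Each printed entry corresponds to one value of `hashingTF.numFeatures` from the grid, so the output shows whether the two settings differ meaningfully.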

val sparkTransformed = model.transform(df)

display(sparkTransformed)

%sh 
rm -rf /tmp/mleap_scala_model_export/
mkdir /tmp/mleap_scala_model_export/

import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._

// Pass the transformed DataFrame so that MLeap can read the pipeline's output schema
implicit val context = SparkBundleContext().withDataset(sparkTransformed)
// Save the pipeline to a zip file (the "jar:file:" prefix selects the zip format).
// MLeap can save a bundle to any supported java.nio.file.FileSystem.
(for(modelFile <- managed(BundleFile("jar:file:/tmp/mleap_scala_model_export/20news_pipeline-json.zip"))) yield {
  model.writeBundle.save(modelFile)(context)
}).tried.get

// Save the same pipeline again as an uncompressed directory (plain "file:" prefix)
(for(modelFile <- managed(BundleFile("file:/tmp/mleap_scala_model_export/20news_pipeline"))) yield {
  model.writeBundle.save(modelFile)(context)
}).tried.get

%sh ls /tmp/mleap_scala_model_export/

dbutils.fs.cp("file:/tmp/mleap_scala_model_export/20news_pipeline-json.zip", "dbfs:/example/20news_pipeline-json.zip")
display(dbutils.fs.ls("dbfs:/example"))

import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import resource._

val zipBundle = (for(bundle <- managed(BundleFile("jar:file:/tmp/mleap_scala_model_export/20news_pipeline-json.zip"))) yield {
  bundle.loadSparkBundle().get
}).opt.get

val loadedModel = zipBundle.root

val test_df = spark.read.parquet("/databricks-datasets/news20.binary/data-001/test")
  .select("text", "topic")
test_df.cache()
display(test_df)

val exampleResults = loadedModel.transform(test_df)

display(exampleResults)
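
A quick way to check the round trip is to compare the reloaded pipeline's predictions with those of the original in-memory pipeline on the same data. This is a sketch under the assumption that `model`, `loadedModel`, and `test_df` are the values defined in the cells above; the names `originalPredictions`, `reloadedPredictions`, and `mismatches` are illustrative.

```scala
// Compare predictions from the in-memory pipeline and the reloaded bundle.
// Assumes `model`, `loadedModel`, and `test_df` from the cells above.
val originalPredictions = model.transform(test_df).select("text", "prediction")
val reloadedPredictions = loadedModel.transform(test_df).select("text", "prediction")

// Rows produced by one pipeline but not by the other.
// If the export and import preserved the model, this set should be empty.
val mismatches = originalPredictions.except(reloadedPredictions)
  .union(reloadedPredictions.except(originalPredictions))
println(s"Mismatched rows: ${mismatches.count()}")
```

Note that `except` is a set difference, so duplicate rows are collapsed; for this check only the presence of any mismatch matters.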

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

val zipBundleM = (for(bundle <- managed(BundleFile("jar:file:/tmp/mleap_scala_model_export/20news_pipeline-json.zip"))) yield {
  bundle.loadMleapBundle().get
}).opt.get

val mleapPipeline = zipBundleM.root

import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import ml.combust.mleap.core.types._

val schema = StructType(StructField("text", ScalarType.String),
  StructField("topic", ScalarType.String)).get
val data = Seq(Row("From: ecktons@ucs.byu.edu (Sean Eckton) Subject: Re: Why is my mouse so JUMPY?  (MS MOUSE) Organization: Fine Arts and Communications -- Brigham Young University Distribution: world  My original post: >Subject: Re: Why is my mouse so JUMPY?  (MS MOUSE) >> I have a Microsoft Serial Mouse and am using mouse.com 8.00 (was using 8.20  >> I think, but switched to 8.00 to see if it was any better).  Vertical motion  >> is nice and smooth, but horizontal motion is so bad I sometimes can't click  >> on something because my mouse jumps around.  I can be moving the mouse to  >> the right with relatively uniform motion and the mouse will move smoothly  >> for a bit, then jump to the right, then move smoothly for a bit then jump  >> again (maybe this time to the left about .5 inch!).  This is crazy!  I have  >> never had so much trouble with a mouse before.  Anyone have any solutions?    Aha, I think I found the problem and it isn't dirt!  Another guy here was  using a different kind of mouse and was using 640x400x16 video driver (the  default VGA for Windows).  He has an S3 LocalBus card like I do and when I  loaded the S3 video driver in Windows for him, his mouse became jumpy too.   Seems like it is the S3 driver!  Is there any newer one than version 1.4  that would solve this problem?  It is really bad.  I have to use the  keyboard instead sometimes!  The s3-w31.zip on cica is version 1.4 (which is  the same version that came with my card).   --- Sean Eckton Computer Support Representative College of Fine Arts and Communications  D-406 HFAC Brigham Young University Provo, UT  84602 (801)378-3292  hfac_csr@byu.edu ecktons@ucs.byu.edu ", "comp.os.ms-windows.misc"))
val frame = DefaultLeapFrame(schema, data)

val frame2 = mleapPipeline.transform(frame).get
val data2 = frame2.dataset

// The prediction is stored in the column with index 7; print the schema to confirm:
frame2.schema.fields.zipWithIndex.foreach { case (field, idx) =>
  println(s"$idx $field")
}

// Get the prediction for Row 0
data2(0).getDouble(7)
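
A positional index breaks if stages are added to or removed from the pipeline. As an alternative, the column can be located by name. This is a minimal sketch, assuming `frame2` and `data2` from the cell above and that the classifier kept its default output column name, `prediction`; `predictionIdx` and `firstPrediction` are illustrative names.

```scala
// Locate the prediction column by name rather than by a hard-coded position.
// Assumes `frame2` and `data2` from the cell above; "prediction" is the
// default output column name of DecisionTreeClassifier.
val predictionIdx = frame2.schema.fields.indexWhere(_.name == "prediction")
require(predictionIdx >= 0, "no column named 'prediction' in the output frame")

// Read the prediction for the first row using the index found above.
val firstPrediction = data2(0).getDouble(predictionIdx)
println(s"prediction column index: $predictionIdx, value: $firstPrediction")
```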

Cluster Setup

In this Notebook

Training the Model by MLlib

Define ML Pipeline

Tune ML Pipeline

Export Trained Model with MLeap

Download Model Files

Import Trained Model with MLeap

Import to Spark

Import to MLeap