MLlib and Machine Learning

This is the main machine learning (ML) guide. It provides an overview of ML capabilities in Databricks and Apache Spark, with links to other currently available guides.

This section of the Databricks docs covers Apache Spark MLlib, Databricks ML Model Export, and third-party ML libraries.

For deep learning libraries and integrations, see Deep Learning. For GraphFrames and other graph analytics libraries, see GraphX and GraphFrames.

MLlib

MLlib is Apache Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.
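
As a quick, self-contained illustration of the DataFrame-based API, the following sketch fits a logistic regression model on a small made-up dataset; the column names, data, and hyperparameter values are placeholders, not anything prescribed by the Databricks docs.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    // In a Databricks notebook, `spark` is already provided; this builder is for standalone use.
    val spark = SparkSession.builder.appName("MLlibExample").getOrCreate()
    import spark.implicits._

    // Toy training data: (label, features) rows using MLlib dense vectors.
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    // Fit a logistic regression model with a couple of common hyperparameters.
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val model = lr.fit(training)

    // Score the training data; the "prediction" column holds the predicted label.
    model.transform(training).select("features", "prediction").show()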

Databricks recommends the following Apache Spark MLlib guides:

To use MLlib with R, see the Spark R Guide.

Databricks ML Model Export

Databricks ML Model Export allows you to export models and full ML pipelines from Apache Spark. These exported models and pipelines can be imported into other (Spark and non-Spark) platforms to do scoring and make predictions. Model Export is targeted at low-latency, lightweight ML-powered applications. With Model Export, you can:

  • Use an existing model deployment system
  • Achieve very low latency (milliseconds)
  • Use ML models and pipelines in custom deployments

Workflow

A typical workflow using Model Export involves three steps:

  1. Fit an ML model in Databricks using Apache Spark MLlib (steps 1 and 2 are sketched in the example after this list).

  2. Export the model (as JSON files) in Databricks.

  3. Import the model into an external system.

    The scoring (a.k.a. inference) library takes JSON-encoded features.

    {"id":5923937,  // any metadata
     "features:": { // MLlib vector format: 0 for sparse vector, 1 for dense vector
       "type": 1,
       "values":[0.1, 1.3, 8.4, 4.2]}}
    

    The result is also encoded in JSON.

    {"id":5923937,
     "prediction": 1.0}