Use Apache Spark MLlib on Databricks

Apache Spark MLlib is the Apache Spark machine learning library. It provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives. Databricks recommends the following Apache Spark MLlib guides:

Example notebooks

The following notebooks demonstrate how to use various Apache Spark MLlib features on Databricks.

Binary classification example

This notebook shows you how to build a binary classification application using the Apache Spark MLlib Pipelines API.

Binary classification notebook

Decision trees examples

These examples demonstrate various applications of decision trees using the Apache Spark MLlib Pipelines API.

Decision trees

These notebooks show you how to perform classification with decision trees.

Decision trees for digit recognition notebook

Decision trees for SFO survey notebook

GBT regression using MLlib pipelines

This notebook shows you how to use MLlib pipelines to perform a regression using gradient boosted trees to predict bike rental counts (per hour) from information such as day of the week, weather, season, and so on.

Bike sharing regression notebook

Apache Spark MLlib pipelines and Structured Streaming example

This notebook shows how to train an Apache Spark MLlib pipeline on historical data and apply it to streaming data.

MLlib pipeline Structured Streaming notebook

Advanced Apache Spark MLlib example

This notebook illustrates how to create a custom transformer.

Custom transformer notebook

For reference information about MLlib features, Databricks recommends the following Apache Spark API reference:

For using Apache Spark MLlib from R, refer to the R machine learning documentation.

For Databricks support for visualizing machine learning algorithms, see Machine learning visualizations.