Machine Learning

This topic provides an overview of machine learning capabilities in Databricks.

Databricks Runtime for Machine Learning

Databricks provides Databricks Runtime for Machine Learning (Databricks Runtime ML), a machine learning runtime that contains multiple popular libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also supports distributed training using Horovod. Databricks Runtime ML provides a ready-to-go environment for machine learning and data science, freeing you from having to install and configure these libraries on your cluster.
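As a quick sanity check, a notebook cell along the following lines can confirm which of the libraries named above are importable on a cluster. This is a minimal sketch, not part of Databricks Runtime ML itself: the helper `available_libraries` is hypothetical, and the module names in `ML_LIBRARIES` are assumed import names for the listed packages.

```python
from importlib.util import find_spec

# Assumed top-level import names for the libraries listed above.
ML_LIBRARIES = ["tensorflow", "torch", "keras", "xgboost", "horovod"]


def available_libraries(names):
    """Map each top-level module name to True if it can be located for import."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Report each library without actually importing it.
    for name, present in available_libraries(ML_LIBRARIES).items():
        print(f"{name}: {'available' if present else 'missing'}")
```

Because `find_spec` only locates a module and does not import it, the check stays fast even for large frameworks.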

Apache Spark MLlib

Apache Spark MLlib is the Apache Spark machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives. Databricks recommends the following Apache Spark MLlib guides:

For using MLlib from R, refer to the R machine learning documentation.

For Databricks support for visualizing machine learning algorithms, see Machine learning visualizations.

The following topics and notebooks demonstrate how to use various Spark MLlib features in Databricks.

Hyperparameter Tuning

Databricks includes multiple tools for hyperparameter tuning. Hyperparameters are ML algorithm configurations that are traditionally tuned by hand. Common hyperparameters include regularization parameters, the number of epochs for deep learning training, and the depth of decision trees. Automated hyperparameter tuning uses methods such as cross-validation or train-validation splits to tune these configurations from data. These automated methods integrate with MLflow, which makes it easier to track and manage your hyperparameters.
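To make the idea of tuning from data concrete, the following sketch selects a regularization parameter by k-fold cross-validation over a small grid. It is a conceptual illustration in plain Python, not the Databricks or MLlib tooling: the one-feature ridge model and the helpers `fit_ridge`, `cv_error`, and `grid_search` are hypothetical, and the demo data is made up.

```python
def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept: w = sum(xy) / (sum(xx) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, k):
    """Mean validation MSE over k folds for a given regularization value."""
    fold_errors = []
    for fold in range(k):
        # Point i belongs to validation fold (i % k); all other points train the model.
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for i, p in pairs if i % k != fold]
        valid = [p for i, p in pairs if i % k == fold]
        w = fit_ridge([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(sum((w * x - y) ** 2 for x, y in valid) / len(valid))
    return sum(fold_errors) / k


def grid_search(xs, ys, grid, k=5):
    """Score every candidate in the grid and return (best value, all scores)."""
    scores = {lam: cv_error(xs, ys, lam, k) for lam in grid}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Made-up data: roughly y = 2x with a small alternating disturbance.
    xs = [float(i) for i in range(1, 11)]
    ys = [2.0 * x + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]
    best, scores = grid_search(xs, ys, [0.0, 0.1, 1.0, 10.0])
    print("best regularization:", best)
```

A train-validation split is the special case of one held-out fold: it is cheaper to run but gives a noisier estimate than averaging over k folds.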

Exporting and Importing ML Models

After you develop an ML model, the next step is to put the trained model into production. A typical productionization workflow in Databricks involves the following steps:

  1. Export a trained model.
  2. Import the model into an external system.

Databricks supports two methods to export and import models and full ML pipelines from Apache Spark: MLeap and Databricks ML Model Export.

MLeap, which Databricks recommends, is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.

You can also use Databricks ML Model Export to export models and ML pipelines. You can import these exported models and pipelines into other platforms, both Spark and non-Spark, to score data and make predictions.
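The export-then-import workflow can be illustrated with a deliberately simple stand-in. The sketch below uses neither MLeap nor Databricks ML Model Export; it assumes a hypothetical linear model whose parameters are written to a JSON file and then loaded by a separate scoring function, only to show the shape of the two steps.

```python
import json
import os
import tempfile


def export_model(weights, bias, path):
    """Step 1: write the trained parameters to a portable file."""
    with open(path, "w") as f:
        json.dump({"weights": weights, "bias": bias}, f)


def import_model(path):
    """Step 2: load the parameters in the external system and return a scorer."""
    with open(path) as f:
        params = json.load(f)

    def predict(features):
        # Linear score: dot(weights, features) + bias.
        return sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]

    return predict


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model_path = os.path.join(tmp, "model.json")
        export_model([0.5, -1.0], 2.0, model_path)  # made-up parameters
        predict = import_model(model_path)
        print(predict([1.0, 2.0]))
```

The scoring side depends only on the file format, not on the training environment, which is the property that lets an exported pipeline run on a platform other than the one that trained it.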

Third-Party Libraries

This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Databricks.

Advanced Topics

For guides on advanced topics in machine learning, see: