This topic provides an overview of machine learning capabilities in Databricks.
Databricks provides Databricks Runtime for Machine Learning (Databricks Runtime ML), a machine learning runtime that contains multiple popular libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also supports distributed training using Horovod. Databricks Runtime ML provides a ready-to-go environment for machine learning and data science, freeing you from having to install and configure these libraries on your cluster.
Apache Spark MLlib is the Apache Spark machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives. Databricks recommends the following Apache Spark MLlib guides:
To use MLlib from R, see the R machine learning documentation.
For details on Databricks support for visualizing machine learning algorithms, see Machine learning visualizations.
The following topics and notebooks demonstrate how to use various Spark MLlib features in Databricks.
After you develop an ML model, the next step is to productionize the trained model. A typical productionization workflow in Databricks involves the following steps:
- Export a trained model.
- Import the model into an external system.
Databricks supports two methods to export and import models and full ML pipelines from Apache Spark: MLeap and Databricks ML Model Export.
MLeap, which Databricks recommends, is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.
You can also use Databricks ML Model Export to export models and ML pipelines. These exported models and pipelines can be imported into other (Spark and non-Spark) platforms for scoring and prediction.
This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Databricks.