This is the main machine learning (ML) guide. It provides an overview of ML capabilities in Databricks and Apache Spark.
MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions:
- Tracking experiments to record and compare parameters and results (MLflow Tracking).
- Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production (MLflow Projects).
- Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models).
MLflow is in Alpha. For information about MLflow, see the MLflow documentation.
The following topics provide an introduction to using MLflow on Databricks.
The first two topics provide an MLflow Quick Start. The first part of the Quick Start shows how to train ElasticNet models on a diabetes dataset and log the training parameters, metrics, and trained model to an MLflow tracking server. The second part of the Quick Start shows how to deploy the trained model on AWS SageMaker and use it to generate predictions.
The last topic shows how to fit a neural network on MNIST handwritten digit recognition data using PyTorch, log results to an MLflow tracking server, and view the results in the MLflow UI and TensorBoard.
To provide a ready-to-go environment for machine learning and data science, Databricks has developed Databricks Runtime ML, a machine learning runtime that contains multiple popular libraries, including TensorFlow, Keras, and XGBoost. It also supports distributed TensorFlow training using Horovod. Databricks Runtime ML frees you from having to install and configure these libraries on your Spark cluster yourself.
This runtime is in Beta. For information about the libraries included and how to create a cluster that uses Databricks Runtime ML, see Databricks Runtime ML.
Apache Spark MLlib is Apache Spark's scalable machine learning library. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives. MLlib integrates with other Spark components such as Spark SQL, Spark Streaming, and DataFrames, and is installed in the Databricks runtime.
Databricks recommends the following Apache Spark MLlib guides:
To use MLlib with R, refer to the Spark R Guide documentation.
The following topics and notebooks demonstrate how to use various Spark MLlib features in Databricks.
After you build and test ML models, the next step is to productionize the trained models. A typical productionization workflow in Databricks involves three steps:
- Fit an ML model using Apache Spark MLlib.
- Export the model.
- Import the model into an external system.
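The three steps above follow a general fit, export, import pattern. The sketch below shows only the shape of that pattern using the Python standard library: a plain least-squares fit stands in for Apache Spark MLlib, and a JSON file stands in for the exported model format. The function names `fit_linear`, `export_model`, and `import_model`, the file name, and the toy data are hypothetical and are not part of any Databricks or Spark API.

```python
import json
import os
import tempfile


def fit_linear(xs, ys):
    """Step 1: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"type": "linear", "slope": slope, "intercept": mean_y - slope * mean_x}


def export_model(model, path):
    """Step 2: serialize the fitted parameters to a portable file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)


def import_model(path):
    """Step 3: load the parameters in another system and return a scorer."""
    with open(path, encoding="utf-8") as f:
        params = json.load(f)
    return lambda x: params["slope"] * x + params["intercept"]


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.json")
    # Toy training data that lies exactly on y = 2x + 1.
    export_model(fit_linear([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]), model_path)
    # The importing side needs only the file, not the training code or data.
    predict = import_model(model_path)
    prediction = predict(5.0)

print(prediction)
```

The point of the design is that the importing side depends only on the serialized file and its format, so the scoring system does not need the training environment.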
There are two ways to export and import models and full ML pipelines from Apache Spark: MLeap and Databricks ML Model Export.
Databricks recommends MLeap, which is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.
Databricks also supports Databricks ML Model Export for exporting models and ML pipelines. You can import these exported models and pipelines into other platforms (Spark and non-Spark) to score new data and make predictions.
- Exporting and importing ML Models and Pipelines with Databricks ML Model Export
- Exporting Apache Spark ML Models and Pipelines
- Importing Models into Your Application
This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Databricks.
- Third-Party Machine Learning Integrations
- H2O Sparkling Water
- XGBoost versions
- Install XGBoost
- Test the XGBoost installation
- Integrate XGBoost with ML pipelines