get-started-machine-learning(Python)

Get started: Build your first machine learning model on Databricks

This example notebook illustrates how to train a machine learning classification model on Databricks. Databricks Runtime for Machine Learning comes with many libraries pre-installed, including scikit-learn for training and pre-processing algorithms, MLflow to track the model development process, and Hyperopt with SparkTrials to scale hyperparameter tuning.

In this notebook, you create a classification model to predict whether a wine is considered "high-quality". The dataset[1] consists of 11 features of different wines (for example, alcohol content, acidity, and residual sugar) and a quality ranking from 1 to 10.

This tutorial covers:

  • Part 1: Train a classification model with MLflow tracking
  • Part 2: Hyperparameter tuning to improve model performance
  • Part 3: Save results and models to Unity Catalog

For more details on productionizing machine learning on Databricks, including model lifecycle management and model inference, see the ML End to End Example (AWS | Azure | GCP).

[1] The example uses a dataset from the UCI Machine Learning Repository, presented in Modeling wine preferences by data mining from physicochemical properties [Cortez et al., 2009].

Requirements

  • Cluster running Databricks Runtime 13.3 LTS ML or above

Setup

In this section, you do the following:

  • Configure the MLflow client to use Unity Catalog as the model registry.
  • Set the catalog and schema where the model will be registered.
  • Read in the data and save it to tables in Unity Catalog.
  • Preprocess the data.

Configure MLflow client

By default, the MLflow Python client creates models in the Databricks workspace model registry. To save models in Unity Catalog, configure the MLflow client as shown in the following cell.

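A minimal sketch of that configuration:

```python
import mlflow

# Point the MLflow client at Unity Catalog instead of the
# workspace model registry.
mlflow.set_registry_uri("databricks-uc")
```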

The following cell sets the catalog and schema where the model will be registered. You must have the USE CATALOG privilege on the catalog, and the USE SCHEMA, CREATE TABLE, and CREATE MODEL privileges on the schema. Change the catalog and schema names in the following cell if necessary.

For more information about how to use Unity Catalog, see the Unity Catalog documentation (AWS | Azure | GCP).

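A sketch of that setup; `main` and `default` are placeholder names used in the rest of the sketches below, so substitute a catalog and schema you have access to:

```python
# Placeholder names -- replace with a catalog and schema where you hold
# the privileges listed above.
CATALOG_NAME = "main"
SCHEMA_NAME = "default"
```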

Read in data and save it to tables in Unity Catalog

The dataset is available in databricks-datasets. In the following cell, you read the data in from .csv files into Spark DataFrames. You then write the DataFrames to tables in Unity Catalog. This both persists the data and lets you control how to share it with others.

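A sketch of that step, assuming the standard wine-quality CSVs that ship under `/databricks-datasets/` (semicolon-delimited, with a header row). Column names are normalized because Unity Catalog table columns cannot contain spaces:

```python
# Read the red and white wine CSVs into Spark DataFrames.
white_wine = spark.read.csv(
    "dbfs:/databricks-datasets/wine-quality/winequality-white.csv",
    sep=";", header=True, inferSchema=True,
)
red_wine = spark.read.csv(
    "dbfs:/databricks-datasets/wine-quality/winequality-red.csv",
    sep=";", header=True, inferSchema=True,
)

# Replace spaces in column names with underscores, then persist each
# DataFrame as a Unity Catalog table.
for table_name, df in [("white_wine", white_wine), ("red_wine", red_wine)]:
    for col in df.columns:
        df = df.withColumnRenamed(col, col.replace(" ", "_"))
    df.write.mode("overwrite").saveAsTable(
        f"{CATALOG_NAME}.{SCHEMA_NAME}.{table_name}"
    )
```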

Preprocess data

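A sketch of the preprocessing cells; the binary "high-quality" cutoff (quality of 7 or above) and the 80/20 train/test split are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the tables back and combine them into one pandas DataFrame,
# tagging each row with the wine type.
white_df = spark.read.table(f"{CATALOG_NAME}.{SCHEMA_NAME}.white_wine").toPandas()
red_df = spark.read.table(f"{CATALOG_NAME}.{SCHEMA_NAME}.red_wine").toPandas()
white_df["is_red"] = 0
red_df["is_red"] = 1
data = pd.concat([white_df, red_df], axis=0)

# Turn the 1-10 quality score into a binary "high quality" label
# (a cutoff at 7 is an assumption for this sketch).
data["quality"] = (data["quality"] >= 7).astype(int)

# Hold out 20% of the rows as a test set.
X = data.drop("quality", axis=1)
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
```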

Part 1. Train a classification model

Enable MLflow autologging

Next, enable MLflow autologging and train a classifier within the context of an MLflow run; autologging automatically records the trained model and many associated metrics and parameters.

You can supplement the logging with additional metrics such as the model's AUC score on the test dataset.
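
A minimal sketch of turning autologging on:

```python
import mlflow

# With autologging enabled, scikit-learn parameters, training metrics,
# and the fitted model itself are captured on each run automatically.
mlflow.autolog()
```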

Train the model
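
A sketch of the training cell; the random forest and its settings are illustrative choices, and `X_train`, `y_train`, `X_test`, and `y_test` come from the preprocessing sketch above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import mlflow

with mlflow.start_run(run_name="untuned_random_forest"):
    model = RandomForestClassifier(n_estimators=100, random_state=123)
    model.fit(X_train, y_train)

    # Supplement the autologged metrics with AUC on the held-out test set.
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", test_auc)
```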

View MLflow runs

To view the logged training run, click the Experiment icon at the upper right of the notebook to display the experiment sidebar. If necessary, click the refresh icon to fetch and monitor the latest runs.

To display the more detailed MLflow experiment page, click the experiment page icon. This page allows you to compare runs and view details for specific runs (AWS | Azure | GCP).

Load models

You can also access the results for a specific run using the MLflow API. The code in the following cell illustrates how to load the model trained in a given MLflow run and use it to make predictions. You can also find code snippets for loading specific models on the MLflow run page (AWS | Azure | GCP).

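A sketch of loading a logged model by run ID and scoring with it; `mlflow.last_active_run()` is used here as one way to pick up the run from the previous cell:

```python
import mlflow

# Build a runs:/ URI for the model logged in the most recent run,
# load it as a generic pyfunc model, and make predictions.
run_id = mlflow.last_active_run().info.run_id
loaded_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
predictions = loaded_model.predict(X_test)
```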

Part 2. Hyperparameter tuning

At this point, you have trained a simple model and used the MLflow tracking service to organize your work. Next, you can perform more sophisticated tuning using Hyperopt.

Parallel training with Hyperopt and SparkTrials

Hyperopt is a Python library for hyperparameter tuning. For more information about using Hyperopt in Databricks, see the documentation (AWS | Azure | GCP).

You can use Hyperopt with SparkTrials to run hyperparameter sweeps and train multiple models in parallel. This reduces the time required to optimize model performance. MLflow tracking is integrated with Hyperopt to automatically log models and parameters.

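A sketch of a sweep; the search space, parallelism, and evaluation budget are illustrative, and each trial logs its own `test_auc` to a nested MLflow run so that Part 3 can find the best one:

```python
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import mlflow

# Illustrative search space over tree count and depth.
search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 10),
    "max_depth": hp.quniform("max_depth", 2, 20, 1),
}

def objective(params):
    # Each evaluation runs on a Spark worker and is logged as a nested run.
    with mlflow.start_run(nested=True):
        model = RandomForestClassifier(
            n_estimators=int(params["n_estimators"]),
            max_depth=int(params["max_depth"]),
            random_state=123,
        )
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        mlflow.log_metric("test_auc", auc)
        # Hyperopt minimizes the loss, so negate the AUC.
        return {"loss": -auc, "status": STATUS_OK}

# SparkTrials fans the trials out across the cluster's workers.
spark_trials = SparkTrials(parallelism=4)
with mlflow.start_run(run_name="hyperopt_sweep"):
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=16,
        trials=spark_trials,
    )
```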

Search runs to retrieve the best model

Because MLflow tracks all of the runs, you can use the MLflow search runs API to retrieve the metrics and parameters for the best run, that is, the tuning run with the highest test AUC.

This tuned model should perform better than the simpler models trained in Part 1.

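A sketch using `mlflow.search_runs`, ordering by the `test_auc` metric logged in the sweep sketch above:

```python
import mlflow

# Return runs as a pandas DataFrame, sorted so the best run comes first.
best_run = mlflow.search_runs(
    order_by=["metrics.test_auc DESC"], max_results=1
).iloc[0]
print(f"Best run id: {best_run.run_id}, test AUC: {best_run['metrics.test_auc']:.3f}")
```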

Part 3. Save results and models to Unity Catalog

Write results back to Unity Catalog
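
A sketch of writing the best model's test-set predictions to a table; `best_run` comes from the search above, and the table name is a placeholder:

```python
import mlflow

# Load the best model and attach its predictions to the test features.
best_model = mlflow.pyfunc.load_model(f"runs:/{best_run.run_id}/model")
results = X_test.copy()
results["prediction"] = best_model.predict(X_test)

# Persist the scored rows as a Unity Catalog table so they can be
# shared and queried with SQL.
spark.createDataFrame(results).write.mode("overwrite").saveAsTable(
    f"{CATALOG_NAME}.{SCHEMA_NAME}.wine_quality_predictions"
)
```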

Save model to Unity Catalog
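
A sketch of registering the best run's model under a three-level Unity Catalog name (the model name here is a placeholder):

```python
import mlflow

# Register the model from the best run. With the registry URI set to
# Unity Catalog earlier, the name must be catalog.schema.model.
model_uri = f"runs:/{best_run.run_id}/model"
mlflow.register_model(model_uri, f"{CATALOG_NAME}.{SCHEMA_NAME}.wine_quality_model")
```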