Experiment Tracking with MLflow

Note

This section describes MLflow features that are in Private Preview. To request access to the preview, contact your Databricks sales representative. If you are not participating in the preview, see the MLflow open-source documentation for information on how to run standalone MLflow.

The MLflow Tracking component consists of:

  • An API for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results. The MLflow Tracking component lets you log and query experiments using Java, Python, R, and REST APIs.

  • A UI that lets you visualize, search, and compare runs, and download run artifacts or metadata for analysis in other tools.

    [Image: the MLflow tracking web UI]

    The UI contains the following key features:

    • Experiment-based run listing and comparison
    • Searching for runs by parameter or metric value
    • Visualizing run metrics
    • Downloading run results

MLflow runs are recorded in local files or on a tracking server. When running locally, MLflow logs to a local mlruns directory by default; run mlflow ui in the directory that contains mlruns to display the tracking UI. When running in a notebook attached to a cluster running Databricks Runtime 5.0 and above, MLflow is automatically configured to log to a Databricks-hosted tracking server. The MLflow tracking server also serves the UI and enables remote storage of run artifacts.
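
For example, the following minimal sketch records a run locally; the parameter and metric values are illustrative:

    import mlflow

    # With no tracking URI configured, runs are recorded in ./mlruns
    with mlflow.start_run():
        mlflow.log_param("alpha", 0.5)
        mlflow.log_metric("rmse", 0.78)

Running mlflow ui in the directory that contains mlruns then displays this run in the tracking UI.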

MLflow tracking servers

There are two types of MLflow tracking servers:

  • A server that you run. To set up your own tracking server, follow the instructions in MLflow Tracking Servers and configure your connection to your tracking server by running mlflow.set_tracking_uri (see the sketch after this list). To view the MLflow UI of a tracking server you run, go to https://<mlflow-tracking-server>:5000.
  • A Databricks tracking server running in a Databricks workspace. To view the MLflow UI of a Databricks tracking server, go to https://<databricks-instance>/mlflow.
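
As a minimal sketch, you can point the MLflow client at a tracking server you run yourself; the host and port below are placeholders for your own deployment:

    import mlflow

    # Subsequent runs are recorded on this server instead of in local files
    mlflow.set_tracking_uri("https://<mlflow-tracking-server>:5000")

    with mlflow.start_run():
        mlflow.log_param("alpha", 0.5)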

Important

  • When your cluster runs Databricks Runtime 5.0 and above, the MLflow API logs results to a Databricks MLflow tracking server by default. If you have not requested access to a Databricks MLflow tracking server, you must specify the URI of an MLflow tracking server using mlflow.set_tracking_uri.

  • When your cluster runs Databricks Runtime 4.3 and below and you are using a Databricks MLflow tracking server, you must set the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN as follows:

    1. On your laptop using the Databricks CLI, run:

      databricks secrets create-scope --scope <token-scope> --profile <profile>
      databricks secrets put --scope <token-scope> --profile <profile> --key token
      

      where <token-scope> is a secret scope and <profile> is a Databricks CLI connection profile.

    2. In a notebook, run:

      import os

      # Point MLflow at your Databricks workspace
      os.environ['DATABRICKS_HOST'] = 'https://<databricks-instance>'
      # Read the API token stored in the secret scope created in step 1
      os.environ['DATABRICKS_TOKEN'] = dbutils.secrets.get(scope="<token-scope>", key="token")
      

Tracking examples

Quick start training

This notebook is part 1 of a Quick Start guide based on the MLflow tutorial. The notebook shows how to:

  • Install MLflow on a Databricks cluster
  • Train a scikit-learn ElasticNet model on the diabetes dataset and log the training metrics, parameters, and model artifacts to a Databricks-hosted tracking server (a sketch follows this list)
  • View the training results in the MLflow tracking server UI
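
A minimal sketch of that training flow, assuming the scikit-learn diabetes dataset and illustrative hyperparameter values (the tutorial notebook itself may differ):

    import numpy as np
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import ElasticNet
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    alpha, l1_ratio = 0.05, 0.05  # illustrative values

    data = load_diabetes()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=42)

    with mlflow.start_run():
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

        # Log hyperparameters, the evaluation metric, and the fitted model
        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, "model")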

To learn how to deploy the trained model on AWS SageMaker, see part 2, Quick start model deployment.

PyTorch

PyTorch is a Python package that provides GPU-accelerated tensor computation and high-level functionality for building deep learning networks.

The MLflow PyTorch notebook fits a neural network on the MNIST handwritten digit recognition dataset. The run results are logged to an MLflow server. Training metrics and weights in TensorFlow event format are logged locally and then uploaded to the MLflow run’s artifact directory. Finally, TensorBoard is started and reads the locally logged events.
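
A minimal sketch of the artifact-upload step, assuming a summary writer has already written event files to a local directory (the path and metric value below are illustrative):

    import mlflow

    # Hypothetical local directory containing TensorFlow event files
    event_dir = "/tmp/tensorboard_events"

    with mlflow.start_run():
        mlflow.log_metric("test_accuracy", 0.97)  # illustrative value
        # Upload the locally written event files to the run's artifact directory
        mlflow.log_artifacts(event_dir, artifact_path="events")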

This example runs on Databricks Runtime 5.0 ML (Beta) and above. To install PyTorch on a cluster running Databricks Runtime 5.0 ML, run the PyTorch Init Script notebook to create an init script named pytorch-gpu-init.sh, and configure your cluster to use it. On Databricks Runtime 5.1 ML (Beta), PyTorch is already installed, so you do not need to create the init script or configure your cluster with it.

If you want to run TensorBoard to read the artifacts uploaded to S3, see How to run TensorFlow on S3.

MLeap

The following notebook walks through the process of training a PySpark model, saving it in MLeap format, and deploying the saved MLeap model to SageMaker.
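
As a rough sketch of the save step, assuming pipeline_model is a fitted PySpark PipelineModel and train_df is the DataFrame it was fit on (both hypothetical names), the mlflow.mleap module can log the model in MLeap format as a run artifact:

    import mlflow
    import mlflow.mleap

    with mlflow.start_run():
        # sample_input is used to infer the schema required by the MLeap bundle
        mlflow.mleap.log_model(spark_model=pipeline_model,
                               sample_input=train_df,
                               artifact_path="mleap-model")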