Tracking Machine Learning Training Runs

Preview

  • This feature is in Public Preview.
  • The R API is not supported in the Public Preview, but is under development.

An MLflow run is a collection of source properties, parameters, metrics, tags, and artifacts related to training a machine learning model. Each run records the following information:

  • Source: Name of the notebook that launched the run or the project name and entry point for the run.
  • Version: Notebook revision if run from a notebook or Git commit hash if run from an MLflow Project.
  • Start & end time: Start and end time of the run.
  • Parameters: Key-value model parameters. Both keys and values are strings.
  • Tags: Key-value run metadata that can be updated during and after a run completes. Both keys and values are strings.
  • Metrics: Key-value model evaluation metrics. The value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model’s loss function is converging), and MLflow records and lets you visualize the metric’s history.
  • Artifacts: Output files in any format. For example, you can record images, models (for example, a pickled scikit-learn model), and data files (for example, a Parquet file) as an artifact.

You start runs and record parameters, metrics, tags, and artifacts using the MLflow Tracking API.
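
For example, you can attach a tag to the active run with the Python tracking API. This is a minimal sketch assuming the fluent mlflow.set_tag call; the tag key and value are illustrative:

import mlflow

with mlflow.start_run():
    # Tags are string key-value metadata attached to the run
    mlflow.set_tag("model_type", "linear-regression")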

An MLflow experiment is the primary unit of organization and access control for runs – all MLflow runs belong to an experiment. Each experiment lets you visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools. The experiment UI lets you perform the following key tasks:

  • List and compare runs
  • Search for runs by parameter or metric value
  • Visualize run metrics
  • Download run results

Requirements

Tracking with hosted MLflow requires Databricks Runtime >= 5.0 and MLflow >= 0.8.2, and is supported in Python and Java/Scala.

Experiments

Experiments are located in the Workspace file tree.

If you participated in the MLflow private preview, your workspace will have a /Shared/experiments folder for sharing experiments across your organization. You can log to the default experiment in /Shared/experiments/Default Experiment or create a new shared experiment. If you did not participate in the private preview, you can create a /Shared/experiments folder.

You can also create experiments in your home folder under Users. An experiment’s name is the same as its workspace path. If you create an experiment using the mlflow.set_experiment(experiment_name) API, Databricks saves the experiment at the workspace path you pass as its name.
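
For example, the following minimal sketch activates (and creates, if it does not exist) an experiment in your home folder; <username> and the experiment name are placeholders:

import mlflow

# The experiment name is a workspace path; this experiment lives in your home folder
mlflow.set_experiment("/Users/<username>/my-experiment")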

You can control who can view, edit, and manage experiments by enabling Workspace access control.

Create an experiment

  1. Click the Workspace button Workspace Icon or the Home button Home Icon in the sidebar. Do one of the following:

    • Next to any folder, click the Menu Dropdown on the right side of the text and select Create > Experiment.

      ../../_images/mlflow-experiments-create.png
    • In the Workspace or a user folder, click Down Caret and select Create > Experiment.

  2. In the Create Experiment dialog, enter a fully-qualified path in the Workspace and an optional artifact location.

    Databricks supports DBFS and S3 artifact locations. To store artifacts in S3, specify a URI of the form s3://<bucket>/<path>. MLflow obtains credentials to access S3 from your cluster’s IAM role.

    If you do not specify an artifact location, artifacts are stored in dbfs:/databricks/mlflow/<experiment-id>. You can also create an experiment and set its artifact location programmatically; see the sketch after these steps.

  3. Click Create. An empty experiment displays.

    ../../_images/mlflow-experiment-new.png
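
A minimal Python sketch of creating an experiment programmatically, assuming the mlflow.create_experiment API; the experiment path, bucket, and path below are placeholders:

import mlflow

# Create an experiment at a workspace path and store its artifacts in S3.
# Omit artifact_location to use the default dbfs:/databricks/mlflow/<experiment-id> location.
experiment_id = mlflow.create_experiment(
    "/Shared/experiments/My Experiment",
    artifact_location="s3://<bucket>/<path>"
)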

Display an experiment

  1. Click the Workspace button Workspace Icon or the Home button Home Icon in the sidebar.
  2. Navigate to a folder containing an experiment.
  3. Click the experiment name.

Delete an experiment

  1. Click the Workspace button Workspace Icon or the Home button Home Icon in the sidebar.
  2. Navigate to a folder containing an experiment.
  3. Click the Menu Dropdown at the right side of the experiment and select Move to Trash.

Notebook experiments

Every notebook in a Databricks Workspace has its own MLflow experiment. When you use MLflow in a notebook, you can record runs in the notebook’s associated experiment.

A notebook experiment shares the same name as its corresponding notebook. Its experiment ID is the same as the notebook ID of its corresponding notebook. The notebook ID is the numerical identifier at the end of a Notebook URL.

Record runs in notebook experiments

When you run MLflow in a Databricks notebook, you can record runs in the notebook’s experiment using MLflow tracking APIs by referring to its experiment ID or experiment name. For an example that refers to an experiment by name, see the Start and record runs section; a sketch that refers to the notebook experiment by ID follows.
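
In the following sketch, <notebook-id> is a placeholder for the numerical ID at the end of the notebook URL:

import mlflow

# Start a run in the notebook's experiment by passing its experiment ID explicitly
with mlflow.start_run(experiment_id="<notebook-id>") as run:
    mlflow.log_param("alpha", 0.5)  # illustrative parameter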

Experiment auto-detection

If you are using the MLflow Python API in a notebook, MLflow automatically detects the notebook experiment when you create a run. If you use an MLflow version below 0.9.0, you must set the MLFLOW_AUTODETECT_EXPERIMENT_ID environment variable to enable this detection:

import os
os.environ["MLFLOW_AUTODETECT_EXPERIMENT_ID"] = "true"

If you use the Python API with MLflow 0.9.0 or later, no additional configuration is required: MLflow runs created in a notebook are logged to the notebook’s associated experiment.

View notebook experiments

To view the MLflow experiment associated with a notebook, click the MLflow Runs Link Icon icon in the notebook context bar.

../../_images/mlflow-notebook-experiments.gif

Start and record runs

You can start and record MLflow runs in Python or Java/Scala. The following sections summarize the steps. For example notebooks, see Quick Start.

In this section:

  • Python
  • Java/Scala

Python

  1. Install the PyPI library mlflow to a cluster.

  2. Import MLflow library:

    import mlflow
    
  3. Set an experiment name. This step is optional when running in Python notebooks. If you do not explicitly set an experiment, runs are logged to the notebook’s associated MLflow experiment, as described in Notebook experiments.

    mlflow.set_experiment("/Shared/experiments/Quick Start")
    
  4. Start an MLflow run:

    with mlflow.start_run() as run:
    
  5. Within the with block, log parameters, metrics, and artifacts:

    # Log a parameter (key-value pair)
    mlflow.log_param("param1", 5)
    
    # Log a metric; metrics can be updated throughout the run
    mlflow.log_metric("foo", 1)
    mlflow.log_metric("foo", 2)
    mlflow.log_metric("foo", 3)
    
    # Log an artifact (output file)
    with open("output.txt", "w") as f:
        f.write("Hello world!")
    mlflow.log_artifact("output.txt")
    

Java/Scala

  1. Install the PyPI library mlflow and the Maven library org.mlflow:mlflow-client:0.8.2 to a cluster.

  2. Import MLflow and file libraries:

    import org.mlflow.tracking.MlflowClient
    import org.mlflow.api.proto.Service.RunStatus
    import java.io.{File,PrintWriter}
    
  3. Create MLflow client:

    val mlflowClient = new MlflowClient()
    
  4. Create a new experiment and obtain its experiment ID. This step is optional when running in notebooks because by default each notebook has its own MLflow experiment, as described in Notebook experiments.

    val expId = mlflowClient.createExperiment("/Shared/experiments/Quick Start")
    
  5. Create a new run and fetch its run ID:

    val runInfo = mlflowClient.createRun(expId)
    val runId = runInfo.getRunUuid()
    
  6. Log parameters, metrics, and an output file:

    // Log a parameter (key-value pair)
    mlflowClient.logParam(runId, "param1", "5")
    
    // Log a metric; metrics can be updated throughout the run
    mlflowClient.logMetric(runId, "foo", 1.0);
    mlflowClient.logMetric(runId, "foo", 2.0);
    mlflowClient.logMetric(runId, "foo", 3.0);
    
    // Create and log an artifact (output file)
    new PrintWriter("/tmp/output.txt") { write("Hello, world!") ; close }
    mlflowClient.logArtifact(runId, new File("/tmp/output.txt"))
    
  7. Close the run:

    mlflowClient.setTerminated(runId, RunStatus.FINISHED, System.currentTimeMillis())
    

View and manage runs in experiments

Within an experiment you can perform many operations on its contained runs.

Toggle display

Use the toggle List Grid Toggle to switch the display of parameters and metrics lists between horizontal and vertical. The vertical display is useful when you have a lot of parameters or metrics.

../../_images/mlflow-run-horizontal.png

Horizontal

../../_images/mlflow-run-vertical.png

Vertical

Filter runs

To filter runs by a parameter or metric name, type the name in the Filter Params or Filter Metrics field and press Enter.

To filter runs that match an expression containing parameter and metric values:

  1. In the Search Runs field, specify an expression. For example: 'metrics.r2 > 0.3'.

    ../../_images/mlflow-web-ui.png
  2. Click Search.

Download runs

  1. Select one or more runs.
  2. Click Download CSV. A CSV file with the following fields is downloaded: Run ID, Name, Source Type, Source Name, User, Status, <parameter1>, <parameter2>, ..., <metric1>, <metric2>, ....

Display run details

Click the date link of a run. The run details screen displays. The fields in the detail page depend on whether you ran from a notebook or a Git project.

Notebook

If the run was launched from a notebook or job in Databricks, it looks like:

../../_images/mlflow-run-local.png

The link in the Source field opens the specific notebook version used in the run.

../../_images/mlflow-notebook-rev.png
Git project

If the run was launched remotely from a Git project, it looks like:

../../_images/mlflow-run-remote.png

The link in the Source field opens the master branch of the Git project used in the run. The link in the Git Commit field opens the specific version of the project used in the run.

Compare runs

  1. Select two or more runs.

  2. Click Compare. Either select a metric name to display a graph of the metric or select parameters and metrics from the X-axis and Y-axis drop-down lists to generate a scatter plot.

    ../../_images/mlflow-compare-runs.png

    Choose Runs

    The Comparing <N> Runs screen displays. For example, here is a scatter plot. At the top right, the scatter plot has a number of controls for manipulating the plot.

    ../../_images/mlflow-run-comparison.png

    Create Scatter Plot

Delete runs

  1. Select the checkbox at the far left of one or more runs.

  2. Click Delete.

    ../../_images/mlflow-delete-run.png

After you delete a run, you can still display it by selecting Deleted in the State field.

MLflow tracking servers

There are two types of MLflow tracking servers:

  • Databricks managed tracking server.
  • Tracking server that you run. To set up your own tracking server, follow the instructions in MLflow Tracking Servers. To view the MLflow UI of a tracking server you run, go to https://<mlflow-tracking-server>:5000.

Log to a tracking server from a notebook

The procedure for logging to a managed tracking server depends on the version of Databricks Runtime as follows:

  • Databricks Runtime 5.0 and above: the MLflow API logs to the Databricks managed tracking server with no configuration changes required.

  • Databricks Runtime 4.3 and below: you must configure the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN, where DATABRICKS_TOKEN is your API token.

    1. Create a secret to contain your API token.

    2. In a notebook, run:

      import os
      os.environ['DATABRICKS_HOST'] = 'https://<databricks-instance>'
      os.environ['DATABRICKS_TOKEN'] = dbutils.secrets.get(scope = "<token-scope>", key = "token")
      

To log to a tracking server that you run, configure the connection by calling mlflow.set_tracking_uri, as shown below.
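
For example (the host and port are placeholders and should match the tracking server you run):

import mlflow

# Point the MLflow client at your own tracking server before logging runs
mlflow.set_tracking_uri("https://<mlflow-tracking-server>:5000")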

Log to a tracking server from the API or CLI

To run the MLflow API or CLI against a managed tracking server, set the environment variables according to how you have configured Databricks CLI authentication:

  • Use the CLI authentication configured in ~/.databrickscfg. The host and token are picked up from the DEFAULT profile.

    export MLFLOW_TRACKING_URI=databricks
    
  • Use a specific profile configured in ~/.databrickscfg.

    export MLFLOW_TRACKING_URI=databricks://<profile>
    
  • To override ~/.databrickscfg, or if CLI authentication is not configured:

    export MLFLOW_TRACKING_URI=databricks
    export DATABRICKS_HOST=<databricks-instance>
    export DATABRICKS_TOKEN=<token>
    

To run the MLflow API or CLI against a tracking server that you run:

export MLFLOW_TRACKING_URI=https://<mlflow-tracking-server>:5000
export DATABRICKS_TOKEN=<token>