
Production quality monitoring (running scorers automatically)

Beta

This feature is in Beta.

MLflow enables you to automatically run scorers on a sample of your production traces to continuously monitor quality.

Key benefits:

  • Automated quality assessment without manual intervention
  • Flexible sampling to balance coverage with computational cost
  • Consistent evaluation using the same scorers from development
  • Continuous monitoring with periodic background execution

Prerequisites

  1. Install MLflow and required packages

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0" openai
  2. Create an MLflow experiment by following the setup your environment quickstart.

  3. Instrument your production application with MLflow Tracing (see the sketch after this list)

  4. Access to a Unity Catalog schema with CREATE TABLE permissions to store the monitoring outputs.
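
If you have not yet set up an experiment and tracing, the sketch below shows one way to satisfy prerequisites 2 and 3. It assumes an OpenAI-based app running against a Databricks workspace; the experiment path and model name are placeholders to replace with your own.

    Python
    import mlflow
    from openai import OpenAI

    # Point MLflow at your Databricks workspace and an experiment you own.
    # The experiment path is a placeholder; replace it with your own.
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/production-monitoring-demo")

    # Automatically trace every OpenAI call made by your app.
    mlflow.openai.autolog()

    client = OpenAI()

    # Wrap your app's entry point so each production request produces a trace.
    @mlflow.trace
    def answer_question(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content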

Step 1: Test scorers on your production traces

First, test that the scorers you plan to use in production can evaluate your traces.

tip

If you used your production app as the predict_fn in mlflow.genai.evaluate() during development, your scorers are likely already compatible.

warning

MLflow currently only supports using predefined scorers for production monitoring. Contact your Databricks account representative if you need to run custom code-based or LLM-based scorers in production.

  1. Use mlflow.genai.evaluate() to test the scorers on a sample of your traces

    Python
    import mlflow

    from mlflow.genai.scorers import (
        Guidelines,
        RelevanceToQuery,
        RetrievalGroundedness,
        RetrievalRelevance,
        Safety,
    )

    # Get a sample of up to 10 traces from your experiment
    traces = mlflow.search_traces(max_results=10)

    # Run evaluation to test the scorers
    mlflow.genai.evaluate(
        data=traces,
        scorers=[
            RelevanceToQuery(),
            RetrievalGroundedness(),
            RetrievalRelevance(),
            Safety(),
            Guidelines(
                name="mlflow_only",
                # Guidelines can refer to the request and response.
                guidelines="If the request is unrelated to MLflow, the response must refuse to answer.",
            ),
            # You can have any number of guidelines.
            Guidelines(
                name="customer_service_tone",
                guidelines="""The response must maintain our brand voice which is:
    - Professional yet warm and conversational (avoid corporate jargon)
    - Empathetic, acknowledging emotional context before jumping to solutions
    - Proactive in offering help without being pushy

    Specifically:
    - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
    - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
    - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
    - The response must end with a specific next step or open-ended offer to help, not generic closings""",
            ),
        ],
    )
  2. Use the MLflow Trace UI to check which scorers ran

    In this example, even though we ran the RetrievalGroundedness() and RetrievalRelevance() scorers, they did not appear in the MLflow UI. This indicates that these scorers do not work with our traces, so we should not enable them in the next step. (A programmatic check is sketched below.)
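
If you prefer to check programmatically which scorers produced results, a minimal sketch is shown below. It assumes the object returned by mlflow.genai.evaluate() exposes aggregate scores via a metrics dictionary; scorers that could not evaluate your traces will be missing from it.

    Python
    import mlflow
    from mlflow.genai.scorers import RelevanceToQuery, RetrievalGroundedness, RetrievalRelevance, Safety

    # Re-run the evaluation from step 1 and capture the result object.
    traces = mlflow.search_traces(max_results=10)
    results = mlflow.genai.evaluate(
        data=traces,
        scorers=[RelevanceToQuery(), RetrievalGroundedness(), RetrievalRelevance(), Safety()],
    )

    # Aggregate scores are keyed by scorer name (assumption: a `metrics` dict on
    # the returned result). Scorers that could not evaluate your traces produce
    # no aggregate metric, so they will be missing here.
    print(results.metrics)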

Step 2: Enable monitoring

Now, let's enable the monitoring service. Once enabled, the monitoring service will sync a copy of your evaluated traces from your MLflow Experiment to a Delta Table in the Unity Catalog schema you specify.

important

Once set, the Unity Catalog schema cannot be changed.

Follow the recording below to use the UI to enable the scorers that ran successfully in Step 1. The sampling rate controls what fraction of your traces the scorers run on (for example, 1.0 runs the scorers on 100% of your traces, and 0.2 runs them on 20%).

If you want to set the sampling rate per-scorer, you must use the SDK.
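
A minimal sketch of the SDK path is shown below. It assumes the databricks-agents package exposes create_external_monitor together with AssessmentsSuiteConfig, BuiltinJudge, and GuidelinesJudge for configuring per-scorer sampling; verify the exact names and parameters against the SDK reference for your version. The catalog, schema, and judge names are placeholders.

    Python
    # Assumption: these imports and parameters follow the databricks-agents
    # external monitoring SDK; check the SDK reference for your version.
    from databricks.agents.monitoring import (
        AssessmentsSuiteConfig,
        BuiltinJudge,
        GuidelinesJudge,
        create_external_monitor,
    )

    external_monitor = create_external_monitor(
        # Unity Catalog schema that stores the monitoring outputs (placeholders).
        catalog_name="my_catalog",
        schema_name="my_schema",
        assessments_config=AssessmentsSuiteConfig(
            # Default sampling rate for the monitor (1.0 = 100% of traces).
            sample=1.0,
            assessments=[
                # Built-in scorers, optionally with a per-scorer sampling rate.
                BuiltinJudge(name="relevance_to_query"),
                BuiltinJudge(name="safety", sample_rate=0.4),
                # Guidelines reuse the same name/text pairs as in development.
                GuidelinesJudge(
                    guidelines={
                        "mlflow_only": [
                            "If the request is unrelated to MLflow, the response must refuse to answer."
                        ],
                    }
                ),
            ],
        ),
    )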

[Recording: enabling scorers and setting a sampling rate in the monitoring UI]

Step 3: Update your monitor

To change the scorer configuration, use update_external_monitor(). The update is not incremental: the configuration you pass completely replaces the existing one. To retrieve the existing configuration so you can modify it, use get_external_monitor().
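
A minimal sketch of the SDK path, assuming the same databricks-agents SDK as above; get_external_monitor() and update_external_monitor() are named in this guide, but verify their exact parameters against the SDK reference. The experiment ID and configuration values are placeholders.

    Python
    # Assumption: parameter names follow the databricks-agents external
    # monitoring SDK; verify against the SDK reference for your version.
    from databricks.agents.monitoring import (
        AssessmentsSuiteConfig,
        BuiltinJudge,
        get_external_monitor,
        update_external_monitor,
    )

    # Look up the existing monitor attached to your MLflow experiment.
    existing = get_external_monitor(experiment_id="<your-experiment-id>")
    print(existing)

    # Pass the complete configuration you want going forward; it replaces the
    # existing configuration rather than being merged into it.
    update_external_monitor(
        experiment_id="<your-experiment-id>",
        assessments_config=AssessmentsSuiteConfig(
            sample=0.2,
            assessments=[BuiltinJudge(name="safety")],
        ),
    )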

Follow the recording below to use the UI to update the scorers.

[Recording: updating the enabled scorers in the monitoring UI]

Step 4: Use monitoring results

The monitoring job takes approximately 15 to 30 minutes to run for the first time. After the initial run, it runs every 15 minutes. If you have a large volume of production traffic, the job can take additional time to complete.

Each time the job runs, it:

  1. Runs each scorer on the sample of traces
    • If you have different sampling rates per scorer, the monitoring job attempts to score as many of the same traces as possible. For example, if scorer A has a 20% sampling rate and scorer B has a 40% sampling rate, the same 20% of traces will be used for A and B.
  2. Attaches the feedback from the scorer to each trace in the MLflow Experiment
  3. Writes a copy of ALL traces (not just the ones sampled) to the Delta Table in the Unity Catalog schema configured in Step 2.

You can view the monitoring results using the Trace tab in the MLflow Experiment. Alternatively, you can query the traces in the generated Delta Table using SQL or Spark, as sketched below.
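
For example, from a Databricks notebook you can query the generated table with Spark. This is a minimal sketch with a placeholder three-level table name; use the table the monitoring service created in the Unity Catalog schema you configured.

    Python
    # Placeholder three-level name; substitute the catalog, schema, and table
    # that the monitoring service created in your Unity Catalog schema.
    monitored_traces = spark.table("my_catalog.my_schema.monitored_traces")

    # Inspect the schema and a few rows to see what the sync job writes.
    monitored_traces.printSchema()
    display(monitored_traces.limit(10))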

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for concepts and features mentioned in this guide.