Production quality monitoring (running scorers automatically)
This feature is in Beta.
MLflow enables you to automatically run scorers on a sample of your production traces to continuously monitor quality.
Key benefits:
- Automated quality assessment without manual intervention
- Flexible sampling to balance coverage with computational cost
- Consistent evaluation using the same scorers from development
- Continuous monitoring with periodic background execution
Prerequisites
- Install MLflow and the required packages:
Bash
pip install --upgrade "mlflow[databricks]>=3.1.0" openai
- Create an MLflow experiment by following the setup your environment quickstart.
- Instrument your production application with MLflow tracing (see the sketch after this list).
- Access to a Unity Catalog schema with CREATE TABLE permissions to store the monitoring outputs.
Note: If you are using a Databricks trial account, you have CREATE TABLE permissions on the Unity Catalog schema workspace.default.
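The experiment and tracing prerequisites amount to a few lines of setup. Below is a minimal sketch assuming an OpenAI-based app; the experiment path, model name, and answer_question function are placeholders for your own.
Python
import mlflow
import openai

# Point MLflow at Databricks and at the experiment that will receive production traces.
# The experiment path is a placeholder; use your own.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/my-genai-app-monitoring")

# Automatically trace OpenAI calls made by the app.
mlflow.openai.autolog()

# Trace the app's entry point so each production request produces a trace.
@mlflow.trace
def answer_question(question: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content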
Step 1: Test scorers on your production traces
First, verify that the scorers you plan to use in production can evaluate your traces. If you used your production app as the predict_fn in mlflow.genai.evaluate() during development (as sketched below), your scorers are likely already compatible.
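For reference, a development-time call of that shape might look like the following sketch; my_app, its question parameter, and the sample row are illustrative placeholders rather than part of this guide's app.
Python
import mlflow
from mlflow.genai.scorers import Safety

# Placeholder for your production app's entry point.
@mlflow.trace
def my_app(question: str) -> str:
    return "..."  # call your model / retrieval pipeline here

# Development-time evaluation: each row's "inputs" dict is passed to
# predict_fn as keyword arguments, and the scorers run on the results.
mlflow.genai.evaluate(
    data=[{"inputs": {"question": "How do I log a model with MLflow?"}}],
    predict_fn=my_app,
    scorers=[Safety()],
)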
MLflow currently supports only predefined scorers for production monitoring. Contact your Databricks account representative if you need to run custom code-based or LLM-based scorers in production.
- Use mlflow.genai.evaluate() to test the scorers on a sample of your traces:
Python
import mlflow
from mlflow.genai.scorers import (
    Guidelines,
    RelevanceToQuery,
    RetrievalGroundedness,
    RetrievalRelevance,
    Safety,
)
# Get a sample of up to 10 traces from your experiment
traces = mlflow.search_traces(max_results=10)
# Run evaluation to test the scorers
mlflow.genai.evaluate(
    data=traces,
    scorers=[
        RelevanceToQuery(),
        RetrievalGroundedness(),
        RetrievalRelevance(),
        Safety(),
        Guidelines(
            name="mlflow_only",
            # Guidelines can refer to the request and response.
            guidelines="If the request is unrelated to MLflow, the response must refuse to answer.",
        ),
        # You can have any number of guidelines.
        Guidelines(
            name="customer_service_tone",
            guidelines="""The response must maintain our brand voice which is:
            - Professional yet warm and conversational (avoid corporate jargon)
            - Empathetic, acknowledging emotional context before jumping to solutions
            - Proactive in offering help without being pushy
            Specifically:
            - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
            - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
            - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
            - The response must end with a specific next step or open-ended offer to help, not generic closings""",
        ),
    ],
)
- Use the MLflow Trace UI to check which scorers ran.
In this case, even though we ran the RetrievalGroundedness() and RetrievalRelevance() scorers, they did not show up in the MLflow UI. This indicates that these scorers do not work with our traces, so we should not enable them in the next step.
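If you prefer to confirm this programmatically rather than in the UI, one option is to list the assessments attached to recent traces. The sketch below assumes the MLflow 3 trace schema, where assessments are exposed on trace.info; the experiment ID is a placeholder.
Python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# List which scorers attached feedback to recent traces.
# Assumes assessments are available on trace.info (MLflow 3 trace schema).
for trace in client.search_traces(experiment_ids=["<your-experiment-id>"], max_results=10):
    scorer_names = sorted({a.name for a in (trace.info.assessments or [])})
    print(trace.info.trace_id, scorer_names)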
Step 2: Enable monitoring
Now, let's enable the monitoring service. Once enabled, it syncs a copy of your evaluated traces from your MLflow Experiment to a Delta Table in the Unity Catalog schema you specify.
Once set, the Unity Catalog schema cannot be changed.
- Using the UI
- Using the SDK
Follow the recording below to use the UI to enable the scorers that successfully ran in Step 1. The sampling rate determines the percentage of traces the scorers run on (for example, 1.0 runs the scorers on 100% of your traces, and 0.2 runs them on 20%).
If you want to set the sampling rate per scorer, you must use the SDK.
Use the code snippet below to enable the scorers that successfully ran in Step 1. The sampling rate determines the percentage of traces the scorers run on (for example, 1.0 runs the scorers on 100% of your traces, and 0.2 runs them on 20%). Optionally, you can configure the sampling rate per scorer.
# These packages are automatically installed with mlflow[databricks]
from databricks.agents.monitoring import create_external_monitor, AssessmentsSuiteConfig, BuiltinJudge, GuidelinesJudge
external_monitor = create_external_monitor(
    # Change to a Unity Catalog schema where you have CREATE TABLE permissions.
    catalog_name="workspace",
    schema_name="default",
    assessments_config=AssessmentsSuiteConfig(
        sample=1.0,  # sampling rate
        assessments=[
            # Predefined scorers "safety", "groundedness", "relevance_to_query", "chunk_relevance"
            BuiltinJudge(name="safety"),  # or {'name': 'safety'}
            BuiltinJudge(
                name="groundedness", sample_rate=0.4
            ),  # or {'name': 'groundedness', 'sample_rate': 0.4}
            BuiltinJudge(
                name="relevance_to_query"
            ),  # or {'name': 'relevance_to_query'}
            BuiltinJudge(name="chunk_relevance"),  # or {'name': 'chunk_relevance'}
            # Guidelines can refer to the request and response.
            GuidelinesJudge(
                guidelines={
                    # You can have any number of guidelines, each defined as a key-value pair.
                    "mlflow_only": [
                        "If the request is unrelated to MLflow, the response must refuse to answer."
                    ],  # Must be an array of strings
                    "customer_service_tone": [
                        """The response must maintain our brand voice which is:
                        - Professional yet warm and conversational (avoid corporate jargon)
                        - Empathetic, acknowledging emotional context before jumping to solutions
                        - Proactive in offering help without being pushy
                        Specifically:
                        - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
                        - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
                        - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
                        - The response must end with a specific next step or open-ended offer to help, not generic closings"""
                    ],
                }
            ),
        ],
    ),
)
print(external_monitor)
Step 3: Update your monitor
To change the scorer configuration, use update_external_monitor(). Updates are not merged: the new configuration completely replaces the existing one, so include every scorer you want to keep. To retrieve the existing configuration as a starting point, use get_external_monitor().
- Using the UI
- Using the SDK
Follow the recording below to use the UI to update the scorers.
# These packages are automatically installed with mlflow[databricks]
import os

from databricks.agents.monitoring import (
    AssessmentsSuiteConfig,
    BuiltinJudge,
    GuidelinesJudge,
    get_external_monitor,
    update_external_monitor,
)
config = get_external_monitor(experiment_id=os.environ["MLFLOW_EXPERIMENT_ID"])
print(config)
external_monitor = update_external_monitor(
    # You must pass the experiment_id of the experiment you want to update.
    experiment_id=os.environ["MLFLOW_EXPERIMENT_ID"],
    # The new config completely replaces the existing one.
    assessments_config=AssessmentsSuiteConfig(
        sample=1.0,  # sampling rate
        assessments=[
            # Predefined scorers "safety", "groundedness", "relevance_to_query", "chunk_relevance"
            BuiltinJudge(name="safety"),  # or {'name': 'safety'}
            BuiltinJudge(
                name="groundedness", sample_rate=0.4
            ),  # or {'name': 'groundedness', 'sample_rate': 0.4}
            BuiltinJudge(
                name="relevance_to_query"
            ),  # or {'name': 'relevance_to_query'}
            BuiltinJudge(name="chunk_relevance"),  # or {'name': 'chunk_relevance'}
            # Guidelines can refer to the request and response.
            GuidelinesJudge(
                guidelines={
                    # You can have any number of guidelines, each defined as a key-value pair.
                    "mlflow_only": [
                        "If the request is unrelated to MLflow, the response must refuse to answer."
                    ],  # Must be an array of strings
                    "customer_service_tone": [
                        """The response must maintain our brand voice which is:
                        - Professional yet warm and conversational (avoid corporate jargon)
                        - Empathetic, acknowledging emotional context before jumping to solutions
                        - Proactive in offering help without being pushy
                        Specifically:
                        - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
                        - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
                        - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
                        - The response must end with a specific next step or open-ended offer to help, not generic closings"""
                    ],
                }
            ),
        ],
    ),
)
print(external_monitor)
Step 4: Use monitoring results
The monitoring job takes approximately 15 to 30 minutes to run for the first time. After the initial run, it runs every 15 minutes. If you have a large volume of production traffic, the job can take additional time to complete.
Each time the job runs, it:
- Runs each scorer on the sample of traces
- If you have different sampling rates per scorer, the monitoring job attempts to score as many of the same traces as possible. For example, if scorer A has a 20% sampling rate and scorer B has a 40% sampling rate, the 20% of traces scored by A is a subset of the 40% scored by B.
- Attaches the feedback from the scorer to each trace in the MLflow Experiment
- Writes a copy of ALL traces (not just the ones sampled) to the Delta Table configured in Step 2.
You can view the monitoring results using the Trace tab in the MLflow Experiment. Alternatively, you can query the generated Delta Table using SQL or Spark, as sketched below.
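For example, in a Databricks notebook you can load the synced table with Spark. This is a minimal sketch: the table name is a placeholder for the Delta Table the monitor generates in your schema, and the exact columns depend on that generated schema.
Python
# Run in a Databricks notebook, where `spark` and `display` are predefined.
# The table name is a placeholder; use the table the monitor created in your
# Unity Catalog schema (visible in Catalog Explorer).
monitored_traces = spark.table("workspace.default.<your_monitoring_table>")

# Inspect the synced traces and their attached assessments.
monitored_traces.printSchema()
display(monitored_traces.limit(10))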
Next steps
Continue your journey with these recommended actions and tutorials.
- Use production traces to improve your app's quality - Create semantic evaluation using LLMs
- Build evaluation datasets - Use monitoring results to curate low-performing traces into evaluation datasets that drive quality improvements.
Reference guides
Explore detailed documentation for concepts and features mentioned in this guide.
- Production Monitoring - Deep dive into the production monitoring SDKs