Monitor apps deployed using Agent Framework
This feature is in Beta.
This page describes how to set up monitoring for generative AI apps deployed using Mosaic AI Agent Framework. For general information on using monitoring, such as the results schema, viewing results, using the UI, adding alerts, and managing monitors, see What is Lakehouse Monitoring for generative AI?.
Lakehouse Monitoring for gen AI helps you track operational metrics like volume, latency, errors, and cost, as well as quality metrics like correctness and guideline adherence, using Mosaic AI Agent Evaluation AI judges.
How monitoring works:
The monitoring UI:
Requirements
- Install the databricks-agents SDK in a Databricks notebook.
%pip install "databricks-agents>=0.18.1"
dbutils.library.restartPython()
- Serverless jobs must be enabled.
- LLM Judge metrics require Partner-powered AI assistive features to be enabled. Other metrics, like latency, are supported regardless of this setting.
Limitations
- Traces can take up to 2 hours to be available through the monitoring UI.
- Quality metrics can take an additional 30 minutes to compute after the trace appears in the monitoring UI.
For more details, see Monitor execution and scheduling.
If you need lower latency, please contact your Databricks account representative.
Set up monitoring
When you deploy agents authored with ChatAgent or ChatModel using agents.deploy, basic monitoring is automatically set up. This includes:
- Request volume tracking
- Latency metrics
- Error logging
This automatic monitoring doesn't include specific evaluation metrics like guideline adherence or safety, but provides essential telemetry to track your agent's usage and performance.
To include end user 👍 / 👎 feedback in your monitor, see Provide feedback on a deployed agent (experimental) for instructions on how to attach feedback to your inference table.
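For reference, the following is a minimal sketch of such a deployment. It assumes an agent has already been logged and registered in Unity Catalog under the hypothetical name catalog.schema.my_agent; calling agents.deploy on it creates the serving endpoint with this basic monitoring attached.

from databricks import agents

# Hypothetical Unity Catalog model name and version of an already-registered agent
uc_model_name = "catalog.schema.my_agent"
model_version = 1

# Deploying the agent creates a model serving endpoint and automatically sets up
# basic monitoring (request volume, latency, and error logging).
deployment = agents.deploy(uc_model_name, model_version)

# The endpoint name is what you pass to update_monitor later.
print(deployment.endpoint_name)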
Configure agent monitoring metrics
To add evaluation metrics to the automatic monitoring, use the update_monitor method:
A monitor must be attached to an MLflow Experiment. Each experiment can have only one attached monitor (for a single endpoint). By default, update_monitor and create_monitor use the notebook's MLflow Experiment. To override this behavior and select a different experiment, use the experiment_id parameter.
from databricks.agents.monitoring import update_monitor

monitor = update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,  # Sample 1% of requests
        "metrics": ['guideline_adherence', 'groundedness', 'safety', 'relevance_to_query'],
        "global_guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        },
    },
)
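For example, the following sketch attaches the monitor to a specific experiment instead of the notebook's experiment. The experiment ID shown is a placeholder.

from databricks.agents.monitoring import update_monitor

monitor = update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,
        "metrics": ['safety'],
    },
    experiment_id="1234567890",  # placeholder: ID of the MLflow experiment to attach the monitor to
)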
For agents not deployed with automatic monitoring, you can set up monitoring with the create_monitor method:
from databricks.agents.monitoring import create_monitor

monitor = create_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,  # Sample 1% of requests
        "metrics": ['guideline_adherence', 'groundedness', 'safety', 'relevance_to_query'],
        "global_guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        },
    },
)
Both methods take the following inputs:
- endpoint_name: str - Name of the model serving endpoint to monitor.
- monitoring_config: dict - Configuration for the monitor. The following parameters are supported:
  - sample: float - The fraction of requests to evaluate (between 0 and 1).
  - metrics: list[str] - List of metrics to evaluate. Supported metrics are guideline_adherence, groundedness, safety, relevance_to_query, and chunk_relevance. For more information on these metrics, see Built-in AI judges.
  - [Optional] global_guidelines: dict[str, list[str]] - Global guidelines to evaluate agent responses. See Guideline adherence.
  - [Optional] paused: str - Either PAUSED or UNPAUSED.
- [Optional] experiment_id - The MLflow experiment where monitor results will be displayed. If not specified, the monitor uses the same experiment where the agent was originally logged.
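As an illustration of the optional paused parameter, the following sketch temporarily pauses and later resumes scheduled evaluation. It assumes update_monitor accepts a monitoring_config that only changes the paused state.

from databricks.agents.monitoring import update_monitor

# Pause scheduled evaluation for this endpoint's monitor
update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={"paused": "PAUSED"},
)

# Resume it later
update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={"paused": "UNPAUSED"},
)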
You will see a link to the monitoring UI in the cell output. The evaluation results can be viewed in this UI, and they are stored in monitor.evaluated_traces_table. To view evaluated rows, run:
display(spark.table(monitor.evaluated_traces_table).filter("evaluation_status != 'skipped'"))
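To get a quick count of how many sampled requests were evaluated versus skipped, you can also aggregate on the evaluation_status column used in the filter above (a sketch):

from pyspark.sql import functions as F

evaluated = spark.table(monitor.evaluated_traces_table)

# Count traces by evaluation status (for example, evaluated vs. skipped)
display(evaluated.groupBy("evaluation_status").agg(F.count("*").alias("num_traces")))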
Monitor execution and scheduling
- Traces can take up to 2 hours to be available through the monitoring UI.
- Quality metrics can take an additional 30 minutes to compute after the trace appears in the monitoring UI.
When you create a monitor, it initiates a job that evaluates a sample of requests to your endpoint from the last 30 days. This initial evaluation can take several hours to complete, depending on the volume of requests and the sampling rate.
When a request is made to your endpoint, the following happens:
1. The request and its MLflow Trace are written to the inference table (15 to 30 minutes).
2. A scheduled job unpacks the inference table into two separate tables: request_log, which contains the request and trace, and assessment_logs, which contains the user feedback (this job runs every hour).
3. The monitoring job evaluates your specified sample of requests (this job runs every 15 minutes).
Combined, these steps mean that requests can take up to 2.5 hours to appear in the monitoring UI.
Monitors are backed by Databricks workflows. To manually trigger a refresh of a monitor (step 3 above), find the workflow named [<endpoint_name>] Agent Monitoring Job and click Run now.
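If you prefer to trigger that refresh programmatically rather than through the workflows UI, the following sketch uses the Databricks Python SDK. It assumes the databricks-sdk package is installed and that the monitoring job follows the naming pattern above.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
endpoint_name = "model-serving-endpoint-name"

# Find the monitoring job by name and trigger a run
job_name = f"[{endpoint_name}] Agent Monitoring Job"
for job in w.jobs.list(name=job_name):
    w.jobs.run_now(job_id=job.job_id)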
If you need lower latency, please contact your Databricks account representative.
Example notebook
The following example logs and deploys a simple agent, then enables monitoring on it.
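The notebook itself is not reproduced here; the following is a condensed sketch of the same flow. The agent code file, Unity Catalog model name, and metric choices are placeholders, and the exact mlflow.pyfunc.log_model arguments may vary with your MLflow version.

import mlflow
from databricks import agents
from databricks.agents.monitoring import update_monitor

# 1. Log the agent as code (agent.py is assumed to define a ChatAgent) and register it to Unity Catalog.
mlflow.set_registry_uri("databricks-uc")
with mlflow.start_run():
    logged_agent = mlflow.pyfunc.log_model(
        artifact_path="agent",
        python_model="agent.py",
        registered_model_name="catalog.schema.my_agent",  # placeholder Unity Catalog model name
    )

# 2. Deploy the agent. This also sets up basic monitoring (volume, latency, errors).
deployment = agents.deploy("catalog.schema.my_agent", logged_agent.registered_model_version)

# 3. Add quality metrics to the automatically created monitor.
update_monitor(
    endpoint_name=deployment.endpoint_name,
    monitoring_config={
        "sample": 0.1,  # evaluate 10% of requests
        "metrics": ['guideline_adherence', 'safety'],
        "global_guidelines": {
            "english": ["The response must be in English"],
        },
    },
)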