Monitor apps deployed using Agent Framework
This feature is in Beta.
This page describes how to set up monitoring for generative AI apps deployed using Mosaic AI Agent Framework. For general information on using monitoring, such as the results schema, viewing results, using the UI, adding alerts, and managing monitors, see What is Lakehouse Monitoring for generative AI?.
Lakehouse Monitoring for gen AI helps you track operational metrics like volume, latency, errors, and cost, as well as quality metrics like correctness and guideline adherence, using Mosaic AI Agent Evaluation AI judges.
How monitoring works:
The monitoring UI:
Requirements
- Install the databricks-agents SDK in a Databricks notebook.
%pip install "databricks-agents>=0.18.1"
dbutils.library.restartPython()
- Serverless jobs must be enabled.
- LLM Judge metrics require Partner-powered AI assistive features to be enabled. Other metrics, like latency, are supported regardless of this setting.
Limitations
- Traces can take up to 2 hours to be available through the monitoring UI.
- Quality metrics can take an additional 30 minutes to compute after the trace appears in the monitoring UI.
For more details, see Monitor execution and scheduling.
If you need lower latency, please contact your Databricks account representative.
Set up monitoring
When you deploy agents authored with ChatAgent or ChatModel using agents.deploy, basic monitoring is automatically set up. This includes:
- Request volume tracking
- Latency metrics
- Error logging
This automatic monitoring doesn't include specific evaluation metrics like guideline adherence or safety, but provides essential telemetry to track your agent's usage and performance.
To include end user 👍 / 👎 feedback in your monitor, see Provide feedback on a deployed agent (experimental) for instructions on how to attach feedback to your inference table.
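For reference, the following is a minimal sketch of such a deployment. It assumes an agent has already been logged and registered in Unity Catalog under the hypothetical name catalog.schema.my_agent; calling agents.deploy on it creates the serving endpoint with this basic monitoring attached.

from databricks import agents

# Hypothetical Unity Catalog model name and version of an already-registered agent
uc_model_name = "catalog.schema.my_agent"
model_version = 1

# Deploying the agent creates a model serving endpoint and automatically sets up
# basic monitoring (request volume, latency, and error logging).
deployment = agents.deploy(uc_model_name, model_version)

# The endpoint name is what you pass to update_monitor later.
print(deployment.endpoint_name)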
Configure agent monitoring metrics
To add evaluation metrics to the automatic monitoring, use the update_monitor method:
A monitor must be attached to an MLflow Experiment. Each experiment can have only one attached monitor (for a single endpoint). By default, update_monitor and create_monitor use the notebook's MLflow Experiment. To override this behavior and select a different experiment, use the experiment_id parameter.
from databricks.agents.monitoring import update_monitor

monitor = update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,  # Sample 1% of requests
        "metrics": ['guideline_adherence', 'groundedness', 'safety', 'relevance_to_query'],
        "global_guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        },
    },
)
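For example, the following sketch attaches the monitor to a specific experiment instead of the notebook's experiment. The experiment ID shown is a placeholder.

from databricks.agents.monitoring import update_monitor

monitor = update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,
        "metrics": ['safety'],
    },
    experiment_id="1234567890",  # placeholder: ID of the MLflow experiment to attach the monitor to
)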
For agents not deployed with automatic monitoring, you can set up monitoring with the create_monitor method:
from databricks.agents.monitoring import create_monitor

monitor = create_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,  # Sample 1% of requests
        "metrics": ['guideline_adherence', 'groundedness', 'safety', 'relevance_to_query'],
        "global_guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        },
    },
)
Both methods take the following inputs:
- endpoint_name: str - Name of the model serving endpoint to monitor.
- monitoring_config: dict - Configuration for the monitor. The following parameters are supported:
  - sample: float - The fraction of requests to evaluate (between 0 and 1).
  - metrics: list[str] - List of metrics to evaluate. Supported metrics are guideline_adherence, groundedness, safety, relevance_to_query, and chunk_relevance. For more information on these metrics, see Built-in AI judges.
  - [Optional] global_guidelines: dict[str, list[str]] - Global guidelines to evaluate agent responses. See Guideline adherence.
  - [Optional] paused: str - Either PAUSED or UNPAUSED.
- [Optional] experiment_id - The MLflow experiment where monitor results will be displayed. If not specified, the monitor uses the same experiment where the agent was originally logged.
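As an illustration of the optional paused parameter, the following sketch temporarily pauses and later resumes scheduled evaluation. It assumes update_monitor accepts a monitoring_config that only changes the paused state.

from databricks.agents.monitoring import update_monitor

# Pause scheduled evaluation for this endpoint's monitor
update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={"paused": "PAUSED"},
)

# Resume it later
update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={"paused": "UNPAUSED"},
)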
You will see a link to the monitoring UI in the cell output. The evaluation results can be viewed in this UI, and they are stored in monitor.evaluated_traces_table. To view evaluated rows, run:
display(spark.table(monitor.evaluated_traces_table).filter("evaluation_status != 'skipped'"))
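To get a quick count of how many sampled requests were evaluated versus skipped, you can also aggregate on the evaluation_status column used in the filter above (a sketch):

from pyspark.sql import functions as F

evaluated = spark.table(monitor.evaluated_traces_table)

# Count traces by evaluation status (for example, evaluated vs. skipped)
display(evaluated.groupBy("evaluation_status").agg(F.count("*").alias("num_traces")))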
Monitor execution and scheduling
- Traces can take up to 2 hours to be available through the monitoring UI.
- Quality metrics can take an additional 30 minutes to compute after the trace appears in the monitoring UI.
When you create a monitor, it initiates a job that evaluates a sample of requests to your endpoint from the last 30 days. This initial evaluation can take several hours to complete, depending on the volume of requests and the sampling rate.
When a request is made to your endpoint, the following happens:
1. The request and its MLflow Trace are written to the inference table (15 to 30 minutes).
2. A scheduled job unpacks the inference table into two separate tables: request_log, which contains the request and trace, and assessment_logs, which contains the user feedback (this job runs every hour).
3. The monitoring job evaluates your specified sample of requests (this job runs every 15 minutes).
Combined, these steps mean that requests can take up to 2.5 hours to appear in the monitoring UI.
Monitors are backed by Databricks workflows. To manually trigger a refresh of a monitor (step 3 above), find the workflow named [<endpoint_name>] Agent Monitoring Job and click Run now.
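If you prefer to trigger that refresh programmatically rather than through the workflows UI, the following sketch uses the Databricks Python SDK. It assumes the databricks-sdk package is installed and that the monitoring job follows the naming pattern above.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
endpoint_name = "model-serving-endpoint-name"

# Find the monitoring job by name and trigger a run
job_name = f"[{endpoint_name}] Agent Monitoring Job"
for job in w.jobs.list(name=job_name):
    w.jobs.run_now(job_id=job.job_id)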
If you need lower latency, please contact your Databricks account representative.
Example notebook
The following example logs and deploys a simple agent, then enables monitoring on it.
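The notebook itself is not reproduced here; the following is a condensed sketch of the same flow. The agent code file, Unity Catalog model name, and metric choices are placeholders, and the exact mlflow.pyfunc.log_model arguments may vary with your MLflow version.

import mlflow
from databricks import agents
from databricks.agents.monitoring import update_monitor

# 1. Log the agent as code (agent.py is assumed to define a ChatAgent) and register it to Unity Catalog.
mlflow.set_registry_uri("databricks-uc")
with mlflow.start_run():
    logged_agent = mlflow.pyfunc.log_model(
        artifact_path="agent",
        python_model="agent.py",
        registered_model_name="catalog.schema.my_agent",  # placeholder Unity Catalog model name
    )

# 2. Deploy the agent. This also sets up basic monitoring (volume, latency, errors).
deployment = agents.deploy("catalog.schema.my_agent", logged_agent.registered_model_version)

# 3. Add quality metrics to the automatically created monitor.
update_monitor(
    endpoint_name=deployment.endpoint_name,
    monitoring_config={
        "sample": 0.1,  # evaluate 10% of requests
        "metrics": ['guideline_adherence', 'safety'],
        "global_guidelines": {
            "english": ["The response must be in English"],
        },
    },
)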