
Monitor apps deployed using Agent Framework

Beta

This feature is in Beta.

This page describes how to set up monitoring for generative AI apps deployed using Mosaic AI Agent Framework. For general information on using monitoring, such as the results schema, viewing results, using the UI, adding alerts, and managing monitors, see What is Lakehouse Monitoring for generative AI?.

Lakehouse Monitoring for gen AI helps you track operational metrics like volume, latency, errors, and cost, as well as quality metrics like correctness and guideline adherence, using Mosaic AI Agent Evaluation AI judges.

How monitoring works:

[Diagram: How it works overview]

The monitoring UI:

[Screenshot: Lakehouse Monitoring for gen AI UI]

Requirements

Python
%pip install databricks-agents>=0.18.1
dbutils.library.restartPython()
  • Serverless jobs must be enabled.
  • LLM Judge metrics require Partner-powered AI assistive features to be enabled. Other metrics, like latency, are supported regardless of this setting.

Limitations

important
  • Traces can take up to 2 hours to be available through the monitoring UI.
  • Quality metrics can take an additional 30 minutes to compute after the trace appears in the monitoring UI.

For more details, see Monitor execution and scheduling.

If you need lower latency, please contact your Databricks account representative.

Set up monitoring

When you deploy agents authored with ChatAgent or ChatModel using agents.deploy, basic monitoring is automatically set up. This includes:

  • Request volume tracking
  • Latency metrics
  • Error logging

This automatic monitoring doesn't include specific evaluation metrics like guideline adherence or safety, but provides essential telemetry to track your agent's usage and performance.
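For reference, the following is a minimal deployment sketch that triggers this automatic monitoring. The Unity Catalog model name and version are placeholders, and the endpoint_name attribute on the returned deployment object is an assumption; the exact call depends on how your agent was logged and registered.

Python
from databricks import agents

# Placeholders: a Unity Catalog model name and the version created when the agent was logged.
uc_model_name = "catalog.schema.my_agent"
model_version = 1

# Deploying creates a model serving endpoint; basic monitoring (request volume,
# latency, error logging) is set up automatically as described above.
deployment = agents.deploy(uc_model_name, model_version)

print(deployment.endpoint_name)  # assumed attribute; this is the endpoint name to pass to update_monitor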

tip

To include end user 👍 / 👎 feedback in your monitor, see Provide feedback on a deployed agent (experimental) for instructions on how to attach feedback to your inference table.

Configure agent monitoring metrics

To add evaluation metrics to the automatic monitoring, use the update_monitor method:

important

A monitor must be attached to an MLflow Experiment. Each experiment can have only one attached monitor (for a single endpoint). By default, update_monitor and create_monitor use the notebook's MLflow Experiment. To override this behavior and select a different experiment, use the experiment_id parameter.

Python
from databricks.agents.monitoring import update_monitor

monitor = update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,  # Sample 1% of requests
        "metrics": ['guideline_adherence', 'groundedness', 'safety', 'relevance_to_query'],
        "global_guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        },
    },
)
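To send the monitor's results to a different MLflow experiment than the notebook's, pass experiment_id explicitly, as described in the note above. A minimal sketch (assuming a partial monitoring_config is accepted), where the experiment ID is a placeholder:

Python
from databricks.agents.monitoring import update_monitor

monitor = update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={"sample": 0.01},
    experiment_id="1234567890",  # placeholder MLflow experiment ID
)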

For agents that were not deployed with automatic monitoring, you can set up monitoring with the create_monitor method:

Python
from databricks.agents.monitoring import create_monitor

monitor = create_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={
        "sample": 0.01,  # Sample 1% of requests
        "metrics": ['guideline_adherence', 'groundedness', 'safety', 'relevance_to_query'],
        "global_guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        },
    },
)

Both methods take the following inputs:

  • endpoint_name: str - Name of the model serving endpoint to monitor.
  • monitoring_config: dict - Configuration for the monitor. The following parameters are supported:
    • sample: float - The fraction of requests to evaluate (between 0 and 1).
    • metrics: list[str] - List of metrics to evaluate. Supported metrics are guideline_adherence, groundedness, safety, relevance_to_query, and chunk_relevance. For more information on these metrics, see Built-in AI judges.
    • [Optional] global_guidelines: dict[str, list[str]] - Global guidelines to evaluate agent responses. See Guideline adherence.
    • [Optional] paused: str - Either PAUSED or UNPAUSED (see the pause sketch after this list).
  • [Optional] experiment_id: The MLflow experiment where monitor results will be displayed. If not specified, the monitor uses the same experiment where the agent was originally logged.
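For example, to temporarily pause evaluation without deleting the monitor, you could pass paused in the config. This is a sketch that assumes paused can be updated on its own:

Python
from databricks.agents.monitoring import update_monitor

# Pause the monitor; pass "UNPAUSED" later to resume evaluation.
monitor = update_monitor(
    endpoint_name="model-serving-endpoint-name",
    monitoring_config={"paused": "PAUSED"},
)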

When you run create_monitor or update_monitor, a link to the monitoring UI appears in the cell output. The evaluation results can be viewed in this UI and are stored in monitor.evaluated_traces_table. To view the evaluated rows, run:

Python
display(spark.table(monitor.evaluated_traces_table).filter("evaluation_status != 'skipped'"))
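To get a quick summary of how many sampled requests were evaluated versus skipped, you can also aggregate on the same evaluation_status column:

Python
# Count evaluated traces by status (for example, evaluated vs. skipped).
display(
    spark.table(monitor.evaluated_traces_table)
    .groupBy("evaluation_status")
    .count()
)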

Monitor execution and scheduling

important
  • Traces can take up to 2 hours to be available through the monitoring UI.
  • Quality metrics can take an additional 30 minutes to compute after the trace appears in the monitoring UI.

When you create a monitor, it initiates a job that evaluates a sample of requests to your endpoint from the last 30 days. This initial evaluation can take several hours to complete, depending on the volume of requests and the sampling rate.

When a request is made to your endpoint, the following happens:

  1. The request and its MLflow Trace are written to the inference table (15 to 30 minutes).
  2. A scheduled job unpacks the inference table into two separate tables: request_log, which contains the request and trace, and assessment_logs, which contains the user feedback (this job runs every hour).
  3. The monitoring job evaluates your specified sample of requests (this job runs every 15 minutes).

Combined, these steps mean that the requests can take up to 2.5 hours to appear in the monitoring UI.

Monitors are backed by Databricks workflows. To manually trigger a refresh of a monitor (step 3), find the workflow with name [<endpoint_name>] Agent Monitoring Job and click Run now.
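If you prefer to trigger the refresh programmatically instead of through the workflows UI, one option is the Databricks Python SDK. This is a sketch, not part of the monitoring API: it assumes the workflow keeps the [<endpoint_name>] Agent Monitoring Job naming convention described above and that you have permission to run it.

Python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
endpoint_name = "model-serving-endpoint-name"  # placeholder

# Look up the monitoring workflow by its display name and trigger a run.
job_name = f"[{endpoint_name}] Agent Monitoring Job"
for job in w.jobs.list(name=job_name):
    w.jobs.run_now(job_id=job.job_id)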

If you need lower latency, please contact your Databricks account representative.

Example notebook

The following example logs and deploys a simple agent, then enables monitoring on it.

Agent monitoring example notebook
