Monitor agent quality in production
This notebook runs Agent Evaluation on a sample of the requests served by an agent endpoint.
- To run the notebook once, fill in the required parameters up top and click Run all (see the parameter sketch after this list).
- To continuously monitor your production traffic, click Schedule to create a job to run the notebook periodically. For endpoints with a large number of requests, we recommend setting an hourly schedule.
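For reference, the "required parameters up top" are Databricks notebook widgets. A minimal sketch of how such widgets are defined and read; the widget names and defaults below are illustrative assumptions, not this notebook's actual widgets:

```python
# Illustrative only -- this notebook already defines its own widgets with its own names.
dbutils.widgets.text("endpoint_name", "", "Agent endpoint to monitor")
dbutils.widgets.text("sample_rate", "0.1", "Fraction of requests to evaluate")

endpoint_name = dbutils.widgets.get("endpoint_name")
sample_rate = float(dbutils.widgets.get("sample_rate"))
```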
The notebook creates a few artifacts:
- A table that records a sample of the requests received by an agent endpoint along with the metrics calculated by Agent Evaluation on those requests.
- A dashboard that visualizes the evaluation results.
- An MLflow experiment to track runs of `mlflow.evaluate` (a sketch of such a call follows this list).
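For context, a hedged sketch of the kind of call the notebook issues on each run. `requests_df` is a stand-in for the sampled request logs, and the exact columns passed are assumptions here:

```python
import mlflow

# Sketch only: evaluate a batch of sampled requests with Agent Evaluation.
# `requests_df` is assumed to hold the sampled requests in the Agent Evaluation
# input schema (e.g., `request` and `response` columns).
results = mlflow.evaluate(
    data=requests_df,
    model_type="databricks-agent",  # selects Databricks Agent Evaluation judges
)
print(results.metrics)  # aggregate metrics; per-row results are in results.tables
```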
The derived table is named `<inference_table>_request_logs_eval`, where `<inference_table>` is the inference table associated with the agent endpoint. The dashboard is created automatically and is linked in the final cells of the notebook. You can use the table of contents at the left of the notebook to go directly to this cell.
Note: You should not need to edit this notebook, other than filling in the widgets at the top. This notebook requires either Serverless compute or a cluster with Databricks Runtime 15.2 or above.