Mosaic AI Agent Evaluation: Custom metrics, guidelines and domain expert labels
This notebook demonstrates how to evaluate a GenAI app using Agent Evaluation's proprietary LLM judges, custom metrics, and labels from domain experts. It covers:
- Loading production logs (traces) into an evaluation dataset
- Running evaluation and doing root cause analysis
- Writing custom metrics to automatically detect quality issues
- Sending production logs for SMEs to label and evolve the evaluation dataset
To get your agent ready for pre-production, see the Mosaic AI agent demo notebook (AWS | Azure).
To learn more about Mosaic AI Agent Evaluation, see Databricks documentation (AWS | Azure).
Requirements
- See the requirements of Agent Evaluation (AWS | Azure)
- Serverless or classic cluster running Databricks Runtime 15.4 LTS or above, or Databricks Runtime for Machine Learning 15.4 LTS or above.
- CREATE TABLE access in a Unity Catalog Schema

Select a Unity Catalog schema
Ensure you have CREATE TABLE access in this schema. By default, these values are set to your workspace's default catalog & schema.
A simple tool calling agent
Below is a simple tool-calling agent, built with LangGraph, that has 2 tools:
- `multiply`, which takes 2 numbers and multiplies them
- `query_docs`, which takes a set of keywords and returns relevant docs about Databricks using keyword search
For the purposes of this demo notebook, it is not important how the agent code works - this demo focuses on how to evaluate the agent's quality.
Note: Agent Evaluation works with any GenAI app, no matter how it is built, as long as the app accepts a `Dict[str, Any]` input and returns a `Dict[str, Any]` output.
For more examples of tools to add to your agent, see Databricks documentation (AWS | Azure)
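For illustration, here is a minimal sketch of such an agent, assuming the `databricks-langchain` and `langgraph` packages are installed. The endpoint name and the toy document corpus are placeholders, not part of the original demo:

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from databricks_langchain import ChatDatabricks

# Placeholder corpus standing in for a real keyword-search index.
DOCS = {
    "autoscaling": "Databricks autoscaling adjusts cluster size based on workload.",
    "unity catalog": "Unity Catalog provides centralized governance for data and AI assets.",
}


@tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b


@tool
def query_docs(keywords: str) -> str:
    """Return Databricks docs that match the given keywords."""
    hits = [text for key, text in DOCS.items() if key in keywords.lower()]
    return "\n".join(hits) or "No matching documents found."


# Example endpoint name; use any chat model serving endpoint available in your workspace.
llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct")
agent = create_react_agent(llm, tools=[multiply, query_docs])
```

Invoking it with `agent.invoke({"messages": [{"role": "user", "content": "What is 2 times 3?"}]})` returns a dict of messages, which satisfies the `Dict[str, Any]` in / `Dict[str, Any]` out contract described above.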
Select (pre)production logs
Since this is a demo notebook, we generate example production logs below to demonstrate the new features in Agent Evaluation. We call our agent directly and log traces in MLflow.
NOTE: MLflow Tracing will visualize each trace (with pagination) in the cell output when you call your agent or retrieve traces using `mlflow.search_traces`.
Once you have an agent deployed on Databricks (deployment is covered at the end of this notebook), locate the `request_ids` to be reviewed from the `<model_name>_payload_request_logs` inference table. The inference table is in the same Unity Catalog catalog and schema where the model was registered. Sample code for this is near the bottom of this notebook.
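For this demo, generating the example logs and pulling them back out might look like the following sketch, which reuses the `agent` defined in the earlier sketch:

```python
import mlflow

# Automatically log an MLflow trace for every agent invocation.
mlflow.langchain.autolog()

example_requests = [
    {"messages": [{"role": "user", "content": "What is 5 times 7?"}]},
    {"messages": [{"role": "user", "content": "How does Databricks autoscaling work?"}]},
]
for request in example_requests:
    agent.invoke(request)

# Retrieve the logged traces as a pandas DataFrame (request, response, trace, ... columns).
traces = mlflow.search_traces()
```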
Load the traces into an evaluation dataset
Important: Before running this cell, ensure the values of the `uc_catalog` and `uc_schema` widgets are set to a Unity Catalog schema where you have CREATE TABLE permissions. Re-running this cell will re-create the evaluation dataset.
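A rough sketch of this step, assuming the `databricks-agents` evaluation dataset API; the `create_dataset` / `insert` names below are assumptions and may differ in your installed version, so check the linked documentation for the exact surface:

```python
from databricks.agents import datasets  # assumed module path for the evaluation dataset API

# Widget values selected at the top of the notebook.
uc_catalog = dbutils.widgets.get("uc_catalog")
uc_schema = dbutils.widgets.get("uc_schema")
uc_table = f"{uc_catalog}.{uc_schema}.agent_eval_dataset"

# Create the Unity Catalog-backed evaluation dataset and seed it with the traces gathered above.
dataset = datasets.create_dataset(uc_table)  # assumed function name
dataset.insert(traces)                       # assumed method name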
Run an evaluation
Agent Evaluation's built-in judges
- Judges that run without ground-truth labels or retrieval in traces:
  - `guidelines`: allows developers to write plain-language checklists or rubrics in their evaluation, improving transparency and trust with business stakeholders through easy-to-understand, structured grading rubrics
  - `safety`: makes sure the response is safe
  - `relevance_to_query`: makes sure the response is relevant to the query
- For traces with retrieved docs (spans of type `RETRIEVER`):
  - `groundedness`: detects hallucinations
  - `chunk_relevance`: chunk-level relevance to the query
- Later, when we collect ground-truth labels using the Review app, we will benefit from two more judges:
  - `correctness`: will be ignored until we collect labels like `expected_facts`
  - `context_sufficiency`: will be ignored until we collect labels like `expected_facts`
See the full list of built-in judges (AWS | Azure) and how to run a subset of judges or customize judges (AWS | Azure).
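Conceptually, the evaluation is a single `mlflow.evaluate` call with the Databricks agent evaluator. A sketch using the traces DataFrame from above; the guideline text is illustrative:

```python
import mlflow

# Plain-language guidelines applied to every evaluation row (illustrative content).
global_guidelines = {
    "no_pricing": ["The response must not discuss pricing; direct the user to their account team instead."],
}

results = mlflow.evaluate(
    data=traces,                     # DataFrame with request/response/trace columns from mlflow.search_traces()
    model_type="databricks-agent",   # selects Mosaic AI Agent Evaluation's built-in LLM judges
    evaluator_config={
        "databricks-agent": {
            "global_guidelines": global_guidelines,
        }
    },
)
display(results.tables["eval_results"])  # per-row judge assessments for root cause analysis
```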
Custom metrics
- Check the quality of tool calling:
  - `tool_calls_are_logical`: asserts that the tools selected in the trace were logical given the user's request
  - `grounded_in_tool_outputs`: asserts that the LLM's responses are grounded in the outputs of the tools and not hallucinated
- Measure the agent's cost & latency:
  - `latency`: extracts the latency from the MLflow trace
  - `cost`: extracts the total tokens used and multiplies by the LLM token rate
This notebook creates custom metrics (AWS | Azure) that use Mosaic AI callable judges. Custom metrics can be any Python function. More examples: (AWS | Azure).

Define the custom metrics
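The judge-backed metrics (`tool_calls_are_logical`, `grounded_in_tool_outputs`) follow the same `@metric` pattern but call the Mosaic AI callable judges. The sketch below shows the two simpler trace-based metrics; the token-usage location and the per-token rate are assumptions that depend on your model flavor and endpoint pricing:

```python
from databricks.agents.evals import metric

DOLLARS_PER_1K_TOKENS = 0.00025  # placeholder rate; substitute your endpoint's pricing


@metric
def latency_seconds(trace):
    """Wall-clock latency of the whole trace, in seconds."""
    return trace.info.execution_time_ms / 1000.0


@metric
def cost_dollars(trace):
    """Approximate cost from total tokens recorded on the trace's spans.

    Where token usage is stored depends on the agent flavor; this sketch
    assumes spans expose a `usage` attribute dict with a `total_tokens` key.
    """
    total_tokens = 0
    for span in trace.data.spans:
        usage = (span.attributes or {}).get("usage") or {}
        total_tokens += usage.get("total_tokens", 0)
    return total_tokens / 1000.0 * DOLLARS_PER_1K_TOKENS
```

These functions are then passed to the evaluation via `mlflow.evaluate(..., extra_metrics=[latency_seconds, cost_dollars])`, alongside the built-in judges.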
Run the evaluation
Detected issues
By looking at the evaluation results, we see a couple of issues:
- The agent called the `multiply` tool when the query required summation.
- The question about Spark is not represented in our dataset, and the `chunk_relevance` judge caught this issue.
- The LLM responds to pricing questions, which violates our guideline.
We also see that the agent correctly used the `multiply` tool and the `query_docs` tool for the other 2 queries!
Fix issues and re-evaluate
Now that we have an evaluation set and judges we can run, let's attempt to fix the issues by:
- Improving our system prompt to let the agent know it's OK if no tools are called
- Adding a doc about the latest Spark version to our knowledge base
- Adding a new addition tool (see the sketch below)
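For instance, the addition tool and the relaxed system prompt might look roughly like this, reusing `llm`, `multiply`, and `query_docs` from the earlier sketch:

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b


# Make it explicit that answering without calling a tool is acceptable.
system_prompt = (
    "You are a helpful Databricks assistant. Use a tool only when it helps; "
    "it is perfectly fine to answer directly without calling any tool."
)

# `prompt=` is named `state_modifier=` in some older langgraph releases.
agent = create_react_agent(llm, tools=[multiply, add, query_docs], prompt=system_prompt)
```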

Collect expectations (ground-truth labels)
Now that we have improved our agent, we want to make sure that certain responses always get the facts right.
Using the Review app (AWS | Azure), we will send our evals to a labeling session for our SMEs to provide:
- `expected_facts`, so we can benefit from the `correctness` (AWS | Azure) and `context_sufficiency` (AWS | Azure) judges
- `guidelines`, so our SMEs can add additional plain-language criteria for each question based on their business context; this extends the guidelines we already defined at a global level
- Whether they liked the response, so our stakeholders can have confidence that the new model is indeed better; we do this using a custom label schema
Note: This labeling session uses pre-computed traces from our previous evaluation run instead of a live agent. See the end of the notebook for how to deploy your agent to Databricks.
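A rough sketch of creating that labeling session; the `review_app` module path, schema constants, and method names below reflect one reading of the review app Python SDK and should be verified against the linked docs:

```python
import mlflow
from databricks.agents import review_app  # assumed module path for the review app SDK

my_app = review_app.get_review_app()

# Labeling session asking SMEs for expected facts and per-question guidelines.
# A custom label schema (e.g. "Did you like this response?") can be created with
# my_app.create_label_schema(...) and added to the list below.
session = my_app.create_labeling_session(
    name="agent_eval_demo_labels",
    label_schemas=[
        review_app.label_schemas.EXPECTED_FACTS,
        review_app.label_schemas.GUIDELINES,
    ],
)

# Send the pre-computed traces from the previous evaluation run to the SMEs.
eval_run_id = mlflow.last_active_run().info.run_id
session.add_traces(mlflow.search_traces(run_id=eval_run_id))
```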

Re-evaluation with the collected expected_facts
After the SMEs are done with the labeling, we will sync the labels into our evaluation dataset and re-evaluate. Note that the `correctness` judge should run for any eval row with `expected_facts`.
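For reference, once the labels are synced, an evaluation record with expectations looks conceptually like this (the content is illustrative):

```python
# Illustrative shape of an evaluation row after SME labels have been synced.
# Rows carrying expected_facts enable the correctness and context_sufficiency judges.
labeled_record = {
    "request": {"messages": [{"role": "user", "content": "What is the latest Spark version on Databricks?"}]},
    "expected_facts": [
        "Names the Spark version shipped with the latest Databricks Runtime.",
    ],
    "guidelines": ["The response must state which Databricks Runtime version it applies to."],
}
```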
Optional: Deploying the Agent in Databricks
Log the agent as an MLflow model
Store the latest agent into a standalone `agent.py` file and log it as code. See MLflow - Models from Code.
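A sketch of logging the agent with Models from Code, assuming `agent.py` calls `mlflow.models.set_model(...)`; the requirement list and endpoint name are illustrative:

```python
import mlflow
from mlflow.models.resources import DatabricksServingEndpoint

with mlflow.start_run():
    logged_agent = mlflow.pyfunc.log_model(
        python_model="agent.py",   # Models from Code: path to the standalone agent script
        artifact_path="agent",
        pip_requirements=["mlflow", "langgraph", "databricks-langchain"],
        # Declare dependent resources so deployment can provision credentials
        # (the endpoint name is the placeholder used earlier in this notebook).
        resources=[DatabricksServingEndpoint(endpoint_name="databricks-meta-llama-3-3-70b-instruct")],
    )
```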
Register the model to Unity Catalog and deploy
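A sketch of the registration and deployment step; the Unity Catalog model name below is illustrative:

```python
import mlflow
from databricks import agents

# Register the logged model to Unity Catalog.
mlflow.set_registry_uri("databricks-uc")
uc_model_name = f"{uc_catalog}.{uc_schema}.agent_eval_demo_agent"
registered = mlflow.register_model(model_uri=logged_agent.model_uri, name=uc_model_name)

# Deploy a Model Serving endpoint with the Review app and inference tables enabled.
deployment = agents.deploy(uc_model_name, registered.version)
```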
Label a live agent
Let's create another labeling session that talks to our newly deployed agent. Instead of adding traces, we will add our evaluation dataset into the session. By calling `add_agent()`, we also enable the Review App's live chat mode, which allows users to have an open-ended chat with your agent.
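A sketch of that live session, reusing `my_app`, `uc_model_name`, and `uc_table` from the earlier sketches; the exact argument names for `add_agent()` and `add_dataset()` are assumptions, so check the review app SDK reference for the current signatures:

```python
# Live labeling session backed by the deployed agent rather than pre-computed traces.
live_session = my_app.create_labeling_session(
    name="live_agent_labels",
    label_schemas=[review_app.label_schemas.GUIDELINES],
)
live_session.add_agent(agent_name=uc_model_name)   # assumed argument name; enables live chat with the deployed agent
live_session.add_dataset(dataset_name=uc_table)    # assumed argument name; seeds the session with our evaluation dataset
```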
