Mosaic AI Agent Evaluation: Custom metrics, guidelines and domain expert labels
This notebook demonstrates how to evaluate a GenAI app using Agent Evaluation's proprietary LLM judges, custom metrics, and labels from domain experts. It covers:
- Loading production logs (traces) into an evaluation dataset
- Running evaluation and doing root cause analysis
- Writing custom metrics to automatically detect quality issues
- Sending production logs for SMEs to label and evolve the evaluation dataset
To get your agent ready for pre-production, see the Mosaic AI agent demo notebook (AWS | Azure).
To learn more about Mosaic AI Agent Evaluation, see Databricks documentation (AWS | Azure).
Requirements
- See the requirements of Agent Evaluation (AWS | Azure)
- Serverless or classic cluster running Databricks Runtime 15.4 LTS or above, or Databricks Runtime for Machine Learning 15.4 LTS or above.
- CREATE TABLE access in a Unity Catalog Schema

Select a Unity Catalog schema
Ensure you have CREATE TABLE access in this schema. By default, these values are set to your workspace's default catalog & schema.
A simple tool calling agent
Below is a simple tool-calling agent, built with LangGraph, that has 2 tools:
- `multiply`, which takes 2 numbers and multiplies them
- `query_docs`, which takes a set of keywords and returns relevant docs about Databricks using keyword search
For the purposes of this demo notebook, it is not important how the agent code works - this demo focuses on how to evaluate the agent's quality.
Note: Agent Evaluation works with any GenAI app, no matter how it is built, as long as the app accepts a `Dict[str, Any]` input and returns a `Dict[str, Any]` output.
For more examples of tools to add to your agent, see Databricks documentation (AWS | Azure)
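For illustration, here is a minimal sketch of such an agent, assuming the `databricks-langchain` and `langgraph` packages are installed. The endpoint name and the toy document corpus are placeholders, not part of the original demo:

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from databricks_langchain import ChatDatabricks

# Placeholder corpus standing in for a real keyword-search index.
DOCS = {
    "autoscaling": "Databricks autoscaling adjusts cluster size based on workload.",
    "unity catalog": "Unity Catalog provides centralized governance for data and AI assets.",
}


@tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b


@tool
def query_docs(keywords: str) -> str:
    """Return Databricks docs that match the given keywords."""
    hits = [text for key, text in DOCS.items() if key in keywords.lower()]
    return "\n".join(hits) or "No matching documents found."


# Example endpoint name; use any chat model serving endpoint available in your workspace.
llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct")
agent = create_react_agent(llm, tools=[multiply, query_docs])
```

Invoking it with `agent.invoke({"messages": [{"role": "user", "content": "What is 2 times 3?"}]})` returns a dict of messages, which satisfies the `Dict[str, Any]` in / `Dict[str, Any]` out contract described above.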
Select (pre)production logs
Since this is a demo notebook, we generate example production logs below to demonstrate the new features in Agent Evaluation. We call our agent directly and log traces in MLflow.
NOTE: MLflow Tracing will visualize each trace (with pagination) in the cell output when you call your agent or retrieve traces using `mlflow.search_traces`.
Once you have an agent deployed on Databricks (deployment is covered at the end of this notebook), locate the `request_ids` to be reviewed from the `<model_name>_payload_request_logs` inference table. The inference table is in the same Unity Catalog catalog and schema where the model was registered. Sample code for this is near the bottom of this notebook.
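For this demo, generating the example logs and pulling them back out might look like the following sketch, which reuses the `agent` defined in the earlier sketch:

```python
import mlflow

# Automatically log an MLflow trace for every agent invocation.
mlflow.langchain.autolog()

example_requests = [
    {"messages": [{"role": "user", "content": "What is 5 times 7?"}]},
    {"messages": [{"role": "user", "content": "How does Databricks autoscaling work?"}]},
]
for request in example_requests:
    agent.invoke(request)

# Retrieve the logged traces as a pandas DataFrame (request, response, trace, ... columns).
traces = mlflow.search_traces()
```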
Load the traces into an evaluation dataset
Important: Before running this cell, ensure the values of the `uc_catalog` and `uc_schema` widgets are set to a Unity Catalog schema where you have CREATE TABLE permissions. Re-running this cell will re-create the evaluation dataset.
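A rough sketch of this step, assuming the `databricks-agents` evaluation dataset API; the `create_dataset` / `insert` names below are assumptions and may differ in your installed version, so check the linked documentation for the exact surface:

```python
from databricks.agents import datasets  # assumed module path for the evaluation dataset API

# Widget values selected at the top of the notebook.
uc_catalog = dbutils.widgets.get("uc_catalog")
uc_schema = dbutils.widgets.get("uc_schema")
uc_table = f"{uc_catalog}.{uc_schema}.agent_eval_dataset"

# Create the Unity Catalog-backed evaluation dataset and seed it with the traces gathered above.
dataset = datasets.create_dataset(uc_table)  # assumed function name
dataset.insert(traces)                       # assumed method name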
Run an evaluation
Agent Evaluation's built-in judges
- Judges that run without ground-truth labels or retrieval in traces:
  - `guidelines`: allows developers to write plain-language checklists or rubrics in their evaluation, improving transparency and trust with business stakeholders through easy-to-understand, structured grading rubrics
  - `safety`: makes sure the response is safe
  - `relevance_to_query`: makes sure the response is relevant to the query
- For traces with retrieved docs (spans of type `RETRIEVER`):
  - `groundedness`: detects hallucinations
  - `chunk_relevance`: chunk-level relevance to the query
- Later, when we collect ground-truth labels using the Review app, we will benefit from two more judges:
  - `correctness`: will be ignored until we collect labels like `expected_facts`
  - `context_sufficiency`: will be ignored until we collect labels like `expected_facts`
See the full list of built-in judges (AWS | Azure) and how to run a subset of judges or customize judges (AWS | Azure).
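Conceptually, the evaluation is a single `mlflow.evaluate` call with the Databricks agent evaluator. A sketch using the traces DataFrame from above; the guideline text is illustrative:

```python
import mlflow

# Plain-language guidelines applied to every evaluation row (illustrative content).
global_guidelines = {
    "no_pricing": ["The response must not discuss pricing; direct the user to their account team instead."],
}

results = mlflow.evaluate(
    data=traces,                     # DataFrame with request/response/trace columns from mlflow.search_traces()
    model_type="databricks-agent",   # selects Mosaic AI Agent Evaluation's built-in LLM judges
    evaluator_config={
        "databricks-agent": {
            "global_guidelines": global_guidelines,
        }
    },
)
display(results.tables["eval_results"])  # per-row judge assessments for root cause analysis
```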
Custom metrics
- Check the quality of tool calling:
  - `tool_calls_are_logical`: asserts that the tools selected in the trace were logical given the user's request
  - `grounded_in_tool_outputs`: asserts that the LLM's responses are grounded in the outputs of the tools and not hallucinated
- Measure the agent's cost & latency:
  - `latency`: extracts the latency from the MLflow trace
  - `cost`: extracts the total tokens used and multiplies by the LLM token rate
This notebook creates custom metrics (AWS | Azure) that use Mosaic AI callable judges. Custom metrics can be any Python function. More examples: (AWS | Azure).

Define the custom metrics
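The judge-backed metrics (`tool_calls_are_logical`, `grounded_in_tool_outputs`) follow the same `@metric` pattern but call the Mosaic AI callable judges. The sketch below shows the two simpler trace-based metrics; the token-usage location and the per-token rate are assumptions that depend on your model flavor and endpoint pricing:

```python
from databricks.agents.evals import metric

DOLLARS_PER_1K_TOKENS = 0.00025  # placeholder rate; substitute your endpoint's pricing


@metric
def latency_seconds(trace):
    """Wall-clock latency of the whole trace, in seconds."""
    return trace.info.execution_time_ms / 1000.0


@metric
def cost_dollars(trace):
    """Approximate cost from total tokens recorded on the trace's spans.

    Where token usage is stored depends on the agent flavor; this sketch
    assumes spans expose a `usage` attribute dict with a `total_tokens` key.
    """
    total_tokens = 0
    for span in trace.data.spans:
        usage = (span.attributes or {}).get("usage") or {}
        total_tokens += usage.get("total_tokens", 0)
    return total_tokens / 1000.0 * DOLLARS_PER_1K_TOKENS
```

These functions are then passed to the evaluation via `mlflow.evaluate(..., extra_metrics=[latency_seconds, cost_dollars])`, alongside the built-in judges.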
Run the evaluation
Detected issues
By looking at the evaluation results, we see a couple of issues:
- The agent called the `multiply` tool when the query required summation.
- The question about Spark is not represented in our dataset, and the `chunk_relevance` judge caught this issue.
- The LLM responds to pricing questions, which violates our guideline.
We also see that the agent correctly used the `multiply` tool and the `query_docs` tool for the other 2 queries!
Fix issues and re-evaluate
Now that we have an evaluation set and judges we can run, let's attempt to fix the issues by:
- Improving our system prompt to let the agent know it's OK if no tools are called
- Adding a doc about the latest Spark version to our knowledge base
- Adding a new addition tool (see the sketch below)
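For instance, the addition tool and the relaxed system prompt might look roughly like this, reusing `llm`, `multiply`, and `query_docs` from the earlier sketch:

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b


# Make it explicit that answering without calling a tool is acceptable.
system_prompt = (
    "You are a helpful Databricks assistant. Use a tool only when it helps; "
    "it is perfectly fine to answer directly without calling any tool."
)

# `prompt=` is named `state_modifier=` in some older langgraph releases.
agent = create_react_agent(llm, tools=[multiply, add, query_docs], prompt=system_prompt)
```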

Collect expectations (ground-truth labels)
Now that we have improved our agent, we want to make sure that certain responses always get the facts right.
Using the Review app (AWS | Azure), we will send our evals to a labeling session for our SMEs to provide:
- `expected_facts`, so we can benefit from the `correctness` (AWS | Azure) and `context_sufficiency` (AWS | Azure) judges
- `guidelines`, so our SMEs can add additional plain-language criteria for each question based on their business context; this extends the guidelines we already defined at a global level
- Whether they liked the response, so our stakeholders can have confidence that the new model is indeed better; we do this using a custom label schema
Note: This labeling session uses pre-computed traces from our previous evaluation run instead of a live agent. See the end of the notebook for how to deploy your agent to Databricks.
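A rough sketch of creating that labeling session; the `review_app` module path, schema constants, and method names below reflect one reading of the review app Python SDK and should be verified against the linked docs:

```python
import mlflow
from databricks.agents import review_app  # assumed module path for the review app SDK

my_app = review_app.get_review_app()

# Labeling session asking SMEs for expected facts and per-question guidelines.
# A custom label schema (e.g. "Did you like this response?") can be created with
# my_app.create_label_schema(...) and added to the list below.
session = my_app.create_labeling_session(
    name="agent_eval_demo_labels",
    label_schemas=[
        review_app.label_schemas.EXPECTED_FACTS,
        review_app.label_schemas.GUIDELINES,
    ],
)

# Send the pre-computed traces from the previous evaluation run to the SMEs.
eval_run_id = mlflow.last_active_run().info.run_id
session.add_traces(mlflow.search_traces(run_id=eval_run_id))
```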

Re-evaluation with the collected expected_facts
After the SMEs are done with the labeling, we will sync the labels into our evaluation dataset and re-evaluate. Note that the `correctness` judge should run for any eval row with `expected_facts`.
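For reference, once the labels are synced, an evaluation record with expectations looks conceptually like this (the content is illustrative):

```python
# Illustrative shape of an evaluation row after SME labels have been synced.
# Rows carrying expected_facts enable the correctness and context_sufficiency judges.
labeled_record = {
    "request": {"messages": [{"role": "user", "content": "What is the latest Spark version on Databricks?"}]},
    "expected_facts": [
        "Names the Spark version shipped with the latest Databricks Runtime.",
    ],
    "guidelines": ["The response must state which Databricks Runtime version it applies to."],
}
```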
Optional: Deploying the Agent in Databricks
Log the agent as an MLflow model
Store the latest agent into a standalone `agent.py` file and log it as code. See MLflow - Models from Code.
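A sketch of logging the agent with Models from Code, assuming `agent.py` calls `mlflow.models.set_model(...)`; the requirement list and endpoint name are illustrative:

```python
import mlflow
from mlflow.models.resources import DatabricksServingEndpoint

with mlflow.start_run():
    logged_agent = mlflow.pyfunc.log_model(
        python_model="agent.py",   # Models from Code: path to the standalone agent script
        artifact_path="agent",
        pip_requirements=["mlflow", "langgraph", "databricks-langchain"],
        # Declare dependent resources so deployment can provision credentials
        # (the endpoint name is the placeholder used earlier in this notebook).
        resources=[DatabricksServingEndpoint(endpoint_name="databricks-meta-llama-3-3-70b-instruct")],
    )
```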
Register the model to Unity Catalog and deploy
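A sketch of the registration and deployment step; the Unity Catalog model name below is illustrative:

```python
import mlflow
from databricks import agents

# Register the logged model to Unity Catalog.
mlflow.set_registry_uri("databricks-uc")
uc_model_name = f"{uc_catalog}.{uc_schema}.agent_eval_demo_agent"
registered = mlflow.register_model(model_uri=logged_agent.model_uri, name=uc_model_name)

# Deploy a Model Serving endpoint with the Review app and inference tables enabled.
deployment = agents.deploy(uc_model_name, registered.version)
```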
Label a live agent
Let's create another labeling session that talks to our newly deployed agent. Instead of adding traces, we will add our evaluation dataset into the session. By calling `add_agent()`, we also enable the Review App's live chat mode, which allows users to have an open-ended chat with your agent.
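A sketch of that live session, reusing `my_app`, `uc_model_name`, and `uc_table` from the earlier sketches; the exact argument names for `add_agent()` and `add_dataset()` are assumptions, so check the review app SDK reference for the current signatures:

```python
# Live labeling session backed by the deployed agent rather than pre-computed traces.
live_session = my_app.create_labeling_session(
    name="live_agent_labels",
    label_schemas=[review_app.label_schemas.GUIDELINES],
)
live_session.add_agent(agent_name=uc_model_name)   # assumed argument name; enables live chat with the deployed agent
live_session.add_dataset(dataset_name=uc_table)    # assumed argument name; seeds the session with our evaluation dataset
```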
