Mosaic AI Agent Evaluation: Custom metrics, guidelines and domain expert labels

This notebook demonstrates how to evaluate a GenAI app using Agent Evaluation's proprietary LLM judges, custom metrics, and labels from domain experts. It covers:

  • Loading production logs (traces) into an evaluation dataset
  • Running evaluation and doing root cause analysis
  • Writing custom metrics to automatically detect quality issues
  • Sending production logs for SMEs to label and evolve the evaluation dataset

To get your agent ready for pre-production, see the Mosaic AI agent demo notebook (AWS | Azure).

To learn more about Mosaic AI Agent Evaluation, see Databricks documentation (AWS | Azure).

Requirements

  • See the requirements of Agent Evaluation (AWS | Azure)
  • Serverless or classic cluster running Databricks Runtime 15.4 LTS or above, or Databricks Runtime for Machine Learning 15.4 LTS or above.
  • CREATE TABLE access in a Unity Catalog Schema

Select a Unity Catalog schema

Ensure you have CREATE TABLE access in this schema. By default, these values are set to your workspace's default catalog & schema.
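
A minimal sketch of how these widgets might be defined, assuming a Databricks notebook where dbutils is available (the default values are illustrative):

```python
# Create text widgets for the target Unity Catalog location (defaults are illustrative).
dbutils.widgets.text("uc_catalog", "main", "Unity Catalog catalog")
dbutils.widgets.text("uc_schema", "default", "Unity Catalog schema")

uc_catalog = dbutils.widgets.get("uc_catalog")
uc_schema = dbutils.widgets.get("uc_schema")
print(f"Evaluation dataset will be created in {uc_catalog}.{uc_schema}")
```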


A simple tool-calling agent

Below is a simple tool-calling agent, built with LangGraph, that has two tools:

  1. multiply, which takes two numbers and multiplies them
  2. query_docs, which takes a set of keywords and returns relevant Databricks docs using keyword search.

For the purposes of this demo notebook, it is not important how the agent code works - the focus is on how to evaluate the agent's quality.

Note: Agent Evaluation works with any GenAI app, no matter how it is built, as long as the app accepts a Dict[str, Any] input and returns a Dict[str, Any] output.

For more examples of tools to add to your agent, see Databricks documentation (AWS | Azure).
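
For illustration only, the two tools described above might look roughly like this (a sketch using the LangChain @tool decorator; the keyword-search logic is a stand-in, not the notebook's actual implementation):

```python
from langchain_core.tools import tool

@tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the product."""
    return a * b

@tool
def query_docs(keywords: str) -> str:
    """Return Databricks docs snippets matching the given keywords (stand-in implementation)."""
    # A real agent would call a keyword-search index here; this stub just echoes the query.
    return f"Docs matching '{keywords}': ..."
```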


Select (pre)production logs

Since this is a demo notebook, we generate example production logs below to demonstrate the new features in Agent Evaluation. We call our agent directly and log traces in MLflow.

NOTE: MLflow Tracing will visualize each trace (with pagination) in the cell output when you call your agent or retrieve traces using mlflow.search_traces.

After you've completed this notebook, if you already have an agent deployed on Databricks, you can locate the request_ids to be reviewed in the <model_name>_payload_request_logs inference table. The inference table is in the same Unity Catalog catalog and schema where the model was registered. Sample code for this is near the bottom of this notebook.
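
A rough sketch of how the example logs are produced and read back, assuming the LangGraph agent above is exposed as a callable named agent with MLflow Tracing enabled (the example requests are illustrative):

```python
import mlflow

example_requests = [
    {"messages": [{"role": "user", "content": "What is 5 times 7?"}]},
    {"messages": [{"role": "user", "content": "What is the latest Spark version?"}]},
]

# Each call is traced by MLflow; those traces become our (pre)production logs.
for request in example_requests:
    agent.invoke(request)  # `agent` is the LangGraph agent defined above

# Read the logged traces back as a pandas DataFrame.
traces_df = mlflow.search_traces(max_results=10)
display(traces_df)
```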


Load the traces into an evaluation dataset

Important: Before running this cell, ensure the values of the uc_catalog and uc_schema widgets are set to a Unity Catalog schema where you have CREATE TABLE permissions. Re-running this cell will re-create the evaluation dataset.
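
One simple way to turn those traces into a Unity Catalog-backed evaluation dataset is sketched below; the table name and column selection are illustrative, and the evaluation dataset APIs described in the Databricks docs are the more complete path:

```python
import mlflow

uc_table = f"{uc_catalog}.{uc_schema}.agent_eval_dataset"  # illustrative table name

# Pull the traces logged above; `request_id` and `request` are the columns we keep here.
traces_df = mlflow.search_traces(max_results=10)
eval_df = traces_df[["request_id", "request"]]

# Persist as a Delta table so the dataset can be shared, labeled, and evolved over time.
spark.createDataFrame(eval_df).write.mode("overwrite").saveAsTable(uc_table)
```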


Run an evaluation

Agent Evaluation's built-in judges
• Judges that run without ground-truth labels or retrieval in traces:
  • guidelines: lets developers write plain-language checklists or rubrics for their evaluation, improving transparency and trust with business stakeholders through easy-to-understand, structured grading rubrics.
  • safety: checks that the response contains no harmful or toxic content
  • relevance_to_query: checks that the response is relevant to the user's request
• For traces with retrieved docs (spans of type RETRIEVER):
  • groundedness: detects responses that are not grounded in the retrieved context (hallucinations)
  • chunk_relevance: chunk-level relevance to the query
• Later, when we collect ground-truth labels using the Review App, we will benefit from two more judges:
  • correctness: will be ignored until we collect labels like expected_facts
  • context_sufficiency: will be ignored until we collect labels like expected_facts

See the full list of built-in judges (AWS | Azure) and how to run a subset of judges or customize judges (AWS | Azure).

Custom metrics
• Check the quality of tool calling
  • tool_calls_are_logical: asserts that the tools selected in the trace were logical given the user's request.
  • grounded_in_tool_outputs: asserts that the LLM's responses are grounded in the outputs of the tools rather than hallucinated.
• Measure the agent's cost & latency
  • latency: extracts the latency from the MLflow trace
  • cost: extracts the total tokens used and multiplies by the LLM token rate

This notebook creates custom metrics (AWS | Azure) that use Mosaic AI callable judges. Custom metrics can be any Python function. More examples: (AWS | Azure).

Define the custom metrics
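
As an illustration, the latency metric can be a plain Python function decorated as a custom metric. This sketch assumes the @metric decorator from databricks.agents.evals; check the custom-metrics docs linked above for the exact argument names your SDK version supports:

```python
from databricks.agents.evals import metric  # assumed import path for the custom-metric decorator

@metric
def latency_seconds(trace):
    """End-to-end latency of the trace, in seconds, taken from MLflow trace metadata."""
    return trace.info.execution_time_ms / 1000.0
```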


Run the evaluation
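
A hedged sketch of the evaluation call, reusing the dataset table and custom metric defined above; my_agent_fn is a hypothetical wrapper around the agent, and the global_guidelines wording and format are illustrative:

```python
import mlflow

eval_df = spark.table(uc_table).toPandas()  # dataset table created earlier

def my_agent_fn(request):
    # Hypothetical wrapper: call the LangGraph agent and return its dict response.
    return agent.invoke(request)

results = mlflow.evaluate(
    data=eval_df,
    model=my_agent_fn,                # a callable taking a request dict and returning a response dict
    model_type="databricks-agent",    # turns on Agent Evaluation's built-in judges
    extra_metrics=[latency_seconds],  # custom metric defined above
    evaluator_config={
        "databricks-agent": {
            # Illustrative global guideline; the exact format may vary by SDK version.
            "global_guidelines": {"no_pricing": ["The response must not discuss pricing."]},
        }
    },
)

display(results.tables["eval_results"])
```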


Detected issues

Looking at the evaluation results, we see a few issues:

• The agent called the multiply tool when the query required summation.
• The question about Spark is not represented in our dataset, and the chunk_relevance judge caught this issue.
• The LLM responds to pricing questions, which violates our guideline.

We also see that the agent correctly used the multiplication tool and the query_docs tool for the other two queries!

Fix issues and re-evaluate

Now that we have an evaluation set and judges we can rerun, let's attempt to fix the issues by:

• Improving our system prompt to let the agent know it's OK if no tools are called
• Adding a doc about the latest Spark version to our knowledge base
• Adding a new addition tool
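
A sketch of what the prompt tweak and the new addition tool might look like (both are illustrative, not the notebook's exact code):

```python
from langchain_core.tools import tool

@tool
def add(a: float, b: float) -> float:
    """Add two numbers and return the sum."""
    return a + b

# Illustrative system prompt that makes tool use optional and enforces the pricing guideline.
system_prompt = (
    "You are a helpful Databricks assistant. Use a tool only when it is needed; "
    "it is fine to answer directly without calling any tool. "
    "Do not answer questions about pricing."
)
```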

Collect expectations (ground-truth labels)

Now that we have improved our agent, we want to make sure that certain responses always get the facts right.

Using the Review App (AWS | Azure), we will send our evals to a labeling session for our SMEs to provide:

• expected_facts, so we can benefit from the correctness (AWS | Azure) and context_sufficiency (AWS | Azure) judges.
• guidelines, so our SMEs can add additional plain-language criteria for each question based on their business context. This extends the guidelines we have already defined at a global level.
• Whether they liked the response, so our stakeholders can have confidence that the new model is indeed better. We do this using a custom label schema.

Note: This labeling session uses pre-computed traces from our previous evaluation run instead of a live agent. See the end of the notebook for how to deploy your agent to Databricks.
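
The flow looks roughly like the sketch below. The module and method names (get_review_app, create_labeling_session, add_traces, and the built-in label schema constants) are assumptions based on the Review App docs linked above and may differ between SDK versions, so verify them before running:

```python
import mlflow
from databricks.agents import review_app

# Assumed API surface; check the Review App docs for exact names and signatures.
my_app = review_app.get_review_app()

session = my_app.create_labeling_session(
    name="agent_eval_expectations",
    assigned_users=["sme@example.com"],           # illustrative SME list
    label_schemas=[
        review_app.label_schemas.EXPECTED_FACTS,  # assumed built-in schema name
        review_app.label_schemas.GUIDELINES,      # assumed built-in schema name
        "response_quality",                       # hypothetical custom label schema
    ],
)

# Send the pre-computed traces from the earlier evaluation run to the SMEs.
session.add_traces(mlflow.search_traces(run_id=results.run_id))
```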


Re-evaluate with the collected expected_facts

After the SMEs have finished labeling, we sync the labels into our evaluation dataset and re-evaluate. Note that the correctness judge will now run for any eval row that has expected_facts.
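
Once the labels have been synced back into the dataset table, re-evaluation is just another mlflow.evaluate call over the updated data; a sketch, reusing names from the earlier sketches:

```python
import mlflow

# Reload the dataset, which now carries the SME-provided expected_facts and guidelines.
labeled_df = spark.table(uc_table).toPandas()

results_v2 = mlflow.evaluate(
    data=labeled_df,
    model=my_agent_fn,              # same hypothetical wrapper as before
    model_type="databricks-agent",  # correctness/context_sufficiency now run where expected_facts exist
)
```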


Optional: Deploy the agent to Databricks

Log the agent as an MLflow model

Store the latest agent code in a standalone agent.py file and log it as code. See MLflow - Models from Code.
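
A minimal models-from-code sketch (the artifact path, input example, and pip requirements are illustrative):

```python
import mlflow

with mlflow.start_run():
    logged_agent = mlflow.pyfunc.log_model(
        artifact_path="agent",
        python_model="agent.py",  # models-from-code: path to the standalone agent file
        input_example={"messages": [{"role": "user", "content": "What is 2 times 3?"}]},
        pip_requirements=["mlflow", "langgraph", "langchain"],  # illustrative
    )
```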


Register the model to Unity Catalog and deploy
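
A sketch of registering the logged model to Unity Catalog and deploying it with Agent Framework (the model name is illustrative):

```python
import mlflow
from databricks import agents

mlflow.set_registry_uri("databricks-uc")

uc_model_name = f"{uc_catalog}.{uc_schema}.tool_calling_agent"  # illustrative model name
registered = mlflow.register_model(logged_agent.model_uri, uc_model_name)

# Creates a serving endpoint plus the feedback/Review App plumbing for the agent.
deployment = agents.deploy(uc_model_name, registered.version)
```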


Label a live agent

Let's create another labeling session that talks to our newly deployed agent. Instead of adding traces, we add our evaluation dataset to the session. By calling add_agent(), we also enable the Review App's live chat mode, which allows users to have an open-ended chat with your agent.
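
Roughly, and again with assumed Review App method names (add_agent is the call referred to above; add_dataset and the argument names are assumptions to verify against the docs):

```python
# Assumed API surface; check the Review App docs for exact names and signatures.
live_session = my_app.create_labeling_session(
    name="live_agent_labeling",
    assigned_users=["sme@example.com"],  # illustrative
)

# Enable live chat against the deployed agent and seed the session with our eval dataset.
live_session.add_agent(agent_name=deployment.endpoint_name)  # assumed argument name
live_session.add_dataset(uc_table)                           # assumed method/argument
```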


Next steps

After your agent is deployed, you can:

• Chat with it in AI Playground (AWS | Azure).
• In the Review App (AWS | Azure), try the following:
  • Collect general feedback using 'Chat with the bot'
  • Collect labels from SMEs in a labeling session
• Use it in your production application (AWS | Azure).