
Custom Metrics in Mosaic AI Agent Evaluation

This notebook will show you a few different ways to use Custom Metrics in Mosaic AI Agent Evaluation. For more information on custom metrics, see this guide. The API reference for the @metric decorator can be found here.

We currently support:

  1. boolean metrics
  2. float & integer metrics. These will be treated as ordinal values. The UI will let you sort by these values, and show averages along any slice.
  3. Pass/Fail metrics from callable judges.

There is also a section on best practices for building metrics.


Custom boolean metrics

Simple heuristic: language-model self-reference

This metric checks whether the model response mentions "LLM" and returns True if it does.
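
A minimal sketch of what this metric might look like, assuming the @metric decorator from databricks.agents.evals and a response that is either a plain string or a ChatCompletion-style dict (see the API reference linked above for the exact argument shapes):

```python
from databricks.agents.evals import metric


@metric
def mentions_llm(response):
    # Agents may return a plain string or a ChatCompletion-style dict
    # (an assumption of this sketch); normalize to text either way.
    if isinstance(response, dict):
        text = response.get("choices", [{}])[0].get("message", {}).get("content") or ""
    else:
        text = str(response)
    # Boolean metrics simply return True or False.
    return "LLM" in text
```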


Pass/Fail metrics & callable judges

Example: Check input requests are properly formatted

This metric checks whether the incoming request is formatted as expected and returns True if it is.
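
A sketch of such a check, assuming requests arrive in the ChatCompletion format (a dict with a "messages" list); adapt the validation to whatever shape your agent accepts:

```python
from databricks.agents.evals import metric


@metric
def request_is_well_formed(request):
    # Assumption: requests follow the ChatCompletion format -- a dict with a
    # non-empty "messages" list whose entries each have "role" and "content".
    if not isinstance(request, dict):
        return False
    messages = request.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    )
```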


Ensure the retrieved context has no PII

In this example, we will call the guideline adherence judge to ensure that the retrieved context has no PII.
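
A sketch of this metric. It assumes retrieved_context is passed as a list of chunk dicts with a "content" key and that the callable guideline-adherence judge accepts a guidelines_context dict; consult the API reference linked above for the exact signature.

```python
from databricks.agents.evals import judges, metric


@metric
def no_pii_in_context(request, retrieved_context):
    # Concatenate the retrieved chunks into one block of text for the judge.
    # Assumption: retrieved_context is a list of dicts with a "content" key.
    context_text = "\n\n".join(
        chunk.get("content", "") for chunk in (retrieved_context or [])
    )
    # The callable guideline-adherence judge returns a pass/fail Assessment with
    # a rationale; the argument names here are a best guess, so check the API
    # reference for the exact signature.
    return judges.guideline_adherence(
        request=request["messages"][-1]["content"],
        guidelines=[
            "The retrieved context must not contain any personally identifiable "
            "information (PII) such as names, emails, or phone numbers."
        ],
        guidelines_context={"retrieved_context": context_text},
    )
```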


Custom float metric

This example uses Python's built-in difflib module to measure the similarity between the response and the expected_response and emits it as a float.
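
A sketch of this metric, assuming both the response and expected_response can be treated as plain strings:

```python
import difflib

from databricks.agents.evals import metric


@metric
def response_similarity(response, expected_response):
    # Assumption: both values are (or can be coerced to) plain strings.
    ratio = difflib.SequenceMatcher(
        a=str(expected_response), b=str(response)
    ).ratio()
    # Float metrics are treated as ordinal values: sortable in the UI and
    # averaged across slices.
    return ratio
```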


Use custom_expected to pass extra expected information to custom metrics

In this example, we'll assert that the length of the response is within (min_length, max_length) bounds that we set per-example. We can use custom_expected to store any row-level information that will be passed to custom metrics when creating an assessment.
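
A sketch of this metric; the "min_length" and "max_length" keys are hypothetical names chosen for this example and must match whatever you store in custom_expected in the eval set:

```python
from databricks.agents.evals import metric


@metric
def response_length_in_bounds(response, custom_expected):
    # custom_expected carries arbitrary row-level values from the eval set; here
    # we assume each row stores hypothetical "min_length" and "max_length" keys.
    expected = custom_expected or {}
    min_length = expected.get("min_length", 0)
    max_length = expected.get("max_length", float("inf"))
    return min_length <= len(str(response)) <= max_length
```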


Compute multiple assessments with a single metric function

You can also compute multiple assessments with a single metric function, reusing shared computation by returning a list of Assessment objects.
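
A sketch of this pattern, assuming Assessment can be constructed with a name, a value, and an optional rationale (see the API reference for the exact fields):

```python
from databricks.agents.evals import Assessment, metric


@metric
def response_checks(response):
    # Do any shared (potentially expensive) computation once...
    text = str(response)
    # ...then return several named assessments from the same metric function.
    return [
        Assessment(name="response_length", value=len(text)),
        Assessment(name="mentions_llm", value="LLM" in text),
        Assessment(
            name="ends_with_period",
            value=text.strip().endswith("."),
            rationale="Responses should read as complete sentences.",
        ),
    ]
```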


Assertions over traces

Custom metrics can see the entire MLflow Trace, so you can write metrics that measure the internals of your application.

Example: request classification & routing

In this example, we will build an agent that simply determines whether the user query is a question or a statement and returns the classification in plain English to the user. In a more realistic scenario, you might use this technique to route queries to different functionality.

Our evaluation will ensure that the query-type classifier produces the right results for a set of inputs, using custom metrics that inspect the MLflow trace.
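
A sketch of such a metric. It assumes the classifier runs in a span named "classify_query" (a hypothetical name), that the span's output is the predicted label, and that the expected label is stored under custom_expected["expected_query_type"]; Trace.search_spans requires a recent MLflow version.

```python
from databricks.agents.evals import metric


@metric
def query_type_is_correct(trace, custom_expected):
    # Assumptions: the classifier runs in a span named "classify_query"
    # (hypothetical), the span's output is the predicted label, and the eval set
    # stores the expected label under custom_expected["expected_query_type"].
    spans = trace.search_spans(name="classify_query")
    if not spans:
        return False
    predicted = spans[0].outputs
    return predicted == (custom_expected or {}).get("expected_query_type")
```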


[Agent] Keyword RAG Agent

The rest of this notebook uses the Keyword RAG Agent defined in the hidden cells below to demonstrate a realistic example of using custom metrics. The details of this agent are described later, when we evaluate it.


Developing metrics

When developing metrics, we want to be able to quickly iterate on the metric without having to call the harness and execute the Agent every time we make a change. To make this simpler we will use the following strategy:

  1. Generate an answer sheet from our eval dataset & agent. This executes the Agent for each of the entries in our evaluation set, generating responses & traces that we can use to call the metric directly.
  2. Define the metric.
  3. Call the metric for each value in the answer sheet & iterate on the metric definition.
  4. Once the metric is doing what we intend, run mlflow.evaluate() on the same answer sheet to verify that the harness & UI behave as expected. Here we omit the model= argument so the pre-computed responses are used.
  5. Re-enable the model= argument so that future calls to mlflow.evaluate() invoke the Agent directly.

In the example below, we use the Keyword RAG Agent defined in the hidden cells above to demonstrate this dev cycle.
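
A sketch of this workflow. Here eval_set, keyword_rag_agent, and my_metric are placeholders for the eval DataFrame, the agent from the hidden cells, and the metric under development; the eval_results table and its column names follow the Agent Evaluation output schema as best understood.

```python
import mlflow

# 1. Generate the answer sheet: run the agent once over the eval set and keep the
#    per-row responses and traces. `eval_set` (a DataFrame with a `request`
#    column), `keyword_rag_agent`, and `my_metric` are placeholders for objects
#    defined elsewhere in this notebook.
results = mlflow.evaluate(
    data=eval_set,
    model=keyword_rag_agent,
    model_type="databricks-agent",
)
answer_sheet = results.tables["eval_results"][["request", "response", "trace"]]

# 2./3. Iterate on the metric by calling it directly on each answer-sheet row --
#       no harness or agent re-execution needed. (If the decorated metric cannot
#       be called directly, keep an undecorated copy of the function for this step.)
for _, row in answer_sheet.iterrows():
    print(my_metric(request=row["request"], response=row["response"]))

# 4. Verify the harness and UI against the same pre-computed responses by
#    omitting `model=`.
mlflow.evaluate(
    data=answer_sheet,
    model_type="databricks-agent",
    extra_metrics=[my_metric],
)
```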


Example: Make sure the right keywords are extracted from the prompt in our Keyword RAG

In this example, we will define a simple RAG agent that:

  • Extracts salient keywords from the user query. The function that extracts keywords is a span of type PARSER
  • Finds chunks that contain these keywords.
  • Passes them to an LLM to generate a response.

We will assert that the keywords extracted are correct, and that the chunks contain the keywords.

While this example is a simple RAG, this approach can be generalized to any agentic system.

See the hidden code cells below for the definition of the Agent.
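
A sketch of the two assertions. It assumes the keyword extractor's span outputs the list of keywords, that the expected keywords are stored under a hypothetical custom_expected["expected_keywords"] key, and that a recent MLflow version providing Trace.search_spans is installed.

```python
from mlflow.entities import SpanType

from databricks.agents.evals import metric


@metric
def correct_keywords_extracted(trace, custom_expected):
    # The keyword-extraction step is traced as a span of type PARSER (see the
    # description above); its output is assumed to be the list of keywords.
    parser_spans = trace.search_spans(span_type=SpanType.PARSER)
    if not parser_spans:
        return False
    extracted = {str(k).lower() for k in (parser_spans[0].outputs or [])}
    # Expected keywords are stored per row under a hypothetical custom_expected key.
    expected = {str(k).lower() for k in (custom_expected or {}).get("expected_keywords", [])}
    return extracted == expected


@metric
def chunks_contain_keywords(trace, retrieved_context):
    parser_spans = trace.search_spans(span_type=SpanType.PARSER)
    if not parser_spans:
        return False
    keywords = [str(k).lower() for k in (parser_spans[0].outputs or [])]
    # Every retrieved chunk should contain at least one of the extracted keywords.
    return all(
        any(kw in chunk.get("content", "").lower() for kw in keywords)
        for chunk in (retrieved_context or [])
    )
```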


Realistic example: custom metrics for a tool-calling Agent

In this example, we will define a simple tool-calling agent that has access to two tools, add and multiply.

We will show two ways to determine whether the tool call choice was "correct".

While this example uses a simple tool-calling agent, this approach can be generalized to any agentic system.

See the hidden code cells below for the definition of the Agent.

Define the tool-calling agent

Approach 1: Define the expected tool to be chosen

In this approach, we explicitly define the tool name that we expect to be called and verify that the tool is called.


Approach 2: Judge whether the tool choice was reasonable

This approach uses the available_tools attribute which contains a list of possible tools that can be called. It then uses the guidelines judge to assess if the tool call choice is reasonable given the list of available tools.
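
A sketch of this approach. It assumes custom metrics can accept a tool_calls argument whose entries expose tool_name and available_tools attributes, and that the callable guideline-adherence judge accepts a guidelines_context dict; check the API reference for the exact shapes.

```python
from databricks.agents.evals import judges, metric


@metric
def tool_choice_is_reasonable(request, tool_calls):
    # Assumptions: `tool_calls` entries expose `tool_name` and `available_tools`,
    # and the guideline-adherence judge accepts a `guidelines_context` dict.
    if not tool_calls:
        return False
    call = tool_calls[0]
    return judges.guideline_adherence(
        request=request["messages"][-1]["content"],
        guidelines=[
            "The chosen tool must be a reasonable choice for answering the "
            "user's request, given the list of available tools."
        ],
        guidelines_context={
            "chosen_tool": str(call.tool_name),
            "available_tools": str(call.available_tools),
        },
    )
```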


Realistic example: multi-turn evaluation of a tool-calling agent

In this example, we use the same tool-calling agent as above.

We will show an example of how to extract the message history from the request and create a custom metric over it.

Helper to extract the message history
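
A sketch of such a helper plus a metric built on top of it, assuming multi-turn requests arrive in the ChatCompletion format with the full conversation in request["messages"]; the length check is a hypothetical example of a metric over the history.

```python
from databricks.agents.evals import metric


def extract_message_history(request):
    """Split a ChatCompletion-style request into (history, latest user message).

    Assumption: multi-turn requests carry the whole conversation in
    request["messages"], with the final entry being the current user turn.
    """
    messages = request.get("messages", [])
    if not messages:
        return [], None
    return messages[:-1], messages[-1]


@metric
def history_is_reasonably_short(request):
    # A hypothetical metric over the extracted history: flag conversations that
    # have grown past 10 prior turns.
    history, _ = extract_message_history(request)
    return len(history) <= 10
```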
