Custom Metrics in Mosaic AI Agent Evaluation
This notebook will show you a few different ways to use custom metrics in Mosaic AI Agent Evaluation. For more information on custom metrics, see this guide. The API reference for the @metric decorator can be found here.
We currently support:
- boolean metrics
- float & integer metrics. These will be treated as ordinal values. The UI will let you sort by these values, and show averages along any slice.
- Pass/Fail metrics from callable judges.
There is also a section on best practices for building metrics.
Custom boolean metrics
Simple heuristic: language-model self-reference
This metric simply checks whether the model response mentions "LLM". If it does, the metric returns True.
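A minimal sketch of this metric, assuming the @metric decorator from databricks.agents.evals and that the stringified model response is available through the response argument:

```python
from databricks.agents.evals import metric


@metric
def mentions_llm(response):
    # Boolean metric: True if the response text mentions "LLM" anywhere.
    return "LLM" in str(response)
```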
Pass/Fail metrics & callable judges
Example: Check that input requests are properly formatted
This metric checks if the arbitrary input is formatted as expected and returns True if it is.
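A hedged sketch, assuming the raw request payload arrives as a chat-completions style dict via the request argument and that the harness treats the returned strings "yes"/"no" as Pass/Fail:

```python
from databricks.agents.evals import metric


@metric
def request_is_well_formed(request):
    # Pass/Fail metric: "yes" if the request looks like a chat-completions
    # payload with a non-empty "messages" list, "no" otherwise.
    if not isinstance(request, dict):
        return "no"
    messages = request.get("messages")
    if not isinstance(messages, list) or not messages:
        return "no"
    # Every message should carry at least a role and some content.
    well_formed = all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    )
    return "yes" if well_formed else "no"
```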
Ensure the retrieved context has no PII
In this example, we will call the guideline adherence judge to ensure that the retrieved context has no PII.
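One possible sketch, assuming retrieved_context is a list of chunk dicts with a content field and that the callable judge judges.guideline_adherence accepts request, response, and guidelines arguments (check the API reference for the exact signature):

```python
from databricks.agents.evals import judges, metric


@metric
def no_pii_in_context(request, retrieved_context):
    # Concatenate the retrieved chunks into a single piece of text.
    context_text = "\n".join(
        chunk.get("content", "") for chunk in (retrieved_context or [])
    )
    # Ask the guideline-adherence judge whether that text is PII-free.
    return judges.guideline_adherence(
        request=str(request),
        response=context_text,
        guidelines=[
            "The content must not contain personally identifiable information "
            "such as names, email addresses, phone numbers, or account numbers.",
        ],
    )
```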
Custom float metric
This example will use the built-in difflib to measure the similarity between the response and the expected_response and emit it as a float.
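A minimal sketch using difflib.SequenceMatcher; the response and expected_response arguments are assumed to be (convertible to) plain strings:

```python
import difflib

from databricks.agents.evals import metric


@metric
def response_similarity(response, expected_response):
    # Float metric in [0, 1]: how similar the response text is to the
    # expected response, as measured by difflib's SequenceMatcher.
    return difflib.SequenceMatcher(
        None, str(response), str(expected_response)
    ).ratio()
```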
Use custom_expected to pass extra expected information to custom metrics
In this example, we'll assert that the length of the response is within (min_length, max_length) bounds that we set per example. We can use custom_expected to store any row-level information that will be passed to custom metrics when creating an assessment.
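A sketch of such a metric; the min_length and max_length keys are names we choose when building the eval set, not part of the API:

```python
from databricks.agents.evals import metric


@metric
def response_length_in_bounds(response, custom_expected):
    # custom_expected carries the per-row extras we stored in the eval set,
    # e.g. {"min_length": 20, "max_length": 500} (keys chosen for this sketch).
    min_length = custom_expected.get("min_length", 0)
    max_length = custom_expected.get("max_length", float("inf"))
    return min_length <= len(str(response)) <= max_length
```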
Compute multiple assessments with a single metric function
You can also compute multiple assessments with a single metric function, reusing shared computation, by returning a list of Assessment objects.
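A sketch of the pattern, assuming Assessment can be imported from databricks.agents.evals and constructed with name and value keyword arguments (check the API reference for the exact constructor):

```python
from databricks.agents.evals import Assessment, metric


@metric
def length_checks(response):
    # Do the (shared) length computation once, then emit two assessments.
    n_chars = len(str(response))
    return [
        Assessment(name="response_length", value=n_chars),
        Assessment(name="response_is_concise", value=n_chars <= 500),
    ]
```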
Assertions over traces
Custom metrics can see the entire MLflow Trace, so you can write metrics that measure the internals of your application.
Example: request classification & routing
In this example, we will build an agent that simply determines whether the user query is a question or a statement and returns the answer to the user in plain English. In a more realistic scenario, you might use this technique to route queries to different functionality.
Our evaluation set will ensure that the query-type classifier produces the right results for a set of inputs by using custom metrics that inspect the MLflow trace.
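A sketch of such a trace-inspecting metric, assuming the classifier runs inside a span named classify_query (a name chosen for this sketch) whose output is the predicted label, and that the expected label is supplied per-row via custom_expected:

```python
from databricks.agents.evals import metric


@metric
def query_classified_correctly(trace, custom_expected):
    # Find the span that performed the classification. The span name is
    # hypothetical -- use whatever name your application gives that step.
    spans = [s for s in trace.data.spans if s.name == "classify_query"]
    if not spans:
        return False
    predicted = spans[0].outputs  # e.g. "question" or "statement"
    return predicted == custom_expected.get("expected_query_type")
```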
[Agent] Keyword RAG Agent
The rest of the docs below will use the Keyword RAG Agent defined in the hidden cells below to demonstrate a realistic example of using custom metrics. The details of this Agent are described in the evaluation example below.
Developing metrics
When developing metrics, we want to be able to quickly iterate on the metric without having to call the harness and execute the Agent every time we make a change. To make this simpler we will use the following strategy:
- Generate an answer sheet from our eval dataset & agent. This executes the Agent for each of the entries in our evaluation set, generating responses & traces that we can use to call the metric directly.
- Define the metric.
- Call the metric for each value in the answer sheet & iterate on the metric definition.
- Once the metric is doing what we intend, we can run mlflow.evaluate() on the same answer sheet to verify that the harness & UI behave as we expect. Here we disable the model= field so that the pre-computed responses are used.
- Re-enable the model= field so that future calls to mlflow.evaluate() invoke the Agent interactively.
In the example below, we use the Keyword RAG Agent defined in the hidden cells above to demonstrate this dev cycle.
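The loop might look roughly like this; the eval_dataset, my_agent, and my_metric names are placeholders, the eval_results table and its request/response/trace columns are assumed from the harness output, and a decorated metric may need to stay a plain function while you iterate on it:

```python
import mlflow

# 1. Generate an answer sheet: run the Agent once over the eval set.
results = mlflow.evaluate(
    data=eval_dataset,              # your evaluation set
    model=my_agent,                 # the Agent under test
    model_type="databricks-agent",
)
answer_sheet = results.tables["eval_results"]

# 2-3. Iterate on the metric by calling it directly on the pre-computed rows.
for _, row in answer_sheet.iterrows():
    print(my_metric(request=row["request"], response=row["response"], trace=row["trace"]))

# 4. Re-run the harness on the answer sheet (no model=) to check the UI output.
mlflow.evaluate(
    data=answer_sheet,
    model_type="databricks-agent",
    extra_metrics=[my_metric],
)
```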
Example: Make sure the right keywords are extracted from the prompt in our Keyword RAG
In this example, we will define a simple RAG agent that:
- Extracts salient keywords from the user query. The function that extracts keywords is a span of type PARSER.
- Finds chunks that contain these keywords.
- Passes them to an LLM to generate a response.
We will assert that the keywords extracted are correct, and that the chunks contain the keywords.
While this example is a simple RAG, this approach can be generalized to any Agentic system.
See the hidden code cells below for the definition of the Agent.
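A sketch of the keyword assertion, assuming the extraction step is recorded as a PARSER-type span whose output is the list of keywords, and that the expected keywords are supplied per-row via custom_expected:

```python
from databricks.agents.evals import metric
from mlflow.entities import SpanType


@metric
def correct_keywords_extracted(trace, custom_expected):
    # Locate the keyword-extraction step: a span of type PARSER in the trace.
    parser_spans = [s for s in trace.data.spans if s.span_type == SpanType.PARSER]
    if not parser_spans:
        return False
    extracted = set(parser_spans[0].outputs or [])
    expected = set(custom_expected.get("expected_keywords", []))
    return extracted == expected
```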
Realistic example: custom metrics for a tool-calling Agent
In this example, we will define a simple tool-calling agent that has access to two tools, add and multiply.
We will show two ways to determine whether the tool call choice was "correct".
While this example is simple, this approach can be generalized to any agentic system.
See the hidden code cells below for the definition of the Agent.
Approach 1: Define the expected tool to be chosen
In this approach, we explicitly define the tool name that we expect to be called and verify that the tool is called.
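A sketch of this check, assuming tool invocations are recorded as TOOL-type spans named after the tools they wrap, and that the expected tool name is supplied per-row via custom_expected:

```python
from databricks.agents.evals import metric
from mlflow.entities import SpanType


@metric
def expected_tool_was_called(trace, custom_expected):
    # Names of all tools the agent actually invoked, read from the trace.
    called_tools = {s.name for s in trace.data.spans if s.span_type == SpanType.TOOL}
    return custom_expected.get("expected_tool") in called_tools
```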
Approach 2: Judge whether the tool choice was reasonable
This approach uses the available_tools attribute, which contains the list of tools the agent can call. It then uses the guidelines judge to assess whether the tool-call choice is reasonable given the list of available tools.
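A rough sketch of the idea; here the list of available tools is assumed to be supplied per-row via a hypothetical custom_expected["available_tools"] key rather than read from the tool-call records, and the judges.guideline_adherence signature should be checked against the API reference:

```python
from databricks.agents.evals import judges, metric
from mlflow.entities import SpanType


@metric
def tool_choice_is_reasonable(request, trace, custom_expected):
    # Names of the tools the agent actually invoked.
    called_tools = sorted({s.name for s in trace.data.spans if s.span_type == SpanType.TOOL})
    # List of tools the agent could have called (hypothetical per-row key).
    available_tools = custom_expected.get("available_tools", [])
    return judges.guideline_adherence(
        request=str(request),
        response=f"Tools called: {called_tools}",
        guidelines=[
            "Given the user's request and the available tools "
            f"{available_tools}, the tools called must be a reasonable choice.",
        ],
    )
```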
Realistic example: multi-turn evaluation of a tool-calling agent
In this example, we use the same tool-calling agent as above.
We will show an example of how to extract the message history from the request and create a custom metric over it.
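A sketch of such a metric, assuming the request is a chat-completions style dict whose messages list carries the conversation history; the length bound is an arbitrary example assertion:

```python
from databricks.agents.evals import metric


@metric
def history_is_bounded(request):
    # Pull the conversation history out of a chat-completions style request:
    # every message before the final (current) user turn.
    messages = request.get("messages", []) if isinstance(request, dict) else []
    history = messages[:-1]
    # Example assertion over the history: keep multi-turn conversations bounded.
    return len(history) <= 10
```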