Custom Metrics in Mosaic AI Agent Evaluation

This notebook shows a few different ways to use custom metrics in Mosaic AI Agent Evaluation. For more background on custom metrics, and for the API reference for the @metric decorator, see the Databricks documentation.

We currently support:

  1. Boolean metrics.
  2. Float & integer metrics. These are treated as ordinal values: the UI lets you sort by them and shows averages along any slice.
  3. Pass/fail metrics from callable judges.

There is also a section on best practices for developing metrics.

%pip install -U -qqqq mlflow databricks-agents>=0.20.0 retry databricks-langchain langchain-community langchain langgraph
dbutils.library.restartPython()
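
As a quick orientation before the examples: a custom metric is a Python function decorated with @metric, and the evaluation harness passes arguments to it by name, so you only declare the fields your metric needs. The sketch below is illustrative only and is never run against an eval set here; the function name my_example_metric is made up, but the parameter names are the ones used throughout this notebook (request, response, custom_expected, trace, and so on).

from databricks.agents.evals import metric

# Minimal sketch of the shape of a custom metric. Declare only the named
# arguments your metric needs; the harness fills them in per eval row.
@metric
def my_example_metric(request, response, custom_expected, trace):
  # Return a bool, an int/float, an Assessment, or a list of Assessments.
  return True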

Custom boolean metrics

Simple heuristic: language-model self-reference

This metric simply checks whether the string "LLM" appears in the model response and returns True if it does.

import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question."
  }
]

@metric
def response_mentions_llm(response):
  return "LLM" in response

with mlflow.start_run(run_name="response_mentions_llm"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_mentions_llm],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Pass/Fail metrics & callable judges

Example: Check input requests are properly formatted

This metric checks whether an arbitrary input request is formatted as expected and returns True if it is.

import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": {"messages": [{"role": "user", "content": "Good morning"}]},
  }, {
    "request": {"inputs": ["Good afternoon"]},
  }, {
    "request": {"inputs": [1, 2, 3, 4]},
  }
]

@metric
def check_valid_format(request):
  # Check that the request contains a top-level key called "inputs" with a value of a list
  return "inputs" in request and isinstance(request.get("inputs"), list)


with mlflow.start_run(run_name="check_format"):
  eval_results = mlflow.evaluate(
      data=pd.DataFrame.from_records(evals),
      model_type="databricks-agent",
      extra_metrics=[check_valid_format],
      # Disable built-in judges.
      evaluator_config={
          'databricks-agent': {
              "metrics": [],
          }
      }
  )
eval_results.tables['eval_results']

Ensure the retrieved context has no PII

In this example, we will call the guideline adherence judge to ensure that the retrieved context has no PII.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "retrieved_context": [{
      "content": "The email address is noreply@databricks.com",
    }],
  }, {
    "request": "Good afternoon",
    "response": "This is actually the morning!",
    "retrieved_context": [{
      "content": "fake retrieved context",
    }],
  }
]

@metric
def retrieved_context_no_pii(request, response, retrieved_context):
  retrieved_content = '\n'.join([c['content'] for c in retrieved_context])
  return judges.guideline_adherence(
    request=request,
    guidelines=[
      "The retrieved context must not contain personally identifiable information.",
    ],
    # This feature requires `databricks-agents>=0.20.0`
    guidelines_context={"retrieved_context": retrieved_content},
  )

with mlflow.start_run(run_name="safety"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[retrieved_context_no_pii],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])
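
Because the judges are plain callables, you can also invoke judges.guideline_adherence directly on a single hand-written example, which is a convenient way to iterate on the guideline wording before wiring the judge into a metric. A minimal sketch, reusing the same arguments as the metric above (the variable name single_result is made up):

# Call the judge directly on one example to tune the guideline text.
single_result = judges.guideline_adherence(
  request="Good morning",
  guidelines=[
    "The retrieved context must not contain personally identifiable information.",
  ],
  # This feature requires `databricks-agents>=0.20.0`
  guidelines_context={"retrieved_context": "The email address is noreply@databricks.com"},
)
print(single_result)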

Custom float metric

This example uses Python's built-in difflib to measure the similarity between the response and the expected_response, emitting the ratio as a float.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from difflib import SequenceMatcher

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "expected_response": "Hello and good morning to you!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question.",
    "expected_response": "Good afternoon to you too!"
  }
]

@metric
def response_similarity(response, expected_response):
  s = SequenceMatcher(None, response, expected_response)
  return s.ratio()

with mlflow.start_run(run_name="response_similarity"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_similarity],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])
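
Integer metrics behave the same way as floats and are treated as ordinal values in the UI. As a hypothetical example (the metric name response_word_count is made up), the sketch below emits the word count of each response using the same evals defined above:

@metric
def response_word_count(response):
  # Integer metric: treated as an ordinal value, just like a float.
  return len(response.split())

with mlflow.start_run(run_name="response_word_count"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_word_count],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])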

Use custom_expected to pass extra expected information to custom metrics

In this example, we'll assert that the length of the response is within the (min_length, max_length) bounds that we set per example. We can use custom_expected to store any row-level information that will be passed to custom metrics when creating an assessment.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good night.",
    "custom_expected": {
      "max_length": 100,
      "min_length": 3
    }
  }, {
    "request": "What is the date?",
    "response": "12/19/2024",
    "custom_expected": {
      "min_length": 10,
      "max_length": 20,
    }
  }
]

# Our custom metric will use the "min_length" and "max_length" from the "custom_expected" field.
@metric
def response_len_bounds(
  request,
  response,
  # This is the custom_expected dictionary from your eval dataframe.
  custom_expected
):
  return len(response) <= custom_expected["max_length"] and len(response) >= custom_expected["min_length"]

with mlflow.start_run(run_name="response_len_bounds"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_len_bounds],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Compute multiple assessments with a single metric function

You can also compute multiple assessments with a single metric function, reusing shared computation, by returning a list of Assessment objects.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment

evals = [
  {
    "request": "Good morning",
    "response": "Good night!"
  }, {
    "request": "What is the date?",
    "response": "I dont know"
  },
{
    "request": "What is the date?",
    "response": "What do you mean?"
  }
]

@metric
def punctuation(request, response):
  return [
    Assessment(name='has_exclamation', value="!" in response),
    Assessment(name='has_period', value="." in response),
    Assessment(name='has_question_mark', value="?" in response),
  ]

with mlflow.start_run(run_name="multiple_assessments_single_metric"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[punctuation],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Assertions over traces

Custom metrics can see the entire MLflow trace, so you can write metrics that measure the internals of your application.

Example: request classification & routing

In this example, we will build an agent that simply determines whether the user query is a question or a statement and returns the classification to the user in plain English. In a more realistic scenario, you might use this technique to route queries to different functionality.

Our evaluation set will ensure that the query-type classifier produces the right results for a set of inputs, using custom metrics that inspect the MLflow trace.

import mlflow
import pandas as pd
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace
from mlflow.deployments import get_deploy_client

# This agent is a toy example that classifies the user's request as a question or a statement.
# To classify the request, the agent calls an LLM and then returns the classification in natural language.

deploy_client = get_deploy_client("databricks")
ENDPOINT_NAME="databricks-meta-llama-3-1-70b-instruct"

@mlflow.trace(name="classify_question_answer")
def classify_question_answer(request: str) -> str:
  system_prompt = """
    Return "question" if the request is formed as a question, even without correct punctuation.
    Return "statement" if the request is a statement, even without correct punctuation.
    Return "unknown" otherwise.

    Do not return a preamble, only return a single word.
  """
  request = {
    "messages": [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": request},
    ],
    "temperature": .01,
    "max_tokens": 1000
  }

  result = deploy_client.predict(endpoint=ENDPOINT_NAME, inputs=request)
  return result.choices[0]['message']['content']

@mlflow.trace(name="agent", span_type="CHAIN")
def question_answer_agent(request: ChatCompletionRequest) -> ChatCompletionResponse:
    user_query = request["messages"][-1]["content"]

    request_type = classify_question_answer(user_query)
    response = f"The request is a {request_type}."
  
    return {
        "messages": [
            *request["messages"][:-1], # Keep the chat history.
            {"role": "user", "content": response}
        ]
    }

# We define our evaluation set with a set of requests and the expected request types for those requests.
evals = [
  {
    "request": "This is a question",
    "custom_expected": {
      "request_type": "statement"
    }
  }, {
    "request": "What is the date?",
    "custom_expected": {
      "request_type": "question"
    }
  },
]

# Our custom metric will check the expected request type against the actual request type produced by the Agent trace.
@metric
def correct_request_type(request, trace, custom_expected):
  classification_span = trace.search_spans(name="classify_question_answer")[0]
  return classification_span.outputs == custom_expected['request_type']

with mlflow.start_run(run_name="multiple_assessments_single_metric"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model=question_answer_agent,
        model_type="databricks-agent",
        extra_metrics=[correct_request_type],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])
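
The trace also carries timing information, so the same mechanism can measure non-functional properties. The sketch below (the metric name classification_latency_ms is made up) computes the latency of the classification span as a float metric; it assumes the MLflow Span entity exposes start_time_ns and end_time_ns, as in recent MLflow tracing releases, and could be added to extra_metrics alongside correct_request_type.

@metric
def classification_latency_ms(trace):
  # Float metric: wall-clock time spent inside the classification span, in milliseconds.
  # Assumes Span exposes start_time_ns / end_time_ns.
  span = trace.search_spans(name="classify_question_answer")[0]
  return (span.end_time_ns - span.start_time_ns) / 1e6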

[Agent] Keyword RAG Agent

The rest of the notebook uses the keyword RAG agent defined in the hidden cells below to demonstrate realistic examples of custom metrics. The details of this agent are described below, where it is evaluated.

import pandas as pd

# Read pre-chunked Databricks documentation from the GenAI cookbook.
databricks_docs_url = "https://github.com/databricks/genai-cookbook/raw/refs/heads/main/quick_start_demo/chunked_databricks_docs.snappy.parquet"
CHUNKS = pd.read_parquet(databricks_docs_url)[:500].to_dict('records')

import mlflow
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from mlflow.deployments import get_deploy_client
import dataclasses

SYSTEM_PROMPT= """
  The Agent is a RAG chatbot that answers questions about Databricks. Questions unrelated to Databricks are irrelevant.
"""

PROMPT = """Given the following context
  {context}
  ###############
  Answer the following query to the best of your knowledge:
  {user_query}
"""
CONTEXT_LEN_CHARS = 4096 * 4

def prepend_system_prompt(request: ChatCompletionRequest, system_prompt: str) -> ChatCompletionRequest:
  if isinstance(request, ChatCompletionRequest):
    request = dataclasses.asdict(request)
  if request["messages"][0]["role"] != "system":
    return {
      **request,
      "messages": [
        {"role": "system", "content": system_prompt},
        *request["messages"]
      ]
    }
  return request


ENDPOINT_NAME="databricks-meta-llama-3-1-70b-instruct"
TEMPERATURE=0.01
MAX_TOKENS=1000
deploy_client = get_deploy_client("databricks")
@mlflow.trace(name="chat_completion", span_type="CHAT_MODEL")
def chat_completion(request: ChatCompletionRequest) -> ChatCompletionResponse:
  request = {**request, "temperature": TEMPERATURE, "max_tokens": MAX_TOKENS}
  return deploy_client.predict(endpoint=ENDPOINT_NAME, inputs=request)

@mlflow.trace(name="chain", span_type="CHAIN")
def rag_agent(request: ChatCompletionRequest) -> ChatCompletionResponse:
    request = prepend_system_prompt(request, SYSTEM_PROMPT)
    user_query = request["messages"][-1]["content"]
    keywords = extract_keywords(user_query)
    
    docs = retrieve_documents(keywords)
    context = "\n\n".join([doc["page_content"] for doc in docs])
    agent_query = PROMPT.format(context=context, user_query=user_query)
    return chat_completion({
        **request,
        "messages": [
            *request["messages"][:-1], # Keep the chat history.
            {"role": "user", "content": agent_query}
        ]
    })

@mlflow.trace(span_type="PARSER")
def extract_keywords(query: str) -> list[str]:
    prompt = f"""Given a user query, extract the most salient keywords from the user query. These keywords will be used in a search engine to retrieve relevant documents to the query.
    
    Example query: "What is Databricks Delta Live Tables?
    Example keywords: databricks,delta,live,table

    Query: {query}

    Respond only with the keywords and nothing else.
    """
    model_response = chat_completion({
        "messages": [{"role": "user", "content": prompt}]
    })
    return model_response.choices[0]["message"]["content"].split(",")

@mlflow.trace(span_type="RETRIEVER")
def retrieve_documents(keywords: list[str]) -> list[dict]:
    if len(keywords) == 0:
        return []
    result = []
    for chunk in CHUNKS:
        score = sum(
            (keyword.lower() in chunk["chunked_text"].lower()) for keyword in keywords
        )
        result.append({
            "page_content": chunk["chunked_text"],
            "metadata": {
                "doc_uri": chunk["url"],
                "score": score,
                "chunk_id": chunk["chunk_id"],
            },
        })
    ranked_docs = sorted(result, key=lambda x: x["metadata"]["score"], reverse=True)
    cutoff_docs = []
    context_budget_left = CONTEXT_LEN_CHARS
    for doc in ranked_docs:
        content = doc["page_content"]
        doc_len = len(content)
        if context_budget_left < doc_len:
            cutoff_docs.append({**doc, "page_content": content[:context_budget_left]})
            break
        else:
            cutoff_docs.append(doc)
        context_budget_left -= doc_len
    return cutoff_docs
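
Before using this agent in evaluation, it can be useful to smoke-test its pieces interactively. The calls below are an optional sanity check and hit the model serving endpoint configured above:

# Optional sanity check: exercise the keyword extractor and the full agent on a single query.
print(extract_keywords("What is Databricks Delta Live Tables?"))
print(rag_agent({"messages": [{"role": "user", "content": "What is Databricks?"}]}))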

Developing metrics

When developing metrics, we want to iterate quickly on the metric without calling the harness and executing the agent every time we make a change. To make this simpler, we will use the following strategy:

  1. Generate an answer sheet from our eval dataset & agent. This executes the agent for each entry in our evaluation set, generating responses & traces that we can use to call the metric directly.
  2. Define the metric.
  3. Call the metric for each value in the answer sheet & iterate on the metric definition.
  4. Once the metric is doing what we intend, run mlflow.evaluate() on the same answer sheet to verify that the harness & UI behave as expected. Here we disable the model= field so the pre-computed responses are used.
  5. Re-enable the model= field so that future calls to mlflow.evaluate() invoke the agent interactively.

In the example below, we use the keyword RAG agent defined in the hidden cells above to demonstrate this dev cycle.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace

evals = [
  {
    "request": "What is Databricks?",
    "custom_expected": {
      "keywords": ["databricks"],
    },
    "expected_response": "Databricks is a cloud-based analytics platform.",
    "expected_facts": ["Databricks is a cloud-based analytics platform."],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "When was Databricks founded?",
    "custom_expected": {
      "keywords": ["when", "databricks", "founded"]
    },
    "expected_response": "Databricks was founded in 2012",
    "expected_facts": ["Databricks was founded in 2012"],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "How do I convert a timestamp_ms to a timestamp in dbsql?",
    "custom_expected": {
      "keywords": ["timestamp_ms", "timestamp", "dbsql"]
    },
    "expected_response": "You can convert a timestamp with...",
    "expected_facts": ["You can convert a timestamp with..."],
    "expected_retrieved_context": [{"content": "You can convert a timestamp with...", "doc_uri": "https://databricks.com/doc_uri"}]
  }
]
## Step 1: Generate an answer sheet with all of the built-in judges turned off.
## This will call the agent for all the rows in our evals, which we can use to build our metric.
answer_sheet_df = mlflow.evaluate(
  data=evals,
  model=rag_agent,
  model_type="databricks-agent",
  # Turn off built-in judges so we just build an answer sheet.
  evaluator_config={"databricks-agent": {"metrics": []}
  }
).tables['eval_results']
display(answer_sheet_df)

answer_sheet = answer_sheet_df.to_dict(orient='records')

## Step 2: Define our metric.
@metric
def custom_metric_consistency(
  request,
  response,
  retrieved_context,
  expected_response,
  expected_facts,
  expected_retrieved_context,
  trace,
  # This is the custom_expected dictionary from your eval dataframe.
  custom_expected
):
  print(f"[custom_metric] request: {request}")
  print(f"[custom_metric] response: {response}")
  print(f"[custom_metric] retrieved_context: {retrieved_context}")
  print(f"[custom_metric] expected_response: {expected_response}")
  print(f"[custom_metric] expected_facts: {expected_facts}")
  print(f"[custom_metric] expected_retrieved_context: {expected_retrieved_context}")
  print(f"[custom_metric] trace: {trace}")

  return True

## Step 3: Call the metric directly before using the eval harness to iterate on the metric definition.
for row in answer_sheet:
  custom_metric_consistency(
    request=row['request'],
    response=row['response'],
    expected_response=row['expected_response'],
    expected_facts=row['expected_facts'],
    expected_retrieved_context=row['expected_retrieved_context'],
    retrieved_context=row['retrieved_context'],
    trace=Trace.from_json(row['trace']),
    custom_expected=row['custom_expected']
  )

## Step 4: Once we are confident in the signature of our metric, we can run the harness with the answer sheet to trigger the output validation & make sure the UI reflects what we intended.
with mlflow.start_run(run_name="exact_expected_response"):
    eval_results = mlflow.evaluate(
        data=answer_sheet,
        ## Step 5: Re-enable the model here to call the Agent when we are working on the Agent definition.
        # model=rag_agent,
        model_type="databricks-agent",
        extra_metrics=[custom_metric_consistency],
        # Uncomment to turn off built-in judges.
        # evaluator_config={
        #     'databricks-agent': {
        #         "metrics": [],
        #     }
        # }
    )
    display(eval_results.tables['eval_results'])

Example: Make sure the right keywords are extracted from the prompt in our Keyword RAG

In this example, we will define a simple RAG agent that:

  • Extracts salient keywords from the user query. The function that extracts keywords is a span of type PARSER.
  • Finds chunks that contain these keywords.
  • Passes them to an LLM to generate a response.

We will assert that the keywords extracted are correct, and that the chunks contain the keywords.

While this example is a simple RAG pipeline, this approach can be generalized to any agentic system.

See the hidden code cells below for the definition of the Agent.

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment

# NOTE: We are passing the rag_agent to the evaluate() harness to generate responses & traces.
evals = [
  {
    "request": "What is Databricks?",
    "custom_expected": {
      "keywords": ["databricks"],
    }
  }, {
    "request": "When was Databricks founded?",
    "custom_expected": {
      "keywords": ["when", "databricks", "founded"]
    }
  }, {
    "request": "How do I convert a timestamp_ms to a timestamp in dbsql?",
    "custom_expected": {
      "keywords": ["timestamp_ms", "timestamp", "dbsql"]
    }
  }
]

# This metric computes 3 assessments:
# a) whether there are extra keywords
# b) whether there are missing keywords
# c) whether the keywords are exactly correct
@metric
def keywords_correct(
  request,
  response,
  trace,
  # This is the custom_expected dictionary from your eval dataframe.
  custom_expected
):
  # Find the "PARSER" span outputs to get the keywords in the trace.
  parser_spans = trace.search_spans(span_type="PARSER")
  keywords = parser_spans[0].outputs

  # Find extra keywords and missing keywords.
  extra_keywords = []
  missing_keywords = []
  for keyword in custom_expected["keywords"]:
    if keyword not in keywords:
      missing_keywords.append(keyword)
  for keyword in keywords:
    if keyword not in custom_expected["keywords"]:
      extra_keywords.append(keyword)

  extra_keywords_rationale = f"Extra keywords in trace: `{', '.join(extra_keywords)}`" if extra_keywords else None
  missing_keywords_rationale = f"Missing keywords in trace: `{' '.join(missing_keywords)}`" if missing_keywords else None
  
  keywords_incorrect = bool(len(missing_keywords) or len(extra_keywords))
  keywords_incorrect_rationale = f"{extra_keywords_rationale or ''}\n{missing_keywords_rationale or ''}" if keywords_incorrect else None
  
  return [
    Assessment(name='has_extra_keywords', value=len(extra_keywords) > 0, rationale=extra_keywords_rationale),
    Assessment(name='has_missing_keywords', value=len(missing_keywords) > 0, rationale=missing_keywords_rationale),
    Assessment(name='keywords_incorrect', value=keywords_incorrect, rationale=keywords_incorrect_rationale)
  ]

with mlflow.start_run(run_name="keyword_agent"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model=rag_agent,
        model_type="databricks-agent",
        extra_metrics=[keywords_correct],
        # Uncomment to turn off built-in judges.
        # evaluator_config={
        #     'databricks-agent': {
        #         "metrics": [],
        #     }
        # }
    )
    display(eval_results.tables['eval_results'])

Realistic example: custom metrics for a tool-calling Agent

In this example, we will define a simple tool-calling agent that has access to two tools, add and multiply.

We will show two ways to determine whether the tool call choice was "correct".

While this example is a simple tool-calling agent, this approach can be generalized to any agentic system.

See the hidden code cells below for the definition of the Agent.

from typing import Annotated, List
from typing_extensions import TypedDict

import mlflow
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

from langchain_core import tools
from databricks_langchain import ChatDatabricks

mlflow.langchain.autolog()

# Step 1: Define our tools
@tools.tool
def add(a: int, b: int) -> int:
  """Add two numbers."""
  return a + b

@tools.tool
def multiply(a: int, b: int) -> int:
  """Multiply two numbers."""
  return a * b

# Step 2: Define our agent. We do this in LangGraph because it provides handy abstractions to invoke (and re-invoke) tools.
class State(TypedDict):
  messages: Annotated[list, add_messages]

def create_tool_calling_agent(
  tools: List[tools.structured.StructuredTool] = [add, multiply],
  endpoint: str = "databricks-meta-llama-3-3-70b-instruct"
):
  # Define the LLM and bind it with the tools
  llm = ChatDatabricks(endpoint=endpoint)
  llm_with_tools = llm.bind_tools(tools)

  # Build the agent graph
  graph_builder = StateGraph(State)
  graph_builder.add_node(
      "chatbot",
      lambda state: {"messages": [llm_with_tools.invoke(state["messages"])]}
  )

  graph_builder.add_node("tools", ToolNode(tools=tools))
  graph_builder.add_conditional_edges(
    "chatbot",
    tools_condition,
  )
  # Any time a tool is called, we return to the chatbot to decide the next step
  graph_builder.add_edge("tools", "chatbot")
  graph_builder.add_edge(START, "chatbot")
  return graph_builder.compile()

agent = create_tool_calling_agent()

# Step 3: Create a helper to call our agent from `mlflow.evaluate`
def tool_calling_agent(model_input):
  response = agent.invoke(model_input)["messages"][-1].to_json()
  return response["kwargs"]["content"]

Approach 1: Define the expected tool to be chosen

In this approach, we explicitly define the tool name that we expect to be called and verify that the tool is called.

import mlflow
import pandas as pd
from databricks.agents.evals import metric

eval_data = pd.DataFrame(
  [
    {
      "request": "what is 3 * 12?",
      "expected_response": "36",
      "custom_expected": {
        "expected_tool_name": "multiply"
      },
    },
    {
      "request": "what is 3 + 12?",
      "expected_response": "15",
      "custom_expected": {
        "expected_tool_name": "add"
      },
    },
  ]
)

"""
The `tool_calls` argument is a `List[ToolCallInvocation]`:

@dataclasses.dataclass
class ToolCallInvocation:
    tool_name: str
    tool_call_args: Dict[str, Any]
    tool_call_id: Optional[str] = None
    tool_call_result: Optional[Dict[str, Any]] = None

    # Only available from the trace
    raw_span: Optional[mlflow_entities.Span] = None
    available_tools: Optional[List[Dict[str, Any]]] = None
"""

@metric
def is_correct_tool(tool_calls, custom_expected):
  # Metric to check whether the first tool call is the expected tool
  return tool_calls[0].tool_name == custom_expected["expected_tool_name"]

results = mlflow.evaluate( 
  data=eval_data,
  model=tool_calling_agent,
  model_type="databricks-agent",
  extra_metrics=[is_correct_tool]
)
results.tables["eval_results"].display()

Approach 2: Judge whether the tool choice was reasonable

This approach uses the available_tools attribute, which contains the list of tools that could have been called. It then uses the guideline adherence judge to assess whether the tool-call choice is reasonable given the list of available tools.

from databricks.agents.evals import judges

@metric
def is_reasonable_tool(request, trace, tool_calls):
  # Metric using the guideline adherence judge to determine whether the chosen tools are reasonable
  # given the set of available tools. Note that `guidelines_context` requires `databricks-agents >= 0.20.0`
  
  return judges.guideline_adherence(
    request=request["messages"][0]["content"],
    guidelines=[
      "The selected tool must be a reasonable tool call with respect to the request and available tools.",
    ],
    guidelines_context={
      "available_tools": str(tool_calls[0].available_tools),
      "chosen_tools": str([tool_call.tool_name for tool_call in tool_calls]),
    },
  )

results = mlflow.evaluate( 
  data=eval_data,
  model=tool_calling_agent,
  model_type="databricks-agent",
  extra_metrics=[is_reasonable_tool]
)
results.tables["eval_results"].display()

Realistic example: multi-turn evaluation of a tool-calling agent

In this example, we use the same tool-calling agent as above.

We will show an example of how to extract the message history from the request and create a custom metric over it.

from typing import Any, Dict, List, Tuple
import mlflow

_MESSAGES = "messages"


def extract_message_history(request: Dict[str, Any]) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """
    Extract the message history from a request (i.e., all messages except the last). The following
    input formats are accepted (in a dictionary representation): ChatCompletionRequest, ChatModel,
    and ChatAgent.

    :param request: The request formatted as a dictionary
    :return: List of messages and the last message
    """
    if not isinstance(request, Dict):
        raise ValueError(f"Expected a dictionary, got {type(request)}")
    if _MESSAGES in request:
        return request[_MESSAGES][:-1], request[_MESSAGES][-1]
    raise ValueError(f"Invalid input: {request}")
# Dummy code that appends a follow-up user message to each evaluation from the previous run, for demonstration purposes.
import json

def create_multiturn_request(trace):
  trace_obj = mlflow.entities.Trace.from_json(trace)
  response = json.loads(trace_obj.data.response)
  response["messages"].append({"role": "user", "content": "Now multiply the result by 10"})
  return response

multiturn_eval_df = results.tables["eval_results"][["trace"]].assign(
  request=lambda df: df["trace"].apply(create_multiturn_request)
)[["request"]]
@metric
def num_previous_messages(request):
  # Metric to compute the number of messages in the message history
  previous_messages, _ = extract_message_history(request)
  return len(previous_messages)


results = mlflow.evaluate( 
  data=multiturn_eval_df,
  model=tool_calling_agent,
  model_type="databricks-agent",
  extra_metrics=[num_previous_messages]
)
results.tables["eval_results"].display()