%md # Custom Metrics in Mosaic AI Agent Evaluation

This notebook shows a few different ways to use custom metrics in Mosaic AI Agent Evaluation. For more information on custom metrics, see [this guide](https://docs.databricks.com/en/generative-ai/agent-evaluation/custom-metrics.html). The API reference for the `@metric` decorator can be found [here](https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#databricks.agents.evals.metric).

We currently support:
1. Boolean metrics.
2. Float & integer metrics. These are treated as ordinal values: the UI lets you sort by them and shows averages along any slice.
3. Pass/Fail metrics from callable judges.

There is also a section on best practices for building metrics.
%pip install -U -qqqq mlflow databricks-agents>=0.20.0 retry databricks-langchain langchain-community langchain langgraph

dbutils.library.restartPython()
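%md Before diving into the worked examples, here is a minimal sketch (not tied to any of the eval sets below) of what each supported return type looks like from a `@metric` function. The metric names and the 100-character threshold are made up for illustration.

from databricks.agents.evals import metric
from mlflow.evaluation import Assessment

@metric
def is_short(response):
    # Boolean metric: shown as True/False for each row in the UI.
    return len(response) < 100

@metric
def num_words(response):
    # Integer metric: treated as an ordinal value, so the UI can sort and average it.
    return len(response.split())

@metric
def length_checks(response):
    # Returning a list of Assessments lets one function emit several named values.
    return [
        Assessment(name="is_short", value=len(response) < 100),
        Assessment(name="num_words", value=len(response.split())),
    ]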
%md ## Custom boolean metrics
%md ### Simple heuristic: language-model self-reference

This metric checks whether the model response mentions "LLM". If it does, the metric returns `True`.
import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
    {
        "request": "Good morning",
        "response": "Good morning to you too!"
    },
    {
        "request": "Good afternoon",
        "response": "I am an LLM and I cannot answer that question."
    }
]

@metric
def response_mentions_llm(response):
    return "LLM" in response

with mlflow.start_run(run_name="response_mentions_llm"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_mentions_llm],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )

display(eval_results.tables['eval_results'])
%md ## Pass/Fail metrics & callable judges
%md ### Example: Check input requests are properly formatted

This metric checks whether the input request is formatted as expected and returns `True` if it is.
import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
    {
        "request": {"messages": [{"role": "user", "content": "Good morning"}]},
    },
    {
        "request": {"inputs": ["Good afternoon"]},
    },
    {
        "request": {"inputs": [1, 2, 3, 4]},
    }
]

@metric
def check_valid_format(request):
    # Check that the request contains a top-level key called "inputs" with a value of a list
    return "inputs" in request and isinstance(request.get("inputs"), list)

with mlflow.start_run(run_name="check_format"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[check_valid_format],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )

eval_results.tables['eval_results']
%md ### Ensure the retrieved context has no PII

In this example, we call the guideline adherence judge to ensure that the retrieved context has no PII.
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
    {
        "request": "Good morning",
        "response": "Good morning to you too!",
        "retrieved_context": [{
            "content": "The email address is noreply@databricks.com",
        }],
    },
    {
        "request": "Good afternoon",
        "response": "This is actually the morning!",
        "retrieved_context": [{
            "content": "fake retrieved context",
        }],
    }
]

@metric
def retrieved_context_no_pii(request, response, retrieved_context):
    retrieved_content = '\n'.join([c['content'] for c in retrieved_context])
    return judges.guideline_adherence(
        request=request,
        guidelines=[
            "The retrieved context must not contain personally identifiable information.",
        ],
        # This feature requires `databricks-agents>=0.20.0`
        guidelines_context={"retrieved_context": retrieved_content},
    )

with mlflow.start_run(run_name="safety"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[retrieved_context_no_pii],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )

display(eval_results.tables['eval_results'])
%md ## Custom float metric

This example uses the built-in `difflib` to measure the similarity between the `response` and the `expected_response` and emits it as a float.
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from difflib import SequenceMatcher

evals = [
    {
        "request": "Good morning",
        "response": "Good morning to you too!",
        "expected_response": "Hello and good morning to you!"
    },
    {
        "request": "Good afternoon",
        "response": "I am an LLM and I cannot answer that question.",
        "expected_response": "Good afternoon to you too!"
    }
]

@metric
def response_similarity(response, expected_response):
    s = SequenceMatcher(None, response, expected_response)
    return s.ratio()

with mlflow.start_run(run_name="response_similarity"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_similarity],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )

display(eval_results.tables['eval_results'])
%md ## Use custom_expected to pass extra expected information to custom metrics

In this example, we'll assert that the length of the response is within (min_length, max_length) bounds that we set per example. We can use `custom_expected` to store any row-level information that will be passed to custom metrics when creating an assessment.
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
    {
        "request": "Good morning",
        "response": "Good night.",
        "custom_expected": {
            "max_length": 100,
            "min_length": 3
        }
    },
    {
        "request": "What is the date?",
        "response": "12/19/2024",
        "custom_expected": {
            "min_length": 10,
            "max_length": 20,
        }
    }
]

# Our custom metric will use the "min_length" and "max_length" from the "custom_expected" field.
@metric
def response_len_bounds(
    request,
    response,
    # This is the custom_expected dictionary from your eval dataframe.
    custom_expected
):
    return len(response) <= custom_expected["max_length"] and len(response) >= custom_expected["min_length"]

with mlflow.start_run(run_name="response_len_bounds"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_len_bounds],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )

display(eval_results.tables['eval_results'])
%md ## Compute multiple assessments with a single metric function

You can compute multiple assessments with a single metric function, and re-use computation across them, by returning a list of `Assessment` objects.
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment

evals = [
    {
        "request": "Good morning",
        "response": "Good night!"
    },
    {
        "request": "What is the date?",
        "response": "I dont know"
    },
    {
        "request": "What is the date?",
        "response": "What do you mean?"
    }
]

@metric
def punctuation(request, response):
    return [
        Assessment(name='has_exclamation', value="!" in response),
        Assessment(name='has_period', value="." in response),
        Assessment(name='has_question_mark', value="?" in response),
    ]

with mlflow.start_run(run_name="multiple_assessments_single_metric"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[punctuation],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )

display(eval_results.tables['eval_results'])
%md ## Assertions over traces

Custom metrics can see the entire MLflow Trace, so you can write metrics that measure the internals of your application.
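%md Before the request-classification example below, here is a minimal sketch of a trace-based metric that is not tied to any particular agent. It counts the LLM-call spans and checks end-to-end latency; the `"CHAT_MODEL"` span type and the 10-second budget are illustrative assumptions about how your application is traced.

from databricks.agents.evals import metric
from mlflow.evaluation import Assessment

@metric
def trace_internals(trace):
    # Count spans that represent LLM calls (assumes your app tags them with span_type="CHAT_MODEL").
    llm_spans = trace.search_spans(span_type="CHAT_MODEL")

    # Assume the first span in the trace is the root span covering the whole request.
    # Span timestamps are in nanoseconds.
    root_span = trace.data.spans[0]
    latency_seconds = (root_span.end_time_ns - root_span.start_time_ns) / 1e9

    return [
        Assessment(name="num_llm_calls", value=len(llm_spans)),
        Assessment(name="under_10_seconds", value=latency_seconds < 10.0),
    ]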
%md ## Example: request classification & routing

In this example, we build an agent that determines whether the user query is a question or a statement and reports the classification back to the user in plain English. In a more realistic scenario, you might use this technique to route queries to different functionality.

Our evaluation set ensures that the query-type classifier produces the right results for a set of inputs by using custom metrics that inspect the MLflow trace.
import mlflow
import pandas as pd
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace
from mlflow.deployments import get_deploy_client

# This toy agent classifies whether the user's request is a question or a statement
# and reports the classification back in natural language.
deploy_client = get_deploy_client("databricks")

ENDPOINT_NAME = "databricks-meta-llama-3-1-70b-instruct"

@mlflow.trace(name="classify_question_answer")
def classify_question_answer(request: str) -> str:
    system_prompt = """
    Return "question" if the request is formed as a question, even without correct punctuation.
    Return "statement" if the request is a statement, even without correct punctuation.
    Return "unknown" otherwise. Do not return a preamble, only return a single word.
    """
    request = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": request},
        ],
        "temperature": .01,
        "max_tokens": 1000
    }
    result = deploy_client.predict(endpoint=ENDPOINT_NAME, inputs=request)
    return result.choices[0]['message']['content']

@mlflow.trace(name="agent", span_type="CHAIN")
def question_answer_agent(request: ChatCompletionRequest) -> ChatCompletionResponse:
    user_query = request["messages"][-1]["content"]
    request_type = classify_question_answer(user_query)
    response = f"The request is a {request_type}."

    return {
        "messages": [
            *request["messages"][:-1],  # Keep the chat history.
            {"role": "user", "content": response}
        ]
    }

# We define our evaluation set with a set of requests and the expected request types for those requests.
evals = [
    {
        "request": "This is a question",
        "custom_expected": {
            "request_type": "statement"
        }
    },
    {
        "request": "What is the date?",
        "custom_expected": {
            "request_type": "question"
        }
    },
]

# Our custom metric checks the expected request type against the actual request type produced by the agent trace.
@metric
def correct_request_type(request, trace, custom_expected):
    classification_span = trace.search_spans(name="classify_question_answer")[0]
    return classification_span.outputs == custom_expected['request_type']

with mlflow.start_run(run_name="request_classification"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model=question_answer_agent,
        model_type="databricks-agent",
        extra_metrics=[correct_request_type],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )

display(eval_results.tables['eval_results'])
%md ## [Agent] Keyword RAG Agent

The rest of this notebook uses the keyword RAG agent defined in the hidden cells below to demonstrate a realistic example of using custom metrics. The details of this agent are described later, where we evaluate it.
import pandas as pd

# Read chunks from the cookbook's pre-chunked Databricks docs.
databricks_docs_url = "https://github.com/databricks/genai-cookbook/raw/refs/heads/main/quick_start_demo/chunked_databricks_docs.snappy.parquet"
CHUNKS = pd.read_parquet(databricks_docs_url)[:500].to_dict('records')
import mlflow
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from mlflow.deployments import get_deploy_client
import dataclasses

SYSTEM_PROMPT = """
The Agent is a RAG chatbot that answers questions about Databricks. Questions unrelated to Databricks are irrelevant.
"""

PROMPT = """Given the following context
{context}
###############
Answer the following query to the best of your knowledge:
{user_query}
"""

CONTEXT_LEN_CHARS = 4096 * 4

def prepend_system_prompt(request: ChatCompletionRequest, system_prompt: str) -> ChatCompletionRequest:
    if isinstance(request, ChatCompletionRequest):
        request = dataclasses.asdict(request)
    if request["messages"][0]["role"] != "system":
        return {
            **request,
            "messages": [
                {"role": "system", "content": system_prompt},
                *request["messages"]
            ]
        }
    return request

ENDPOINT_NAME = "databricks-meta-llama-3-1-70b-instruct"
TEMPERATURE = 0.01
MAX_TOKENS = 1000

deploy_client = get_deploy_client("databricks")

@mlflow.trace(name="chat_completion", span_type="CHAT_MODEL")
def chat_completion(request: ChatCompletionRequest) -> ChatCompletionResponse:
    request = {**request, "temperature": TEMPERATURE, "max_tokens": MAX_TOKENS}
    return deploy_client.predict(endpoint=ENDPOINT_NAME, inputs=request)

@mlflow.trace(name="chain", span_type="CHAIN")
def rag_agent(request: ChatCompletionRequest) -> ChatCompletionResponse:
    request = prepend_system_prompt(request, SYSTEM_PROMPT)
    user_query = request["messages"][-1]["content"]
    keywords = extract_keywords(user_query)
    docs = retrieve_documents(keywords)
    context = "\n\n".join([doc["page_content"] for doc in docs])
    agent_query = PROMPT.format(context=context, user_query=user_query)
    return chat_completion({
        **request,
        "messages": [
            *request["messages"][:-1],  # Keep the chat history.
            {"role": "user", "content": agent_query}
        ]
    })

@mlflow.trace(span_type="PARSER")
def extract_keywords(query: str) -> list[str]:
    prompt = f"""Given a user query, extract the most salient keywords from the user query.
These keywords will be used in a search engine to retrieve relevant documents to the query.

Example query: "What is Databricks Delta Live Tables?"
Example keywords: databricks,delta,live,table

Query: {query}

Respond only with the keywords and nothing else.
"""
    model_response = chat_completion({
        "messages": [{"role": "user", "content": prompt}]
    })
    return model_response.choices[0]["message"]["content"].split(",")

@mlflow.trace(span_type="RETRIEVER")
def retrieve_documents(keywords: list[str]) -> list[dict]:
    if len(keywords) == 0:
        return []
    result = []
    for chunk in CHUNKS:
        score = sum(
            (keyword.lower() in chunk["chunked_text"].lower())
            for keyword in keywords
        )
        result.append({
            "page_content": chunk["chunked_text"],
            "metadata": {
                "doc_uri": chunk["url"],
                "score": score,
                "chunk_id": chunk["chunk_id"],
            },
        })
    ranked_docs = sorted(result, key=lambda x: x["metadata"]["score"], reverse=True)
    cutoff_docs = []
    context_budget_left = CONTEXT_LEN_CHARS
    for doc in ranked_docs:
        content = doc["page_content"]
        doc_len = len(content)
        if context_budget_left < doc_len:
            cutoff_docs.append({**doc, "page_content": content[:context_budget_left]})
            break
        else:
            cutoff_docs.append(doc)
        context_budget_left -= doc_len
    return cutoff_docs
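%md You can smoke-test the agent directly before wiring it into the evaluation harness. This is a quick manual check (not part of the original walkthrough) and assumes the model-serving endpoint configured above is available in your workspace.

# Quick manual check of the keyword RAG agent on a single request.
sample_response = rag_agent({"messages": [{"role": "user", "content": "What is Databricks?"}]})
print(sample_response.choices[0]["message"]["content"])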
%md ## Developing metrics

When developing metrics, we want to iterate quickly on the metric without having to run the harness and execute the agent every time we make a change. To make this simpler, we use the following strategy:

1. Generate an answer sheet from our eval dataset & agent. This executes the agent for each entry in our evaluation set, generating responses & traces that we can use to call the metric directly.
2. Define the metric.
3. Call the metric for each value in the answer sheet & iterate on the metric definition.
4. Once the metric is doing what we intend, run `mlflow.evaluate()` on the same answer sheet to verify that the harness & UI behave as expected. Here we disable the `model=` field so we use pre-computed responses.
5. Re-enable the `model=` field so we call the agent interactively for future calls to `mlflow.evaluate()`.

In the example below, we use the keyword RAG agent defined in the hidden cells above to demonstrate this dev cycle.
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace

evals = [
    {
        "request": "What is Databricks?",
        "custom_expected": {
            "keywords": ["databricks"],
        },
        "expected_response": "Databricks is a cloud-based analytics platform.",
        "expected_facts": ["Databricks is a cloud-based analytics platform."],
        "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
    },
    {
        "request": "When was Databricks founded?",
        "custom_expected": {
            "keywords": ["when", "databricks", "founded"]
        },
        "expected_response": "Databricks was founded in 2012",
        "expected_facts": ["Databricks was founded in 2012"],
        "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
    },
    {
        "request": "How do I convert a timestamp_ms to a timestamp in dbsql?",
        "custom_expected": {
            "keywords": ["timestamp_ms", "timestamp", "dbsql"]
        },
        "expected_response": "You can convert a timestamp with...",
        "expected_facts": ["You can convert a timestamp with..."],
        "expected_retrieved_context": [{"content": "You can convert a timestamp with...", "doc_uri": "https://databricks.com/doc_uri"}]
    }
]

## Step 1: Generate an answer sheet with all of the built-in judges turned off.
## This will call the agent for all the rows in our evals, which we can use to build our metric.
answer_sheet_df = mlflow.evaluate(
    data=evals,
    model=rag_agent,
    model_type="databricks-agent",
    # Turn off built-in judges so we just build an answer sheet.
    evaluator_config={"databricks-agent": {"metrics": []}}
).tables['eval_results']

display(answer_sheet_df)

answer_sheet = answer_sheet_df.to_dict(orient='records')

## Step 2: Define our metric.
@metric
def custom_metric_consistency(
    request,
    response,
    retrieved_context,
    expected_response,
    expected_facts,
    expected_retrieved_context,
    trace,
    # This is the custom_expected dictionary from your eval dataframe.
    custom_expected
):
    print(f"[custom_metric] request: {request}")
    print(f"[custom_metric] response: {response}")
    print(f"[custom_metric] retrieved_context: {retrieved_context}")
    print(f"[custom_metric] expected_response: {expected_response}")
    print(f"[custom_metric] expected_facts: {expected_facts}")
    print(f"[custom_metric] expected_retrieved_context: {expected_retrieved_context}")
    print(f"[custom_metric] trace: {trace}")

    return True

## Step 3: Call the metric directly before using the eval harness to iterate on the metric definition.
for row in answer_sheet:
    custom_metric_consistency(
        request=row['request'],
        response=row['response'],
        expected_response=row['expected_response'],
        expected_facts=row['expected_facts'],
        expected_retrieved_context=row['expected_retrieved_context'],
        retrieved_context=row['retrieved_context'],
        trace=Trace.from_json(row['trace']),
        custom_expected=row['custom_expected']
    )

## Step 4: Once we are confident in the signature of our metric, we can run the harness with the
## answer sheet to trigger the output validation & make sure the UI reflects what we intended.
with mlflow.start_run(run_name="exact_expected_response"):
    eval_results = mlflow.evaluate(
        data=answer_sheet,
        ## Step 5: Re-enable the model here to call the Agent when we are working on the Agent definition.
        # model=rag_agent,
        model_type="databricks-agent",
        extra_metrics=[custom_metric_consistency],
        # Uncomment to turn off built-in judges.
        # evaluator_config={
        #     'databricks-agent': {
        #         "metrics": [],
        #     }
        # }
    )

display(eval_results.tables['eval_results'])
%md ## Example: Make sure the right keywords are extracted from the prompt in our keyword RAG

In this example, we use a simple RAG agent that:
- Extracts salient keywords from the user query. The function that extracts keywords is a span of type `PARSER`.
- Finds chunks that contain these keywords.
- Passes them to an LLM to generate a response.

We will assert that the keywords extracted are correct, and that the chunks contain the keywords.

While this example is a simple RAG, this approach can be generalized to any agentic system.

See the hidden code cells above for the definition of the agent.
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment

# NOTE: We are passing the rag_agent to the evaluate() harness to generate responses & traces.
evals = [
    {
        "request": "What is Databricks?",
        "custom_expected": {
            "keywords": ["databricks"],
        }
    },
    {
        "request": "When was Databricks founded?",
        "custom_expected": {
            "keywords": ["when", "databricks", "founded"]
        }
    },
    {
        "request": "How do I convert a timestamp_ms to a timestamp in dbsql?",
        "custom_expected": {
            "keywords": ["timestamp_ms", "timestamp", "dbsql"]
        }
    }
]

# This metric will compute 3 assessments for
# a) whether there are extra keywords
# b) whether there are missing keywords
# c) whether the keywords are exactly correct.
@metric
def keywords_correct(
    request,
    response,
    trace,
    # This is the custom_expected dictionary from your eval dataframe.
    custom_expected
):
    # Find the "PARSER" span outputs to get the keywords in the trace.
    parser_spans = trace.search_spans(span_type="PARSER")
    keywords = parser_spans[0].outputs

    # Find extra keywords and missing keywords.
    extra_keywords = []
    missing_keywords = []
    for keyword in custom_expected["keywords"]:
        if keyword not in keywords:
            missing_keywords.append(keyword)
    for keyword in keywords:
        if keyword not in custom_expected["keywords"]:
            extra_keywords.append(keyword)

    extra_keywords_rationale = f"Extra keywords in trace: `{', '.join(extra_keywords)}`" if extra_keywords else None
    missing_keywords_rationale = f"Missing keywords in trace: `{' '.join(missing_keywords)}`" if missing_keywords else None
    keywords_incorrect = bool(len(missing_keywords) or len(extra_keywords))
    keywords_incorrect_rationale = f"{extra_keywords_rationale or ''}\n{missing_keywords_rationale or ''}" if keywords_incorrect else None

    return [
        Assessment(name='has_extra_keywords', value=len(extra_keywords) > 0, rationale=extra_keywords_rationale),
        Assessment(name='has_missing_keywords', value=len(missing_keywords) > 0, rationale=missing_keywords_rationale),
        Assessment(name='keywords_incorrect', value=keywords_incorrect, rationale=keywords_incorrect_rationale)
    ]

with mlflow.start_run(run_name="keyword_agent"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model=rag_agent,
        model_type="databricks-agent",
        extra_metrics=[keywords_correct],
        # Uncomment to turn off built-in judges.
        # evaluator_config={
        #     'databricks-agent': {
        #         "metrics": [],
        #     }
        # }
    )

display(eval_results.tables['eval_results'])
%md ## Realistic example: custom metrics for a tool-calling agent

In this example, we define a simple tool-calling agent that has access to two tools, `add` and `multiply`. We will show two ways to determine whether the tool call choice was "correct".

While this example is simple, the approach can be generalized to any agentic system.

See the hidden code cells below for the definition of the agent.
from typing import Annotated, List
from typing_extensions import TypedDict

import mlflow
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core import tools
from databricks_langchain import ChatDatabricks

mlflow.langchain.autolog()

# Step 1: Define our tools
@tools.tool
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

@tools.tool
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

# Step 2: Define our agent. We do this in LangGraph because it provides handy abstractions
# to invoke (and re-invoke) tools.
class State(TypedDict):
    messages: Annotated[list, add_messages]

def create_tool_calling_agent(
    tools: List[tools.structured.StructuredTool] = [add, multiply],
    endpoint: str = "databricks-meta-llama-3-3-70b-instruct"
):
    # Define the LLM and bind it with the tools
    llm = ChatDatabricks(endpoint=endpoint)
    llm_with_tools = llm.bind_tools(tools)

    # Build the agent graph
    graph_builder = StateGraph(State)
    graph_builder.add_node(
        "chatbot",
        lambda state: {"messages": [llm_with_tools.invoke(state["messages"])]}
    )
    graph_builder.add_node("tools", ToolNode(tools=tools))
    graph_builder.add_conditional_edges(
        "chatbot",
        tools_condition,
    )
    # Any time a tool is called, we return to the chatbot to decide the next step
    graph_builder.add_edge("tools", "chatbot")
    graph_builder.add_edge(START, "chatbot")
    return graph_builder.compile()

agent = create_tool_calling_agent()

# Step 3: Create a helper to call our agent from `mlflow.evaluate`
def tool_calling_agent(model_input):
    response = agent.invoke(model_input)["messages"][-1].to_json()
    return response["kwargs"]["content"]
%md ### Approach 1: Define the expected tool to be chosen

In this approach, we explicitly define the name of the tool that we expect to be called and verify that it is called.
import mlflow
import pandas as pd
from databricks.agents.evals import metric

eval_data = pd.DataFrame(
    [
        {
            "request": "what is 3 * 12?",
            "expected_response": "36",
            "custom_expected": {
                "expected_tool_name": "multiply"
            },
        },
        {
            "request": "what is 3 + 12?",
            "expected_response": "15",
            "custom_expected": {
                "expected_tool_name": "add"
            },
        },
    ]
)

"""
`tool_calls` returns a `List[ToolCallInvocation]`:

@dataclasses.dataclass
class ToolCallInvocation:
    tool_name: str
    tool_call_args: Dict[str, Any]
    tool_call_id: Optional[str] = None
    tool_call_result: Optional[Dict[str, Any]] = None

    # Only available from the trace
    raw_span: Optional[mlflow_entities.Span] = None
    available_tools: Optional[List[Dict[str, Any]]] = None
"""

@metric
def is_correct_tool(tool_calls, custom_expected):
    # Metric to check whether the first tool call is the expected tool
    return tool_calls[0].tool_name == custom_expected["expected_tool_name"]

results = mlflow.evaluate(
    data=eval_data,
    model=tool_calling_agent,
    model_type="databricks-agent",
    extra_metrics=[is_correct_tool]
)

results.tables["eval_results"].display()
%md ### Approach 2: Judge whether the tool choice was reasonable

This approach uses the `available_tools` attribute, which contains the list of tools that can be called. It then uses the guidelines judge to assess whether the tool call choice is reasonable given the list of available tools.
from databricks.agents.evals import judges

@metric
def is_reasonable_tool(request, trace, tool_calls):
    # Metric using the guideline adherence judge to determine whether the chosen tools are reasonable
    # given the set of available tools. Note that `guidelines_context` requires `databricks-agents >= 0.20.0`.
    return judges.guideline_adherence(
        request=request["messages"][0]["content"],
        guidelines=[
            "The selected tool must be a reasonable tool call with respect to the request and available tools.",
        ],
        guidelines_context={
            "available_tools": str(tool_calls[0].available_tools),
            "chosen_tools": str([tool_call.tool_name for tool_call in tool_calls]),
        },
    )

results = mlflow.evaluate(
    data=eval_data,
    model=tool_calling_agent,
    model_type="databricks-agent",
    extra_metrics=[is_reasonable_tool]
)

results.tables["eval_results"].display()
%md ## Realistic example: multi-turn evaluation of a tool-calling agent

In this example, we use the same tool-calling agent as above. We show how to extract the message history from the request and create a custom metric over it.
from typing import Any, Dict, List, Tuple

import mlflow

_MESSAGES = "messages"

def extract_message_history(request: Dict[str, Any]) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """
    Extract the message history from a request (i.e., all messages except the last).

    The following input formats are accepted (in a dictionary representation):
    ChatCompletionRequest, ChatModel, and ChatAgent.

    :param request: The request formatted as a dictionary
    :return: List of messages and the last message
    """
    if not isinstance(request, Dict):
        raise ValueError(f"Expected a dictionary, got {type(request)}")
    if _MESSAGES in request:
        return request[_MESSAGES][:-1], request[_MESSAGES][-1]
    raise ValueError(f"Invalid input: {request}")
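%md A quick sanity check of the helper on a hand-written multi-turn request (the messages below are illustrative, not from an actual evaluation run):

sample_request = {
    "messages": [
        {"role": "user", "content": "what is 3 * 12?"},
        {"role": "assistant", "content": "3 * 12 = 36."},
        {"role": "user", "content": "Now multiply the result by 10"},
    ]
}

history, last_message = extract_message_history(sample_request)
print(history)       # The first two messages.
print(last_message)  # The latest user turn.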
# Dummy code to append a message to each of the evaluations in the last round of evaluation, for the sake of demonstration
import json

def create_multiturn_request(trace):
    trace_obj = mlflow.entities.Trace.from_json(trace)
    response = json.loads(trace_obj.data.response)
    response["messages"].append({"role": "user", "content": "Now multiply the result by 10"})
    return response

multiturn_eval_df = results.tables["eval_results"][["trace"]].assign(
    request=lambda df: df["trace"].apply(create_multiturn_request)
)[["request"]]
@metric
def num_previous_messages(request):
    # Metric to compute the number of messages in the message history
    previous_messages, _ = extract_message_history(request)
    return len(previous_messages)

results = mlflow.evaluate(
    data=multiturn_eval_df,
    model=tool_calling_agent,
    model_type="databricks-agent",
    extra_metrics=[num_previous_messages]
)

results.tables["eval_results"].display()