How to run an evaluation and view the results
Preview
This feature is in Public Preview.
This article describes how to run an evaluation and view the results as you develop your AI application. For information about how to monitor the quality of deployed agents on production traffic, see How to monitor the quality of your agent on production traffic.
To use Agent Evaluation during app development, you must specify an evaluation set. An evaluation set is a set of typical requests that a user would make to your application. The evaluation set can also include the expected response (ground truth) for each input request. If the expected response is provided, Agent Evaluation can compute additional quality metrics, such as correctness and context sufficiency. The purpose of the evaluation set is to help you measure and predict the performance of your agentic application by testing it on representative questions.
For more information about evaluation sets, see Evaluation sets. For the required schema, see Agent Evaluation input schema.
To begin evaluation, you use the mlflow.evaluate()
method from the MLflow API. mlflow.evaluate()
computes quality assessments along with latency and cost metrics for each input in the evaluation set, and also aggregates these results across all inputs. These results are also referred to as the evaluation results. The following code shows an example of calling mlflow.evaluate()
:
%pip install databricks-agents
dbutils.library.restartPython()
import mlflow
import pandas as pd
eval_df = pd.DataFrame(...)
# Puts the evaluation results in the current Run, alongside the logged model parameters
with mlflow.start_run():
logged_model_info = mlflow.langchain.log_model(...)
mlflow.evaluate(data=eval_df, model=logged_model_info.model_uri,
model_type="databricks-agent")
In this example, mlflow.evaluate()
logs its evaluation results in the enclosing MLflow run, along with information logged by other commands (such as model parameters). If you call mlflow.evaluate()
outside an MLflow run, it starts a new run and logs evaluation results in that run. For more information about mlflow.evaluate()
, including details on the evaluation results that are logged in the run, see the MLflow documentation.
Requirements
Partner-powered AI assistive features must be enabled for your workspace.
How to provide input to an evaluation run
There are two ways to provide input to an evaluation run:
Provide previously generated outputs to compare to the evaluation set. This option is recommended if you want to evaluate outputs from an application that is already deployed to production, or if you want to compare evaluation results between evaluation configurations.
With this option, you specify an evaluation set as shown in the following code. The evaluation set must include previously generated outputs. For more detailed examples, see Example: How to pass previously generated outputs to Agent Evaluation.
evaluation_results = mlflow.evaluate( data=eval_set_with_chain_outputs_df, # pandas DataFrame with the evaluation set and application outputs model_type="databricks-agent", )
Pass the application as an input argument.
mlflow.evaluate()
calls into the application for each input in the evaluation set and reports quality assessments and other metrics for each generated output. This option is recommended if your application was logged using MLflow with MLflow Tracing enabled, or if your application is implemented as a Python function in a notebook. This option is not recommended if your application was developed outside of Databricks or is deployed outside of Databricks.With this option, you specify the evaluation set and the application in the function call as shown in the following code. For more detailed examples, see Example: How to pass an application to Agent Evaluation.
evaluation_results = mlflow.evaluate( data=eval_set_df, # pandas DataFrame containing just the evaluation set model=model, # Reference to the MLflow model that represents the application model_type="databricks-agent", )
For details about the evaluation set schema, see Agent Evaluation input schema.
Evaluation outputs
Agent Evaluation returns its outputs from mlflow.evaluate()
as dataframes and also logs these outputs to the MLflow run. You can inspect the outputs in the notebook or from the page of the corresponding MLflow run.
Review output in the notebook
The following code shows some examples of how to review the results of an evaluation run from your notebook.
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
###
# Run evaluation
###
evaluation_results = mlflow.evaluate(..., model_type="databricks-agent")
###
# Access aggregated evaluation results across the entire evaluation set
###
results_as_dict = evaluation_results.metrics
results_as_pd_df = pd.DataFrame([evaluation_results.metrics])
# Sample usage
print(f"The percentage of generated responses that are grounded: {results_as_dict['response/llm_judged/groundedness/percentage']}")
###
# Access data about each question in the evaluation set
###
per_question_results_df = evaluation_results.tables['eval_results']
# Show information about responses that are not grounded
per_question_results_df[per_question_results_df["response/llm_judged/groundedness/rating"] == "no"].display()
The per_question_results_df
dataframe includes all of the columns in the input schema and all evaluation results specific to each request. For more details about the computed results, see How quality, cost, and latency are assessed by Agent Evaluation.
Review output using the MLflow UI
Evaluation results are also available in the MLflow UI. To access the MLflow UI, click on the Experiment icon in notebook’s right sidebar and then on the corresponding run, or click the links that appear in the cell results for the notebook cell in which you ran mlflow.evaluate()
.
Review evaluation results for a single run
This section describes how to review the evaluation results for an individual run. To compare results across runs, see Compare evaluation results across runs.
Overview of quality assessments by LLM judges
Per-request judge assessments are available in databricks-agents
version 0.3.0 and above.
To see an overview of the LLM-judged quality of each request in the evaluation set, click the Evaluation results tab on the MLflow Run page. This page shows a summary table of each evaluation run. For more details, click the Evaluation ID of a run.
This overview shows the assessments of different judges for each request, the quality-pass/-fail status of each request based on these assessments, and the root-cause for failed requests. Clicking on a row in the table will take you to the details page for that request that includes the following:
Model output: The generated response from the agentic app and its trace if included.
Expected output: The expected response for each request.
Detailed assessments: The assessments of the LLM judges on this data. Click See details to display the justifications provided by the judges.
Compare evaluation results across runs
It’s important to compare evaluation results across runs to see how your agentic application responds to changes. Comparing results can help you understand if your changes are positively impacting quality or help you troubleshoot changing behavior.
Compare per-request results across runs
To compare data for each individual request across runs, click the Evaluation tab on the Experiment page. A table shows each question in the evaluation set. Use the drop-down menus to select the columns to view.
Compare aggregated results across runs
You can access the same aggregated results from the Experiment page, which also allows you to compare results across different runs. To access the Experiment page, click the Experiment icon in notebook’s right sidebar, or click the links that appear in the cell results for the notebook cell in which you ran mlflow.evaluate()
.
On the Experiment page, click . This allows you to visualize the aggregated results for the selected run and compare to past runs.
Which judges are run
By default, for each evaluation record, Mosaic AI Agent Evaluation applies the subset of judges that best matches the information present in the record. Specifically:
If the record includes a ground-truth response, Agent Evaluation applies the
context_sufficiency
,groundedness
,correctness
, andsafety
judges.If the record does not include a ground-truth response, Agent Evaluation applies the
chunk_relevance
,groundedness
,relevance_to_query
, andsafety
judges.
You can also explicitly specify the judges to apply to each request by using the evaluator_config
argument of mlflow.evaluate()
as follows:
# Complete list of built-in LLM judges
# "chunk_relevance", "context_sufficiency", "correctness", "groundedness", "relevance_to_query", "safety"
evaluation_results = mlflow.evaluate(
data=eval_df,
model_type="databricks-agent",
evaluator_config={
"databricks-agent": {
# Run only LLM judges that don't require ground-truth. Use an empty list to not run any built-in judge.
"metrics": ["groundedness", "relevance_to_query", "chunk_relevance", "safety"]
}
}
)
Note
You cannot disable the non-LLM judge metrics for chunk retrieval, chain token counts, or latency.
In addition to the built-in judges, you can define a custom LLM judge to evaluate criteria specific to your use case. See Customize LLM judges.
See Information about the models powering LLM judges for LLM judge trust and safety information.
For more details on the evaluation results and metrics, see How quality, cost, and latency are assessed by Agent Evaluation.
Example: How to pass an application to Agent Evaluation
To pass an application to mlflow_evaluate()
, use the model
argument. There are 5 options for passing an application in the model
argument.
A model registered in Unity Catalog.
An MLflow logged model in the current MLflow experiment.
A PyFunc model that is loaded in the notebook.
A local function in the notebook.
A deployed agent endpoint.
See the following sections for code examples illustrating each option.
Option 1. Model registered in Unity Catalog
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
evaluation_results = mlflow.evaluate(
data=eval_set_df, # pandas DataFrame with just the evaluation set
model = "models:/catalog.schema.model_name/1" # 1 is the version number
model_type="databricks-agent",
)
Option 2. MLflow logged model in the current MLflow experiment
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
# In the following lines, `6b69501828264f9s9a64eff825371711` is the run_id, and `chain` is the artifact_path that was
# passed with mlflow.xxx.log_model(...).
# If you called model_info = mlflow.langchain.log_model() or mlflow.pyfunc.log_model(), you can access this value using `model_info.model_uri`.
evaluation_results = mlflow.evaluate(
data=eval_set_df, # pandas DataFrame with just the evaluation set
model = "runs:/6b69501828264f9s9a64eff825371711/chain"
model_type="databricks-agent",
)
Option 3. PyFunc model that is loaded in the notebook
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
evaluation_results = mlflow.evaluate(
data=eval_set_df, # pandas DataFrame with just the evaluation set
model = mlflow.pyfunc.load_model(...)
model_type="databricks-agent",
)
Option 4. Local function in the notebook
The function receives an input formatted as follows:
{
"messages": [
{
"role": "user",
"content": "What is MLflow?",
}
],
...
}
The function must return a value in one of the following three supported formats:
Plain string containing the response of the model.
A dictionary in
ChatCompletionResponse
format. For example:{ "choices": [ { "message": { "role": "assistant", "content": "MLflow is a machine learning toolkit.", }, ... } ], ..., }
A dictionary in
StringResponse
format, such as{ "content": "MLflow is a machine learning toolkit.", ... }
.
The following example uses a local function to wrap a foundation model endpoint and evaluate it:
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
def model(model_input):
client = mlflow.deployments.get_deploy_client("databricks")
return client.predict(endpoint="endpoints:/databricks-meta-llama-3-1-405b-instruct", inputs={"messages": model_input["messages"]})
evaluation_results = mlflow.evaluate(
data=eval_set_df, # pandas DataFrame with just the evaluation set
model = model
model_type="databricks-agent",
)
Option 5. Deployed agent endpoint
This option only works when you use agent endpoints that have been deployed using databricks.agents.deploy
and with databricks-agents
SDK version 0.8.0
or above. For foundation models or older SDK versions, use Option 4 to wrap the model in a local function.
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
# In the following lines, `endpoint-name-of-your-agent` is the name of the agent endpoint.
evaluation_results = mlflow.evaluate(
data=eval_set_df, # pandas DataFrame with just the evaluation set
model = "endpoints:/endpoint-name-of-your-agent"
model_type="databricks-agent",
)
How to pass the evaluation set when the application is included in the mlflow_evaluate()
call
In the following code, data
is a pandas DataFrame with your evaluation set. These are simple examples. See the input schema for details.
# You do not have to start from a dictionary - you can use any existing pandas or Spark DataFrame with this schema.
# Minimal evaluation set
bare_minimum_eval_set_schema = [
{
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
}]
# Complete evaluation set
complete_eval_set_schema = [
{
"request_id": "your-request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_retrieved_context": [
{
# In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
"content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
"doc_uri": "doc_uri_2_1",
},
{
"content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
"doc_uri": "doc_uri_2_2",
},
],
"expected_response": "There's no significant difference.",
}]
# Convert dictionary to a pandas DataFrame
eval_set_df = pd.DataFrame(bare_minimum_eval_set_schema)
# Use a Spark DataFrame
import numpy as np
spark_df = spark.table("catalog.schema.table") # or any other way to get a Spark DataFrame
eval_set_df = spark_df.toPandas()
Example: How to pass previously generated outputs to Agent Evaluation
This section describes how to pass previously generated outputs in the mlflow_evaluate()
call. For the required evaluation set schema, see Agent Evaluation input schema.
In the following code, data
is a pandas DataFrame with your evaluation set and outputs generated by the application. These are simple examples. See the input schema for details.
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
evaluation_results = mlflow.evaluate(
data=eval_set_with_app_outputs_df, # pandas DataFrame with the evaluation set and application outputs
model_type="databricks-agent",
)
# You do not have to start from a dictionary - you can use any existing pandas or Spark DataFrame with this schema.
# Minimum required input
bare_minimum_input_schema = [
{
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
}]
# Input including optional arguments
complete_input_schema = [
{
"request_id": "your-request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_retrieved_context": [
{
# In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
"content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
"doc_uri": "doc_uri_2_1",
},
{
"content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
"doc_uri": "doc_uri_2_2",
},
],
"expected_response": "There's no significant difference.",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
"retrieved_context": [
{
# In `retrieved_context`, `content` is optional. If provided, the Databricks Context Relevance LLM Judge is executed to check the `content`'s relevance to the `request`.
"content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
"doc_uri": "doc_uri_2_1",
},
{
"content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
"doc_uri": "doc_uri_6_extra",
},
],
}]
# Convert dictionary to a pandas DataFrame
eval_set_with_app_outputs_df = pd.DataFrame(bare_minimum_input_schema)
# Use a Spark DataFrame
import numpy as np
spark_df = spark.table("catalog.schema.table") # or any other way to get a Spark DataFrame
eval_set_with_app_outputs_df = spark_df.toPandas()
Example: Use a custom function to process responses from LangGraph
LangGraph agents, especially those with chat functionality, can return multiple messages for a single inference call. It is the user’s responsibility to convert the agent’s response to a format that Agent Evaluation supports.
One approach is to use a custom function to process the response. The following example shows a custom function that extracts the last chat message from a LangGraph model. This function is then used in mlflow.evaluate()
to return a single string response, which can be compared to the ground_truth
column.
The example code makes the following assumptions:
The model accepts input in the format {“messages”: [{“role”: “user”, “content”: “hello”}]}.
The model returns a list of strings in the format [“response 1”, “response 2”].
The following code sends the concatenated responses to the judge in this format: “response 1nresponse2”
import mlflow
import pandas as pd
from typing import List
loaded_model = mlflow.langchain.load_model(model_uri)
eval_data = pd.DataFrame(
{
"inputs": [
"What is MLflow?",
"What is Spark?",
],
"expected_response": [
"MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
"Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
],
}
)
def custom_langgraph_wrapper(model_input):
predictions = loaded_model.invoke({"messages": model_input["messages"]})
# Assuming `predictions` is a list of strings
return predictions.join("\n")
with mlflow.start_run() as run:
results = mlflow.evaluate(
custom_langgraph_wrapper, # Pass the function defined above
data=eval_data,
model_type="databricks-agent",
)
print(results.metrics)
Create a dashboard with metrics
When you are iterating on the quality of your agent, you might want to share a dashboard with your stakeholders that shows how the quality has improved over time. You can extract the metrics from your MLflow evaluation runs, save the values into a Delta table, and create a dashboard.
The following example shows how to extract and save the metric values from the most recent evaluation run in your notebook:
uc_catalog_name = "catalog"
uc_schema_name = "schema"
table_name = "results"
eval_results = mlflow.evaluate(
model=logged_agent_info.model_uri, # use the logged Agent
data=evaluation_set, # Run the logged Agent for all queries defined above
model_type="databricks-agent", # use Agent Evaluation
)
# The `append_metrics_to_table function` is defined below
append_metrics_to_table("<identifier-for-table>", eval_results.metrics, f"{uc_catalog_name}.{uc_schema_name}.{table_name}")
The following example shows how to extract and save metric values for past runs that you have saved in your MLflow experiment.
import pandas as pd
def get_mlflow_run(experiment_name, run_name):
runs = mlflow.search_runs(experiment_names=[experiment_name], filter_string=f"run_name = '{run_name}'", output_format="list")
if len(runs) != 1:
raise ValueError(f"Found {len(runs)} runs with name {run_name}. {run_name} must identify a single run. Alternatively, you can adjust this code to search for a run based on `run_id`")
return runs[0]
run = get_mlflow_run(experiment_name ="/Users/<user_name>/db_docs_mlflow_experiment", run_name="evaluation__2024-10-09_02:27:17_AM")
# The `append_metrics_to_table` function is defined below
append_metrics_to_table("<identifier-for-table>", run.data.metrics, f"{uc_catalog_name}.{uc_schema_name}.{table_name}")
You can now create a dashboard using this data.
The following code defines the function append_metrics_to_table
that is used in the previous examples.
# Definition of `append_metrics_to_table`
def append_metrics_to_table(run_name, mlflow_metrics, delta_table_name):
data = mlflow_metrics.copy()
# Add identifying run_name and timestamp
data["run_name"] = run_name
data["timestamp"] = pd.Timestamp.now()
# Remove metrics with error counts
data = {k: v for k, v in mlflow_metrics.items() if "error_count" not in k}
# Convert to a Spark DataFrame(
metrics_df = pd.DataFrame([data])
metrics_df_spark = spark.createDataFrame(metrics_df)
# Append to the Delta table
metrics_df_spark.write.mode("append").saveAsTable(delta_table_name)
Limitation
For multi-turn conversations, the evaluation output records only the last entry in the conversation.
Information about the models powering LLM judges
LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
For Azure OpenAI, Databricks has opted out of Abuse Monitoring so no prompts or responses are stored with Azure OpenAI.
For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
Disabling Partner-powered AI assistive features prevents the LLM judge from calling Partner-powered models.
Data sent to the LLM judge is not used for any model training.
LLM judges are intended to help customers evaluate their RAG applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.