How to run an evaluation and view the results

Preview

This feature is in Public Preview.

This article describes how to run an evaluation and view the results using Mosaic AI Agent Evaluation.

To run an evaluation, you must specify an evaluation set. An evaluation set is a set of typical requests that a user would make to your agentic application. The evaluation set can also include the expected output for each input request. The purpose of the evaluation set is to help you measure and predict the performance of your agentic application by testing it on representative questions.

For more information about evaluation sets, including the required schema, see Evaluation sets.
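As an illustrative sketch, an evaluation set with the fields described above might look like the following pandas DataFrame. The requests and expected response here are placeholders, and expected_response is optional; see Evaluation sets for the full schema.

import pandas as pd

# Illustrative evaluation set: each row is a representative user request.
# `expected_response` is optional; include it when you want the output judged
# against a reference answer.
eval_df = pd.DataFrame([
    {
        "request": "What is Mosaic AI Agent Evaluation?",
        "expected_response": "A Databricks capability for evaluating agentic applications.",
    },
    {
        "request": "How do I log a LangChain model with MLflow?",
    },
])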

To begin evaluation, call mlflow.evaluate() from the MLflow API. mlflow.evaluate() computes quality, latency, and cost metrics for each input in the evaluation set and also computes aggregate metrics across all inputs. These metrics are also referred to as the evaluation results. The following code shows an example of calling mlflow.evaluate():

%pip install databricks-agents
dbutils.library.restartPython()

import mlflow
import pandas as pd

eval_df = pd.DataFrame(...)

# Puts the evaluation results in the current Run, alongside the logged model parameters
with mlflow.start_run():
    logged_model_info = mlflow.langchain.log_model(...)
    mlflow.evaluate(data=eval_df, model=logged_model_info.model_uri,
                    model_type="databricks-agent")

In this example, mlflow.evaluate() logs its evaluation results in the enclosing MLflow run, along with information logged by other commands (e.g., model parameters). If you call mlflow.evaluate() outside an MLflow run, it starts a new run and logs evaluation results in that run. For more information about mlflow.evaluate(), including details on the evaluation results that are logged in the run, see the MLflow documentation.
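As a minimal sketch of the second case, the following call is made outside of any active run, so MLflow creates a new run to hold the evaluation results. It assumes eval_df and logged_model_info were created as in the preceding example.

# No mlflow.start_run() here: mlflow.evaluate() starts its own run and ends it when finished.
evaluation_results = mlflow.evaluate(
    data=eval_df,                       # evaluation set
    model=logged_model_info.model_uri,  # previously logged model (assumed to exist)
    model_type="databricks-agent",
)

# The run created by the call can be looked up afterward; mlflow.last_active_run()
# is expected to return the run that evaluate() started.
print(mlflow.last_active_run().info.run_id)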

Requirements

Partner-Powered AI assistive features must be enabled for your workspace.

How to provide input to an evaluation run

There are two ways to provide input to an evaluation run:

  • Pass the application as an input argument. mlflow.evaluate() calls into the application for each input in the evaluation set and computes metrics on the generated output. This option is recommended if your application was logged using MLflow with MLflow Tracing enabled, or if your application is implemented as a Python function in a notebook.

  • Provide previously generated outputs to compare to the evaluation set. This option is recommended if your application was developed outside of Databricks, if you want to evaluate outputs from an application that is already deployed to production, or if you want to compare evaluation results between evaluation configurations.

The following code samples show a minimal example for each method. For details about the evaluation set schema, see Evaluation set schema.

  • To have the mlflow.evaluate() call generate the outputs, specify the evaluation set and the application in the function call as shown in the following code. For a more detailed example, see Example: Agent Evaluation runs application.

    evaluation_results = mlflow.evaluate(
        data=eval_set_df,  # pandas DataFrame containing just the evaluation set
        model=model,  # Reference to the MLflow model that represents the application
        model_type="databricks-agent",
    )
    
  • To provide previously generated outputs, specify only the evaluation set as shown in the following code, but ensure that it includes the generated outputs. For a more detailed example, see Example: Previously generated outputs provided.

    evaluation_results = mlflow.evaluate(
        data=eval_set_with_chain_outputs_df,  # pandas DataFrame with the evaluation set and application outputs
        model_type="databricks-agent",
    )
    

Evaluation outputs

An evaluation generates two types of outputs:

  • Data about each request in the evaluation set, including the following:

    • Inputs sent to the agentic application.

    • The response generated by the application.

    • All intermediate data generated by the application, such as retrieved_context, trace, and so on.

    • Ratings and rationales from each Databricks-specified and customer-specified LLM judge. The ratings characterize different quality aspects of the application outputs, including correctness, groundedness, retrieval precision, and so on.

    • Other metrics based on the application’s trace, including latency and token counts for different steps.

  • Aggregated metric values across the entire evaluation set, such as average and total token counts, average latencies, and so on.

These two types of outputs are returned from mlflow.evaluate() and are also logged in an MLflow run. You can inspect the outputs in the notebook or from the page of the corresponding MLflow run.

Review output in the notebook

The following code shows some examples of how to review the results of an evaluation run from your notebook.

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

###
# Run evaluation
###
evaluation_results = mlflow.evaluate(..., model_type="databricks-agent")

###
# Access aggregated metric values across the entire evaluation set
###
metrics_as_dict = evaluation_results.metrics
metrics_as_pd_df = pd.DataFrame([evaluation_results.metrics])

# Sample usage
print(f"The percentage of generated responses that are grounded: {metrics_as_dict['response/llm_judged/groundedness/percentage']}")


###
# Access data about each question in the evaluation set
###

per_question_results_df = evaluation_results.tables['eval_results']

# Show information about responses that are not grounded
display(per_question_results_df[per_question_results_df["response/llm_judged/groundedness/rating"] == "no"])

The per-request DataFrame includes all of the columns in the input schema and all computed metrics specific to each request. For more details about each reported metric, see Use agent metrics & LLM judges to evaluate app performance.
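If you are unsure which per-request columns were produced for your agent, a quick way to inspect them is shown below. The exact column names depend on which judges and metrics ran, so the latency column used here is an assumption; check the printed list before relying on it.

# List every column available in the per-request results
print(per_question_results_df.columns.tolist())

# Example: sort requests by a latency column, if it is present for your agent
latency_col = "agent/latency_seconds"  # assumed column name; verify against the printed list
if latency_col in per_question_results_df.columns:
    display(per_question_results_df.sort_values(latency_col, ascending=False).head(10))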

Review output using the MLflow UI

Results of your evaluation are also available in the MLflow UI. To access the MLflow UI, click the Experiment icon in the notebook's right sidebar, or click the links that appear in the cell results of the notebook cell in which you ran mlflow.evaluate().

Aggregated metrics across the full evaluation set (MLflow UI)

To see aggregated metric values across the full evaluation set, click the chart icon on the Experiment page. This allows you to visualize the metrics for the selected run and compare them to past runs.


You can also see the aggregated metric values on the run page, using either the Overview tab (for numerical values) or the Model metrics tab (for charts).

Aggregated metrics on the Overview tab


Aggregated metrics on the Model metrics tab


Data about each request in the evaluation set (MLflow UI)

To view data for each individual request in the evaluation set, click the Evaluation tab on the Experiment page. A table shows each question in the evaluation set. Use the drop-down menus to select the columns to view.


You can also see the results from the run page. Click the Artifacts tab and then select the eval_results.json artifact.

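Because the per-request table is stored on the run as the eval_results.json artifact, you can also reload it later in a notebook without rerunning the evaluation. The following is a minimal sketch, assuming a recent MLflow version and that the artifact is stored in the table format read by mlflow.load_table(); the run ID is a placeholder.

import mlflow

# Reload the per-request evaluation table from a previous run.
# Replace the placeholder with the ID of the run that contains eval_results.json.
previous_run_id = "<run_id>"
eval_results_df = mlflow.load_table("eval_results.json", run_ids=[previous_run_id])
display(eval_results_df)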

Examples of mlflow.evaluate() calls

This section includes code samples of mlflow.evaluate() calls, illustrating options for passing the application and the evaluation set to the call.

Example: Agent Evaluation runs application

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

###
# mlflow.evaluate() call
###
evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame with just the evaluation set
    model=model,  # Reference to the application
    model_type="databricks-agent",
)

###
# There are 4 options for passing an application in the `model` argument.
####

#### Option 1. Reference to a Unity Catalog registered model
model = "models:/catalog.schema.model_name/1"  # 1 is the version number

#### Option 2. Reference to an MLflow logged model in the current MLflow Experiment
model = "runs:/6b69501828264f9s9a64eff825371711/chain"
# `6b69501828264f9s9a64eff825371711` is the run_id, `chain` is the artifact_path that was
# passed when calling mlflow.xxx.log_model(...).
# If you called model_info = mlflow.langchain.log_model() or mlflow.pyfunc.log_model(), you can access this value using `model_info.model_uri`.

#### Option 3. A PyFunc model that is loaded in the notebook
model = mlflow.pyfunc.load_model(...)

#### Option 4. A local function in the notebook
def model_fn(model_input):
  # code that implements the application
  response = 'the answer!'
  return response

model = model_fn

###
# `data` is a pandas DataFrame with your evaluation set.
# These are simple examples. See the input schema for details.
####

# You do not have to start from a dictionary - you can use any existing pandas or
# Spark DataFrame with this schema.

# Minimal evaluation set
bare_minimum_eval_set_schema = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }]

# Complete evaluation set
complete_eval_set_schema = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                # In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
                "content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }]

#### Convert dictionary to a pandas DataFrame
eval_set_df = pd.DataFrame(bare_minimum_eval_set_schema)

#### Use a Spark DataFrame
spark_df = spark.table("catalog.schema.table")  # or any other way to get a Spark DataFrame
eval_set_df = spark_df.toPandas()

Example: Previously generated outputs provided

For the required evaluation set schema, see Evaluation sets.

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

###
# mlflow.evaluate() call
###
evaluation_results = mlflow.evaluate(
    data=eval_set_with_app_outputs_df,  # pandas DataFrame with the evaluation set and application outputs
    model_type="databricks-agent",
)

###
# `data` is a pandas DataFrame with your evaluation set and outputs generated by the application.
# These are simple examples. See the input schema for details.
####

# You do not have to start from a dictionary - you can use any existing pandas or
# Spark DataFrame with this schema.

# Bare minimum data
bare_minimum_input_schema = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }]

complete_input_schema = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                # In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
                "content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional. If provided, the Databricks Context Relevance LLM Judge is executed to check the `content`'s relevance to the `request`.
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }]

#### Convert dictionary to a pandas DataFrame
eval_set_with_app_outputs_df = pd.DataFrame(bare_minimum_input_schema)

#### Use a Spark DataFrame
spark_df = spark.table("catalog.schema.table")  # or any other way to get a Spark DataFrame
eval_set_with_app_outputs_df = spark_df.toPandas()