Evaluation sets

Preview

This feature is in Public Preview.

To measure the quality of an agentic application, you need to be able to define what a high-quality, accurate response looks like. You do that by providing an evaluation set. This article covers the required schema of the evaluation set, which metrics are calculated based on what data is present in the evaluation set, and some best practices for creating an evaluation set.

Databricks recommends creating a human-labeled evaluation set, which consists of representative questions and ground-truth answers. If your application includes a retrieval step, you can also provide the supporting documents that you expect the response to be based on.

A good evaluation set has the following characteristics:

  • Representative: It should accurately reflect the range of requests the application will encounter in production.

  • Challenging: It should include difficult and diverse cases to effectively test the full range of the application’s capabilities.

  • Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.

To learn how to run an evaluation using the evaluation set, see How to run an evaluation and view the results.

Evaluation set schema

The following table shows the schema required for the DataFrame provided in the mlflow.evaluate() call.

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| request_id | string | Unique identifier of the request. | Optional | Optional |
| request | string | Input to the application to evaluate, the user's question or query. For example, "What is RAG?" | Required | Required |
| expected_retrieved_context | array | Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). See Schema for arrays in evaluation set. | Optional | Optional |
| expected_response | string | Ground-truth (correct) answer for the input request. | Optional | Optional |
| response | string | Response generated by the application being evaluated. | Generated by Agent Evaluation | Optional. If not provided, derived from the trace. Either response or trace is required. |
| retrieved_context | array | Retrieval results generated by the retriever in the application being evaluated. If the application includes multiple retrieval steps, these are the retrieval results from the last step (chronologically in the trace). See Schema for arrays in evaluation set. | Generated by Agent Evaluation | Optional. If not provided, derived from the provided trace. |
| trace | JSON string of MLflow Trace | MLflow Trace of the application's execution on the corresponding request. | Generated by Agent Evaluation | Optional. Either response or trace is required. |
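The evaluation set is typically defined as a list of dictionaries with these columns and converted to a pandas DataFrame before being passed to mlflow.evaluate(). The following minimal sketch is illustrative only; the request text, identifier, document URI, and expected response are placeholders.

import pandas as pd

# Minimal evaluation set conforming to the schema above.
# Only `request` is required; the other columns provide ground truth
# that enables additional metrics.
eval_set = [
    {
        "request_id": "request-1",                  # optional
        "request": "What is RAG?",                  # required
        "expected_retrieved_context": [             # optional
            {"doc_uri": "doc_uri_1"},
        ],
        "expected_response": "RAG stands for retrieval-augmented generation.",  # optional
    }
]

eval_df = pd.DataFrame(eval_set)  # DataFrame to pass to mlflow.evaluate()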

Schema for arrays in evaluation set

The schema of the arrays expected_retrieved_context and retrieved_context is shown in the following table:

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| content | string | Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown. | Optional | Optional |
| doc_uri | string | Unique identifier (URI) of the parent document where the chunk came from. | Required | Required |

Metrics available when the application is passed in through the model input argument

The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations that take the application as an input argument. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

For details about what these metrics measure, see Use agent metrics & LLM judges to evaluate app performance.

| Calculated metric | request | request and expected_response | request, expected_response, and expected_retrieved_context |
|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X |
| response/llm_judged/safety/rating | X | X | X |
| response/llm_judged/groundedness/rating | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | X | X | X |
| agent/total_token_count | X | X | X |
| agent/input_token_count | X | X | X |
| agent/output_token_count | X | X | X |
| response/llm_judged/correctness/rating | | X | X |
| retrieval/ground_truth/document_recall | | | X |

Sample evaluation set with only request

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

Sample evaluation set with request and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with request, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]
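To evaluate with the application passed as an input argument, pass the evaluation set and the application to mlflow.evaluate() with model_type set to "databricks-agent". A minimal sketch follows, assuming the application has already been logged to MLflow; the model URI is a placeholder.

import mlflow
import pandas as pd

# Agent Evaluation runs the application on each `request` and generates
# `response`, `retrieved_context`, and `trace` before computing metrics.
evaluation_results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),      # evaluation set from the samples above
    model="runs:/<run_id>/chain",     # placeholder URI of the logged application
    model_type="databricks-agent",    # enables Mosaic AI Agent Evaluation
)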

Metrics available when application outputs are provided

The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations where you provide a DataFrame with the evaluation set and application outputs. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

| Calculated metric | request and response | request, response, and retrieved_context | request, response, retrieved_context, and expected_response | request, response, retrieved_context, expected_response, and expected_retrieved_context |
|---|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X | X |
| response/llm_judged/safety/rating | X | X | X | X |
| agent/request_token_count | X | X | X | X |
| agent/response_token_count | X | X | X | X |
| Customer-defined LLM judges | X | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | | X | X | X |
| response/llm_judged/groundedness/rating | | X | X | X |
| response/llm_judged/correctness/rating | | | X | X |
| retrieval/ground_truth/document_recall | | | | X |

Sample evaluation set with only request and response

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]

Sample evaluation set with request, response, and retrieved_context

eval_set = [
    {
        "request_id": "request-id", # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
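Because these evaluation sets already contain the application's outputs, there is no application to invoke, so the model argument is omitted. A minimal sketch, assuming the evaluation set has been assigned to eval_set as in the samples above:

import mlflow
import pandas as pd

# Evaluate previously generated outputs: the LLM judges score the provided
# `response` and `retrieved_context` directly, without running the application.
evaluation_results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model_type="databricks-agent",
)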

Best practices for developing an evaluation set

  • Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and ability to infer answers from indirect evidence.

  • Consider testing adversarial scenarios from malicious users.

  • There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.

  • Consider including examples that are very challenging, even for humans to answer.

  • Whether you are building a general-purpose application or targeting a specific domain, your app will likely encounter a wide variety of questions. The evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still consider testing other domains (for example, operations), to ensure that the application does not hallucinate or provide harmful responses.

  • High-quality, consistent human-generated labels are the best way to ensure that the ground truth values that you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels are the following:

    • Aggregate responses (labels) from multiple human labelers for the same question.

    • Ensure that labeling instructions are clear and that the human labelers are consistent.

    • Ensure that the conditions for the human-labeling process are identical to the format of requests submitted to the RAG application.

  • Human labelers are by nature noisy and inconsistent, for example due to different interpretations of the question. This noise is an important part of the process: human labeling can reveal interpretations of questions that you had not considered, which might provide insight into behavior you observe in your application.