Evaluation sets

Preview

This feature is in Public Preview.

To measure the quality of an agentic application, you need to be able to define what a high-quality, accurate response looks like. You do that by providing an evaluation set. This article covers the required schema of the evaluation set, which metrics are calculated based on what data is present in the evaluation set, and some best practices for creating an evaluation set.

Databricks recommends creating a human-labeled evaluation set, which consists of representative questions and ground-truth answers. If your application includes a retrieval step, you can also provide the supporting documents that you expect the response to be based on.

A good evaluation set has the following characteristics:

  • Representative: It should accurately reflect the range of requests the application will encounter in production.

  • Challenging: It should include difficult and diverse cases to effectively test the full range of the application’s capabilities.

  • Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.

To learn how to run an evaluation using the evaluation set, see How to run an evaluation and view the results.

Evaluation set schema

The following table shows the schema required for the DataFrame provided in the mlflow.evaluate() call.

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| request_id | string | Unique identifier of the request. | Optional | Optional |
| request | string | Input to the application to evaluate, the user's question or query. For example, "What is RAG?" | Required | Required |
| expected_retrieved_context | array | Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). See Schema for arrays in evaluation set. | Optional | Optional |
| expected_response | string | Ground-truth (correct) answer for the input request. | Optional | Optional |
| response | string | Response generated by the application being evaluated. | Generated by Agent Evaluation | Optional. If not provided, derived from the trace. Either response or trace is required. |
| retrieved_context | array | Retrieval results generated by the retriever in the application being evaluated. If the application includes multiple retrieval steps, these are the retrieval results from the last step (chronologically in the trace). See Schema for arrays in evaluation set. | Generated by Agent Evaluation | Optional. If not provided, derived from the provided trace. |
| trace | JSON string of MLflow Trace | MLflow Trace of the application's execution on the corresponding request. | Generated by Agent Evaluation | Optional. Either response or trace is required. |
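The evaluation set is typically defined as a list of dictionaries with these columns and converted to a pandas DataFrame before being passed to mlflow.evaluate(). The following minimal sketch is illustrative only; the request text, identifier, document URI, and expected response are placeholders.

import pandas as pd

# Minimal evaluation set conforming to the schema above.
# Only `request` is required; the other columns provide ground truth
# that enables additional metrics.
eval_set = [
    {
        "request_id": "request-1",                  # optional
        "request": "What is RAG?",                  # required
        "expected_retrieved_context": [             # optional
            {"doc_uri": "doc_uri_1"},
        ],
        "expected_response": "RAG stands for retrieval-augmented generation.",  # optional
    }
]

eval_df = pd.DataFrame(eval_set)  # DataFrame to pass to mlflow.evaluate()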

Schema for arrays in evaluation set

The schema of the arrays expected_retrieved_context and retrieved_context is shown in the following table:

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| content | string | Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown. | Optional | Optional |
| doc_uri | string | Unique identifier (URI) of the parent document where the chunk came from. | Required | Required |

Metrics available when the application is passed in through the model input argument

The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations that take the application as an input argument. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

For details about what these metrics measure, see Use agent metrics & LLM judges to evaluate app performance.

| Calculated metric | request | request and expected_response | request, expected_response, and expected_retrieved_context |
|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X |
| response/llm_judged/safety/rating | X | X | X |
| response/llm_judged/groundedness/rating | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | X | X | X |
| agent/total_token_count | X | X | X |
| agent/input_token_count | X | X | X |
| agent/output_token_count | X | X | X |
| response/llm_judged/correctness/rating | | X | X |
| retrieval/ground_truth/document_recall | | | X |

Sample evaluation set with only request

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

Sample evaluation set with request and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with request, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]
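To evaluate with the application passed as an input argument, pass the evaluation set and the application to mlflow.evaluate() with model_type set to "databricks-agent". A minimal sketch follows, assuming the application has already been logged to MLflow; the model URI is a placeholder.

import mlflow
import pandas as pd

# Agent Evaluation runs the application on each `request` and generates
# `response`, `retrieved_context`, and `trace` before computing metrics.
evaluation_results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),      # evaluation set from the samples above
    model="runs:/<run_id>/chain",     # placeholder URI of the logged application
    model_type="databricks-agent",    # enables Mosaic AI Agent Evaluation
)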

Metrics available when application outputs are provided

The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations where you provide a DataFrame with the evaluation set and application outputs. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

| Calculated metric | request and response | request, response, and retrieved_context | request, response, retrieved_context, and expected_response | request, response, retrieved_context, expected_response, and expected_retrieved_context |
|---|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X | X |
| response/llm_judged/safety/rating | X | X | X | X |
| agent/request_token_count | X | X | X | X |
| agent/response_token_count | X | X | X | X |
| Customer-defined LLM judges | X | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | | X | X | X |
| response/llm_judged/groundedness/rating | | X | X | X |
| response/llm_judged/correctness/rating | | | X | X |
| retrieval/ground_truth/document_recall | | | | X |

Sample evaluation set with only request and response

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]

Sample evaluation set with request, response, and retrieved_context

eval_set = [
    {
        "request_id": "request-id", # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
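Because these evaluation sets already contain the application's outputs, there is no application to invoke, so the model argument is omitted. A minimal sketch, assuming the evaluation set has been assigned to eval_set as in the samples above:

import mlflow
import pandas as pd

# Evaluate previously generated outputs: the LLM judges score the provided
# `response` and `retrieved_context` directly, without running the application.
evaluation_results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model_type="databricks-agent",
)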

Best practices for developing an evaluation set

  • Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and ability to infer answers from indirect evidence.

  • Consider testing adversarial scenarios from malicious users.

  • There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.

  • Consider including examples that are very challenging, even for humans to answer.

  • Whether you are building a general-purpose application or targeting a specific domain, your app will likely encounter a wide variety of questions. The evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still consider testing other domains (for example, operations), to ensure that the application does not hallucinate or provide harmful responses.

  • High-quality, consistent human-generated labels are the best way to ensure that the ground truth values that you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels are the following:

    • Aggregate responses (labels) from multiple human labelers for the same question.

    • Ensure that labeling instructions are clear and that the human labelers are consistent.

    • Ensure that the conditions for the human-labeling process are identical to the format of requests submitted to the RAG application.

  • Human labelers are by nature noisy and inconsistent, for example due to different interpretations of the question. This noise is an important part of the process: human labeling can reveal interpretations of questions that you had not considered, which might provide insight into behavior you observe in your application.