Evaluation sets
Preview
This feature is in Public Preview.
To measure the quality of an agentic application, you need to be able to define what a high-quality, accurate response looks like. You do that by providing an evaluation set. This article covers the required schema of the evaluation set, which metrics are calculated based on what data is present in the evaluation set, and some best practices for creating an evaluation set.
Databricks recommends creating a human-labeled evaluation set. This is a set of representative questions and ground-truth answers. You can also optionally provide the supporting documents that you expect the response to be based on if your application includes a retrieval step.
A good evaluation set has the following characteristics:
Representative: It should accurately reflect the range of requests the application will encounter in production.
Challenging: It should include difficult and diverse cases to effectively test the full range of the application’s capabilities.
Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.
To learn how to run an evaluation using the evaluation set, see How to run an evaluation and view the results.
Evaluation set schema
The following table shows the schema required for the DataFrame provided in the mlflow.evaluate()
call.
Column |
Data type |
Description |
Application passed as input argument |
Previously generated outputs provided |
---|---|---|---|---|
request_id |
string |
Unique identifier of request. |
Optional |
Optional |
request |
string |
Input to the application to evaluate, user’s question or query. For example, “What is RAG?” See Schema for request. |
Required |
Required |
expected_retrieved_context |
array |
Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). Array schema |
Optional |
Optional |
expected_response |
string |
Ground-truth (correct) answer for the input request. See expected_response guidelines. |
Optional |
Optional |
response |
string |
Response generated by the application being evaluated. |
Generated by Agent Evaluation |
Optional. If not provided then derived from the Trace. Either |
retrieved_context |
array |
Retrieval results generated by the retriever in the application being evaluated. If multiple retrieval steps are in the application, this is the retrieval results from the last step (chronologically in the trace). Array schema |
Generated by Agent Evaluation |
Optional. If not provided then derived from the provided trace. |
trace |
JSON string of MLflow Trace |
MLflow Trace of the application’s execution on the corresponding request. |
Generated by Agent Evaluation |
Optional. Either |
expected_response
guidelines
The ground truth expected_response
should include only the minimal set of facts that is required for a correct response. If you copy a response from another source, be sure to edit the response to remove any text that is not required for an answer to be considered correct.
Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.
Schema for request
The request schema can be one of the following:
A plain string. This format supports single-turn conversations only.
A
messages
field that follows the OpenAI chat completion schema and can encode the full conversation.A
query
string field for the most recent request and an optionalhistory
field that encodes previous turns of the conversation.
For multi-turn chat applications, use the second or third option above.
The following example shows all three options in the same request
column of the evaluation dataset:
import pandas as pd
data = {
"request": [
# Plain string
"What is the difference between reduceByKey and groupByKey in Spark?",
# Using the `messages` field for a single- or multi-turn chat
{
"messages": [
{
"role": "user",
"content": "How can you minimize data shuffling in Spark?"
}
]
},
# Using the query and history fields for a single- or multi-turn chat
{
"query": "Explain broadcast variables in Spark. How do they enhance performance?",
"history": [
{
"role": "user",
"content": "What are broadcast variables?"
},
{
"role": "assistant",
"content": "Broadcast variables allow the programmer to keep a read-only variable cached on each machine."
}
]
}
],
"expected_response": [
"expected response for first question",
"expected response for second question",
"expected response for third question"
]
}
eval_dataset = pd.DataFrame(data)
Schema for arrays in evaluation set
The schema of the arrays expected_retrieved_context
and retrieved_context
is shown in the following table:
Column |
Data type |
Description |
Application passed as input argument |
Previously generated outputs provided |
---|---|---|---|---|
content |
string |
Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown. |
Optional |
Optional |
doc_uri |
string |
Unique identifier (URI) of the parent document where the chunk came from. |
Required |
Required |
Metrics available when the application is passed in through the model
input argument
The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations that take the application as an input argument. The columns indicate the data included in the evaluation set, and an X
indicates that the metric is supported when that data is provided.
For details about what these metrics measure, see Use agent metrics & LLM judges to evaluate app performance.
Calculated metrics |
|
|
|
---|---|---|---|
|
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
|
✓ |
✓ |
|
|
✓ |
✓ |
|
|
✓ |
Sample evaluation set with only request
eval_set = [
{
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
}
]
Sample evaluation set with request
and expected_response
eval_set = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_response": "There's no significant difference.",
}
]
Sample evaluation set with request
, expected_response
, and expected_retrieved_content
eval_set = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_retrieved_context": [
{
"doc_uri": "doc_uri_1",
},
{
"doc_uri": "doc_uri_2",
},
],
"expected_response": "There's no significant difference.",
}
]
Metrics available when application outputs are provided
The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations where you provide a Dataframe with the evaluation set and application outputs. The columns indicate the data included in the evaluation set, and an X
indicates that the metric is supported when that data is provided.
Calculated metrics |
|
|
|
|
---|---|---|---|---|
|
✓ |
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
✓ |
Customer-defined LLM judges |
✓ |
✓ |
✓ |
✓ |
|
✓ |
✓ |
✓ |
|
|
✓ |
✓ |
✓ |
|
|
✓ |
✓ |
||
|
✓ |
✓ |
||
|
✓ |
Sample evaluation set with only request
and response
eval_set = [
{
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
}
]
Sample evaluation set with request
, response
, and retrieved_context
eval_set = [
{
"request_id": "request-id", # optional, but useful for tracking
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
"retrieved_context": [
{
# In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
"content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
"doc_uri": "doc_uri_2_1",
},
{
"content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
"doc_uri": "doc_uri_6_extra",
},
],
}
]
Sample evaluation set with request
, response
, retrieved_context
, and expected_response
eval_set = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_response": "There's no significant difference.",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
"retrieved_context": [
{
# In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
"content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
"doc_uri": "doc_uri_2_1",
},
{
"content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
"doc_uri": "doc_uri_6_extra",
},
],
}
]
Sample evaluation set with request
, response
, retrieved_context
, expected_response
, and expected_retrieved_context
level_4_data = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_retrieved_context": [
{
"doc_uri": "doc_uri_2_1",
},
{
"doc_uri": "doc_uri_2_2",
},
],
"expected_response": "There's no significant difference.",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
"retrieved_context": [
{
# In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
"content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
"doc_uri": "doc_uri_2_1",
},
{
"content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
"doc_uri": "doc_uri_6_extra",
},
],
}
]
Best practices for developing an evaluation set
Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and ability to infer answers from indirect evidence.
Consider testing adversarial scenarios from malicious users.
There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.
Consider including examples that are very challenging, even for humans to answer.
Whether you are building a general-purpose application or targeting a specific domain, your app will likely encounter a wide variety of questions. The evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still consider testing other domains (for example, operations), to ensure that the application does not hallucinate or provide harmful responses.
High-quality, consistent human-generated labels are the best way to ensure that the ground truth values that you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels are the following:
Aggregate responses (labels) from multiple human labelers for the same question.
Ensure that labeling instructions are clear and that the human labelers are consistent.
Ensure that the conditions for the human-labeling process are identical to the format of requests submitted to the RAG application.
Human labelers are by nature noisy and inconsistent, for example due to different interpretations of the question. This is an important part of the process. Using human labeling can reveal interpretations of questions that you had not considered, and that might provide insight into behavior you observe in your application.