Agent Evaluation input schema
Preview
This feature is in Public Preview.
This article explains the input schema required by Agent Evaluation to assess your application’s quality, cost, and latency.
During development, evaluation takes place offline, and an evaluation set is a required input to Agent Evaluation.
When an application is in production, all inputs to Agent Evaluation come from your inference tables or production logs.
The input schema is identical for both online and offline evaluations.
For general information about evaluation sets, see Evaluation sets.
Evaluation input schema
The following table shows Agent Evaluation’s input schema.
| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| `request_id` | string | Unique identifier of the request. | Optional | Optional |
| `request` | See Schema for request. | Input to the application to evaluate: the user's question or query. For example, "What is RAG?". | Required | Required |
| `expected_retrieved_context` | array | Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). See Schema for arrays in evaluation input. | Optional | Optional |
| `expected_response` | string | Ground-truth (correct) answer for the input request. See expected_response guidelines. | Optional | Optional |
| `response` | string | Response generated by the application being evaluated. | Generated by Agent Evaluation | Optional. If not provided, derived from the trace. Either `response` or `trace` must be provided. |
| `retrieved_context` | array | Retrieval results generated by the retriever in the application being evaluated. If the application has multiple retrieval steps, these are the retrieval results from the last step (chronologically in the trace). See Schema for arrays in evaluation input. | Generated by Agent Evaluation | Optional. If not provided, derived from the provided trace. |
| `trace` | JSON string of MLflow Trace | MLflow Trace of the application's execution on the corresponding request. | Generated by Agent Evaluation | Optional. Either `response` or `trace` must be provided. |
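For example, the following sketch shows both modes of supplying these columns with `mlflow.evaluate()` and `model_type="databricks-agent"`. The model URI, response text, and dataset contents are illustrative, not prescribed by the schema:

```python
import mlflow
import pandas as pd

eval_set = pd.DataFrame({
    "request": ["What is RAG?"],
    "expected_response": ["RAG stands for Retrieval-Augmented Generation."],
})

# Mode 1: pass the application as an input argument. Agent Evaluation invokes
# it on each `request` and generates `response`, `retrieved_context`, and
# `trace` itself. The model URI below is illustrative.
results = mlflow.evaluate(
    model="models:/my_agent/1",
    data=eval_set,
    model_type="databricks-agent",
)

# Mode 2: provide previously generated outputs (for example, from inference
# tables or production logs) and omit the application entirely.
eval_set["response"] = ["RAG (Retrieval-Augmented Generation) is ..."]
results = mlflow.evaluate(
    data=eval_set,
    model_type="databricks-agent",
)
```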
expected_response guidelines
The ground-truth `expected_response` should include only the minimal set of facts required for a correct response. If you copy a response from another source, edit it to remove any text that is not required for an answer to be considered correct.
Including only the required information, and leaving out information that is not strictly required, enables Agent Evaluation to provide a more robust signal on output quality.
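As a sketch, compare a response copied verbatim from another source with the trimmed version you would store in `expected_response` (both strings are illustrative):

```python
# A response copied from documentation: correct facts surrounded by filler.
copied_response = (
    "Great question! Broadcast variables allow the programmer to keep a "
    "read-only variable cached on each machine rather than shipping a copy "
    "of it with tasks. Let me know if you have any other questions!"
)

# A better expected_response: only the facts required for correctness.
expected_response = (
    "Broadcast variables keep a read-only variable cached on each machine "
    "instead of shipping a copy of it with tasks."
)
```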
Schema for request
The request schema can be one of the following:

- A plain string. This format supports single-turn conversations only.
- A `messages` field that follows the OpenAI chat completion schema and can encode the full conversation.
- A `query` string field for the most recent request, plus an optional `history` field that encodes previous turns of the conversation.

For multi-turn chat applications, use the second or third option above.
The following example shows all three options in the same `request` column of the evaluation dataset:
```python
import pandas as pd

data = {
    "request": [
        # Plain string
        "What is the difference between reduceByKey and groupByKey in Spark?",

        # Using the `messages` field for a single- or multi-turn chat
        {
            "messages": [
                {
                    "role": "user",
                    "content": "How can you minimize data shuffling in Spark?",
                },
            ],
        },

        # Using the `query` and `history` fields for a single- or multi-turn chat
        {
            "query": "Explain broadcast variables in Spark. How do they enhance performance?",
            "history": [
                {
                    "role": "user",
                    "content": "What are broadcast variables?",
                },
                {
                    "role": "assistant",
                    "content": "Broadcast variables allow the programmer to keep a read-only variable cached on each machine.",
                },
            ],
        },
    ],
    "expected_response": [
        "expected response for first question",
        "expected response for second question",
        "expected response for third question",
    ],
}

eval_dataset = pd.DataFrame(data)
```
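The `query`/`history` form encodes the same conversation as a `messages` list: the history turns come first, and the latest query is the final user message. A small illustrative helper (not part of the Agent Evaluation API) makes that mapping explicit:

```python
def to_messages(request: dict) -> list[dict]:
    # Illustrative helper: prepend the prior turns, then append the latest
    # query as the final user message.
    return [
        *request.get("history", []),
        {"role": "user", "content": request["query"]},
    ]

# Applied to the third example above, this yields the two history turns
# followed by {"role": "user", "content": "Explain broadcast variables ..."}.
```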
Schema for arrays in evaluation input
The schema of the arrays `expected_retrieved_context` and `retrieved_context` is shown in the following table:

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| `content` | string | Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown. | Optional | Optional |
| `doc_uri` | string | Unique identifier (URI) of the parent document from which the chunk came. | Required | Required |
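For example, a single `expected_retrieved_context` value is a list of such objects. The chunk text and URIs below are illustrative:

```python
expected_retrieved_context = [
    {
        # Chunk contents may be plain text, HTML, or Markdown.
        "content": "reduceByKey combines values for each key locally before shuffling.",
        "doc_uri": "docs/spark/rdd-operations.md",  # illustrative URI
    },
    {
        "content": "groupByKey shuffles all key-value pairs across the cluster.",
        "doc_uri": "docs/spark/shuffle.md",  # illustrative URI
    },
]
```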
Computed metrics
The columns in the following table indicate the data included in the input; a ✓ indicates that the metric is supported when that data is provided.
For details about what these metrics measure, see How quality, cost, and latency are assessed by Agent Evaluation.
| Calculated metrics | `request` only | `request` and `expected_response` | `request`, `expected_response`, and `expected_retrieved_context` |
|---|---|---|---|
| `response/llm_judged/relevance_to_query/rating` | ✓ | ✓ | ✓ |
| `response/llm_judged/safety/rating` | ✓ | ✓ | ✓ |
| `response/llm_judged/groundedness/rating` | ✓ | ✓ | ✓ |
| `retrieval/llm_judged/chunk_relevance/precision` | ✓ | ✓ | ✓ |
| `agent/total_token_count` | ✓ | ✓ | ✓ |
| `agent/input_token_count` | ✓ | ✓ | ✓ |
| `agent/output_token_count` | ✓ | ✓ | ✓ |
| `response/llm_judged/correctness/rating` | | ✓ | ✓ |
| `retrieval/llm_judged/context_sufficiency/rating` | | ✓ | ✓ |
| `retrieval/ground_truth/document_recall/average` | | | ✓ |
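After `mlflow.evaluate()` runs, these metrics are available on the returned result object. A minimal sketch, assuming the `eval_dataset` built earlier plus a `response` column, and assuming per-row results are exposed under the `eval_results` table key:

```python
import mlflow

# Evaluate previously generated outputs; pass `model=` instead to have
# Agent Evaluation run the application itself.
results = mlflow.evaluate(
    data=eval_dataset,  # the DataFrame built earlier, plus a `response` column
    model_type="databricks-agent",
)

# Aggregate metrics across the evaluation set.
print(results.metrics)

# Per-row results, including judge ratings and rationales. The `eval_results`
# key is an assumption about the results layout.
print(results.tables["eval_results"])
```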