Evaluation dataset reference
This page describes the evaluation dataset schema and includes links to the SDK reference for some of the most frequently used methods and classes.
For general information and examples of how to use evaluation datasets, see Evaluation harness.
Evaluation dataset schema
Evaluation datasets must use the schema described in this section.
Core fields
The following fields are used both by the evaluation dataset abstraction and when you pass data directly to the evaluation harness.
| Column | Data Type | Description | Required |
|---|---|---|---|
| inputs | dict[str, Any] | Inputs for your app (e.g., user question, context), stored as a JSON-serializable dict. | Yes |
| expectations | dict[str, Any] | Ground truth labels, stored as a JSON-serializable dict. | Optional |
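For example, a record passed directly to the evaluation harness needs only these core fields. The following is a minimal sketch; the key names inside inputs (such as question) are illustrative and depend on your app's signature.

```python
# A minimal dataset record: "inputs" is required, "expectations" is optional.
# The keys inside "inputs" are illustrative; use whatever your app accepts.
record = {
    "inputs": {"question": "What is the capital of France?"},
    "expectations": {"expected_response": "Paris"},
}
```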
expectations reserved keys
expectations has several reserved keys that are used by built-in LLM judges: guidelines, expected_facts, expected_response, and expected_retrieved_context.
| Field | Used by | Description |
|---|---|---|
| expected_facts | Correctness judge | List of facts that should appear |
| expected_response | Correctness judge | Exact or similar expected output |
| guidelines | Guidelines judge | Natural language rules to follow |
| expected_retrieved_context | Retrieval judges | Documents that should be retrieved |
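For example, several reserved keys can be combined in a single record's expectations. The following is an illustrative sketch; the facts, guideline text, and the shape of the expected_retrieved_context entries are assumptions, not required values.

```python
# Illustrative expectations using the reserved keys.
expectations = {
    # Facts the response should contain (checked for correctness).
    "expected_facts": ["Paris is the capital of France."],
    # A reference answer the response should match exactly or closely.
    "expected_response": "The capital of France is Paris.",
    # Natural-language rules the response should follow.
    "guidelines": ["The response must be concise and professional."],
    # Documents that should be retrieved; entry shape is an assumption.
    "expected_retrieved_context": [
        {"doc_uri": "s3://bucket/docs/geography.pdf"}
    ],
}
```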
Additional fields
The following fields are used by the evaluation dataset abstraction to track lineage and version history.
| Column | Data Type | Description | Required |
|---|---|---|---|
| dataset_record_id | string | The unique identifier for the record. | Automatically set if not provided. |
| create_time | timestamp | The time when the record was created. | Automatically set when inserting or updating. |
| created_by | string | The user who created the record. | Automatically set when inserting or updating. |
| last_update_time | timestamp | The time when the record was last updated. | Automatically set when inserting or updating. |
| last_updated_by | string | The user who last updated the record. | Automatically set when inserting or updating. |
| source | struct | The source of the dataset record. See Source field. | Optional |
| tags | dict[str, Any] | Key-value tags for the dataset record. | Optional |
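For example, when inserting a record through the dataset abstraction, you supply only the optional fields you care about; the lineage columns are set automatically. The following is a sketch, and the tag keys are illustrative.

```python
# Only "inputs" must be supplied by the caller. dataset_record_id,
# create_time, created_by, and the other lineage columns are filled in
# automatically on insert. The tag keys here are illustrative.
record = {
    "inputs": {"question": "How do I reset my password?"},
    "expectations": {"expected_response": "Go to account settings and choose Reset password."},
    "tags": {"topic": "account", "review_status": "draft"},
}
```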
Source field
The source field tracks where a dataset record came from. Each record can have only one source type.
Human source: Record created manually by a person
```json
{
  "source": {
    "human": {
      "user_name": "jane.doe@company.com"  # user who created the record
    }
  }
}
```
Document source: Record synthesized from a document
```json
{
  "source": {
    "document": {
      "doc_uri": "s3://bucket/docs/product-manual.pdf",  # URI or path to the source document
      "content": "The first 500 chars of the document..."  # Optional: excerpt or full content from the document
    }
  }
}
```
Trace source: Record created from a production trace
```json
{
  "source": {
    "trace": {
      "trace_id": "tr-abc123def456"  # unique identifier of the source trace
    }
  }
}
```
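Putting the pieces together, a record synthesized from a document carries its source alongside the core fields. The following is a sketch with illustrative values.

```python
# A complete record with a document source. Each record can have only
# one source type (human, document, or trace).
record = {
    "inputs": {"question": "Which firmware version does setup require?"},
    "expectations": {"expected_facts": ["Setup requires firmware v2.1 or later."]},
    "source": {
        "document": {"doc_uri": "s3://bucket/docs/product-manual.pdf"}
    },
}
```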
MLflow evaluation dataset UI
(Screenshot: an evaluation dataset and its records displayed in the MLflow UI.)
MLflow evaluation dataset SDK reference
The evaluation datasets SDK provides programmatic access to create, manage, and use datasets for GenAI app evaluation. For details, see the API reference: mlflow.genai.datasets. Some of the most frequently used methods and classes are the following:
- mlflow.genai.datasets.create_dataset
- mlflow.genai.datasets.get_dataset
- mlflow.genai.datasets.delete_dataset
- EvaluationDataset: This class provides methods to interact with and modify evaluation datasets.
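For example, a typical dataset lifecycle might look like the following. This is a minimal sketch: the table name is a placeholder, the uc_table_name parameter assumes a Unity Catalog-backed dataset, and merge_records is the EvaluationDataset method used here to insert records.

```python
import mlflow.genai.datasets as datasets

# Create a dataset backed by a Unity Catalog table (placeholder name).
dataset = datasets.create_dataset(uc_table_name="catalog.schema.eval_dataset")

# Insert records; lineage columns are populated automatically.
dataset.merge_records(
    [
        {
            "inputs": {"question": "What is the capital of France?"},
            "expectations": {"expected_response": "Paris"},
        }
    ]
)

# Load the dataset later by name, and delete it when no longer needed.
dataset = datasets.get_dataset(uc_table_name="catalog.schema.eval_dataset")
datasets.delete_dataset(uc_table_name="catalog.schema.eval_dataset")
```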