Building MLflow evaluation datasets

To systematically test and improve a GenAI application, you use an evaluation dataset. An evaluation dataset is a selected set of example inputs — either labeled (with known expected outputs) or unlabeled (without ground-truth answers). Evaluation datasets help you improve your app's performance in the following ways:

  • Improve quality. Test fixes against known problematic examples from production.
  • Prevent regressions. Create a "golden set" of examples that must always work correctly.
  • Compare app versions. Test different prompts, models, or app logic against the same data.
  • Target specific features. Build specialized datasets for safety, domain knowledge, or edge cases.
  • Validate across environments. Test the app in different environments as part of LLMOps.

MLflow evaluation datasets are stored in Unity Catalog, which provides built-in versioning, lineage, sharing, and governance.

Requirements

To create an evaluation dataset, you must have CREATE TABLE permissions on a Unity Catalog schema.

How to build an evaluation dataset

There are three ways to provide evaluation data: an MLflow-managed evaluation dataset, a pandas DataFrame, or a list of dictionaries. This page describes how to create an MLflow-managed evaluation dataset; the other formats are useful for getting started quickly. See MLflow evaluation examples for GenAI for examples.
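For example, a list of dictionaries can be passed directly to an evaluation run without creating a managed dataset first. The following is a minimal sketch assuming MLflow 3's mlflow.genai.evaluate API and the built-in Correctness scorer; my_app is a stand-in for your application's entry point.

Python
import mlflow
from mlflow.genai.scorers import Correctness

# An ad hoc evaluation dataset: each record has inputs and optional expectations.
quick_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_response": "MLflow is an open source platform for managing the ML lifecycle."
        },
    },
]

# Stand-in for your application; it receives the keys of `inputs` as keyword arguments.
def my_app(question: str) -> str:
    return "MLflow is an open source platform for managing the end-to-end ML lifecycle."

mlflow.genai.evaluate(
    data=quick_dataset,
    predict_fn=my_app,
    scorers=[Correctness()],
)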

Step 1: Create a dataset

The first step is to create an MLflow-managed evaluation dataset. MLflow-managed evaluation datasets track changes over time and maintain links to individual evaluation results.

Follow the recording below to use the UI to create an evaluation dataset.

Create evaluation dataset using the UI
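You can also create the dataset programmatically. The following is a minimal sketch assuming MLflow 3's mlflow.genai.datasets API; the catalog, schema, and table name are placeholders, and depending on your MLflow version the parameter may be named uc_table_name instead of name.

Python
import mlflow.genai.datasets

# Placeholder Unity Catalog location; replace with your own catalog, schema, and table name.
uc_table = "catalog.schema.eval_dataset"

# Create an MLflow-managed evaluation dataset backed by a Unity Catalog table.
eval_dataset = mlflow.genai.datasets.create_dataset(name=uc_table)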

Step 2: Add records to your dataset

Approach 1: Create from existing traces

One of the most effective ways to build a relevant evaluation dataset is by curating examples directly from your application's historical interactions captured by MLflow Tracing. You can create datasets from traces using either the MLflow Monitoring UI or the SDK.

Follow the recording below to use the UI to add existing production traces to the dataset.

Add traces to an evaluation dataset using the UI
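You can also curate traces with the SDK. The sketch below assumes that mlflow.search_traces returns a pandas DataFrame of recent traces from the current experiment and that merge_records accepts it directly, as in recent MLflow versions; adjust the search arguments to select the traces you actually want.

Python
import mlflow

# Retrieve recent traces from the current experiment as a pandas DataFrame.
traces = mlflow.search_traces(max_results=50)

# Add the selected traces as records in the dataset created in Step 1.
eval_dataset = eval_dataset.merge_records(traces)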

Approach 2: Create from domain expert labels

Use feedback from domain experts, captured in MLflow Labeling Sessions, to enrich your evaluation datasets with ground-truth labels. Before you start, follow the collect domain expert feedback guide to create a labeling session.

Python
import mlflow.genai.labeling as labeling

# List the available labeling sessions
all_sessions = labeling.get_labeling_sessions()
print(f"Found {len(all_sessions)} sessions")

for session in all_sessions:
    print(f"- {session.name} (ID: {session.labeling_session_id})")
    print(f"  Assigned users: {session.assigned_users}")

# Sync the expert labels from a labeling session into the evaluation dataset.
# uc_schema and evaluation_dataset_table_name are placeholders for your Unity Catalog schema and table name.
all_sessions[0].sync(dataset_name=f"{uc_schema}.{evaluation_dataset_table_name}")

Approach 3: Build from scratch or import existing

You can import an existing dataset or curate examples from scratch. Your data must match (or be transformed to match) the evaluation dataset schema.

Python
# Define comprehensive test cases
evaluation_examples = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_response": "MLflow is an open source platform for managing the end-to-end machine learning lifecycle.",
            "expected_facts": [
                "open source platform",
                "manages ML lifecycle",
                "experiment tracking",
                "model deployment",
            ],
        },
    },
]

# Merge the examples into the dataset created in Step 1.
eval_dataset = eval_dataset.merge_records(evaluation_examples)
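To import examples you already have elsewhere, transform them into the same record structure before merging. This sketch assumes a hypothetical CSV file with question and expected_response columns; map your own columns to the inputs and expectations fields.

Python
import pandas as pd

# Hypothetical CSV with "question" and "expected_response" columns.
existing_df = pd.read_csv("existing_eval_cases.csv")

# Transform each row into the evaluation dataset record structure.
records = [
    {
        "inputs": {"question": row["question"]},
        "expectations": {"expected_response": row["expected_response"]},
    }
    for _, row in existing_df.iterrows()
]

eval_dataset = eval_dataset.merge_records(records)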

Approach 4: Seed using synthetic data

Generating synthetic data can quickly expand test coverage by creating diverse inputs and covering edge cases. See Synthesize evaluation sets.

Step 3: Update existing datasets

As your application evolves, load an existing dataset and merge in new test cases.

Python
import mlflow.genai.datasets
import pandas as pd

# Load existing dataset
dataset = mlflow.genai.datasets.get_dataset(name="catalog.schema.eval_dataset")

# Add new test cases
new_cases = [
    {
        "inputs": {"question": "What are MLflow models?"},
        "expectations": {
            "expected_facts": ["model packaging", "deployment", "registry"],
            "min_response_length": 100
        }
    }
]

# Merge new cases
dataset = dataset.merge_records(new_cases)

Limitations

  • Customer Managed Keys (CMK) are not supported.
  • Maximum of 2,000 rows per evaluation dataset.
  • Maximum of 20 expectations per dataset record.

If you need any of these limitations relaxed for your use case, contact your Databricks representative.
