
Building MLflow Evaluation Datasets

This guide shows you how to create evaluation datasets that let you systematically test and improve your GenAI application's quality. You'll learn multiple approaches to building datasets that enable consistent, repeatable evaluation as you iterate on your app.

Evaluation datasets help you:

  • Fix known issues: Add problematic examples from production to repeatedly test fixes
  • Prevent regressions: Create a "golden set" of examples that must always work correctly
  • Compare versions: Test different prompts, models, or app logic against the same data
  • Target specific features: Build specialized datasets for safety, domain knowledge, or edge cases

Start with a single well-curated dataset, then expand to multiple datasets as your testing needs grow.

What you'll learn:

  • Create datasets from production traces to test real-world scenarios
  • Build datasets from scratch for targeted testing of specific features
  • Import existing evaluation data from CSV, JSON, or other formats
  • Generate synthetic test data to expand coverage
  • Add ground truth labels from domain expert feedback
note

This guide shows you how to use MLflow-managed evaluation datasets, which provide version history and lineage tracking. For rapid prototyping, you can also provide your evaluation dataset as a Python dictionary or a Pandas or Spark DataFrame that follows the same schema as the MLflow-managed dataset. To learn more about the evaluation dataset schema, refer to the evaluation datasets reference page.

Prerequisites

  1. Install MLflow and required packages

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0"
  2. Create an MLflow experiment by following the set up your environment quickstart.

  3. Access to a Unity Catalog schema with CREATE TABLE permissions to create evaluation datasets.

    note

    If you're using a Databricks trial account, you have CREATE TABLE permissions on the Unity Catalog schema workspace.default.

Approaches to Building Your Dataset

MLflow offers several flexible ways to construct an evaluation dataset tailored to your needs: curating existing production traces, syncing domain expert labels, building records from scratch or importing existing data, and generating synthetic test data.

Choose the method or combination of methods that best suits your current data sources and evaluation goals.

Step 1: Create a dataset

Regardless of the method you choose, you must first create an MLflow-managed evaluation dataset. This approach lets you track changes to the dataset over time and link individual evaluation results to it.

Follow the recording below to use the UI to create an evaluation dataset.

[Recording: creating an evaluation dataset in the UI]
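
You can also create the dataset with the SDK. The following is a minimal sketch, assuming MLflow 3's mlflow.genai.datasets.create_dataset API; the catalog, schema, and table names are placeholders, and the uc_schema, evaluation_dataset_table_name, and eval_dataset variables it defines are reused in the examples that follow.

Python
import mlflow.genai.datasets as datasets

# Placeholder Unity Catalog location -- replace with your own schema and table name
uc_schema = "workspace.default"
evaluation_dataset_table_name = "my_eval_dataset"

# Create an MLflow-managed evaluation dataset backed by a Unity Catalog table
eval_dataset = datasets.create_dataset(
    uc_table_name=f"{uc_schema}.{evaluation_dataset_table_name}",
)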

Step 2: Add records to your dataset

Approach 1: Create from existing traces

One of the most effective ways to build a relevant evaluation dataset is by curating examples directly from your application's historical interactions captured by MLflow Tracing. You can create datasets from traces using either the MLflow Monitoring UI or the SDK.

Follow the recording below to use the UI to add existing production traces to the dataset.

[Recording: adding production traces to the dataset in the UI]
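
The same can be done with the SDK. This sketch assumes that mlflow.search_traces returns a Pandas DataFrame of traces from the current experiment and that merge_records accepts it directly; the filter string and result limit are placeholders.

Python
import mlflow

# Pull recent successful traces from the current experiment (placeholder filter)
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    max_results=50,
)

# Add the selected traces to the evaluation dataset created in Step 1
eval_dataset.merge_records(traces)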

Approach 2: Create from domain expert labels

Leverage feedback from domain experts captured in MLflow Labeling Sessions to enrich your evaluation datasets with ground truth labels. Before doing these steps, follow the collect domain expert feedback guide to create a labeling session.

Python
import mlflow.genai.labeling as labeling

# List all labeling sessions
all_sessions = labeling.get_labeling_sessions()
print(f"Found {len(all_sessions)} sessions")

for session in all_sessions:
    print(f"- {session.name} (ID: {session.labeling_session_id})")
    print(f"  Assigned users: {session.assigned_users}")

# Sync expert labels from the labeling session into the evaluation dataset
all_sessions[0].sync(dataset_name=f"{uc_schema}.{evaluation_dataset_table_name}")

Approach 3: Build from scratch or import existing

You can import an existing dataset or curate examples from scratch. Your data must match (or be transformed to match) the evaluation dataset schema.

Python
# Define comprehensive test cases
evaluation_examples = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expected": {
            "expected_response": "MLflow is an open source platform for managing the end-to-end machine learning lifecycle.",
            "expected_facts": [
                "open source platform",
                "manages ML lifecycle",
                "experiment tracking",
                "model deployment",
            ],
        },
    },
]

eval_dataset.merge_records(evaluation_examples)
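
To import existing data, transform it into the same inputs/expected structure before calling merge_records. Below is a sketch for a hypothetical eval_cases.csv file with question and expected_answer columns.

Python
import pandas as pd

# Hypothetical CSV with `question` and `expected_answer` columns
df = pd.read_csv("eval_cases.csv")

# Map each row to the evaluation dataset schema used above
records = [
    {
        "inputs": {"question": row["question"]},
        "expected": {"expected_response": row["expected_answer"]},
    }
    for _, row in df.iterrows()
]

eval_dataset.merge_records(records)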

Approach 4: Seed using synthetic data

Generating synthetic data can expand your testing efforts by quickly creating diverse inputs and covering edge cases. To learn more, visit the synthesize evaluation datasets reference.
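
As a rough sketch only, assuming the databricks-agents generate_evals_df helper covered in that reference and a small DataFrame of source documents with content and doc_uri columns:

Python
import pandas as pd
from databricks.agents.evals import generate_evals_df

# Assumed: a small corpus of source documents to ground the synthetic questions
docs = pd.DataFrame(
    [
        {
            "content": "MLflow is an open source platform for managing the ML lifecycle.",
            "doc_uri": "https://mlflow.org/docs/latest/index.html",
        }
    ]
)

# Generate synthetic evaluation rows from the documents
synthetic_evals = generate_evals_df(
    docs,
    num_evals=10,
    agent_description="A chatbot that answers questions about MLflow.",
)

# Review the generated rows, then map them to the inputs/expected schema
# and add them with eval_dataset.merge_records(...)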

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for concepts and features mentioned in this guide.