Concepts & Data Model
MLflow for GenAI provides a comprehensive data model designed specifically for developing, evaluating, and monitoring generative AI applications. This page explains the core concepts and how they work together.
Overview
At its core, MLflow organizes all GenAI application data within Experiments. Think of an experiment as a project folder that contains every trace, evaluation run, app version, prompt, and quality assessment from throughout your app's lifecycle.
1. MLflow organizes your app's data within an Experiment:
   - Experiment: Container for a single application's data
     - Observability data
       - Traces: App execution logs
       - Assessments: Quality measurements attached to a trace
     - Evaluation data
       - Evaluation Datasets: Inputs for quality evaluation
       - Evaluation Runs: Results of quality evaluation
     - Human labeling data
       - Labeling Sessions: Queues of traces for human labeling
       - Labeling Schemas: Structured questions to ask labelers
     - Application versioning data
       - Logged Models: App version snapshots
       - Prompts: LLM prompt templates
MLflow only requires you to use traces. All other aspects of the data model are optional, but highly recommended!
2. MLflow provides SDKs for interacting with your app's data to evaluate and improve quality:
- `mlflow.genai.scorers.*`: Functions that analyze a trace's quality, creating feedback assessments
- `mlflow.genai.evaluate()`: SDK for evaluating an app's version using evaluation datasets and scorers to identify and fix quality issues
- `mlflow.genai.add_scheduled_scorer()`: SDK for running scorers on production traces to monitor quality
3. MLflow provides UIs for managing and using your app's data:
- Review App: Web UI for collecting domain expert assessments
- MLflow Experiment UI: UIs for viewing and interacting with traces, evaluation results, labeling sessions, app versions, and prompts.
1. Data Model
Below, we provide an overview of each entity in the MLflow data model.
Experiments
An Experiment in MLflow is a named container that organizes and groups together all artifacts related to a single GenAI application. Experiments, akin to projects, ensure that your applications and their data are logically separated.
If you are familiar with MLflow for classic ML, the Experiment container is the same for both classic ML and GenAI.
Observability data
Traces
Traces capture the complete execution of your GenAI application, including inputs, outputs, and every intermediate step (LLM calls, retrievals, tool use). Traces:
- Are created automatically for every execution of your application in development and production
- Are (optionally) linked to the specific application versions that generated them
- Have attached assessments that contain
- Quality feedback from scorers, end users, and domain experts
- Ground truth expectations from domain experts
Traces are used to:
- Observe and debug application behavior and performance (latency, cost, etc)
- Create evaluation datasets based on production logs to use in quality evaluation
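For example, here is a minimal sketch of instrumenting an app with tracing. It assumes the `openai` SDK and MLflow 3's automatic tracing for OpenAI; the experiment name, model, and question are placeholders.

```python
import mlflow
from openai import OpenAI

mlflow.set_experiment("my-genai-app")   # traces are logged to this experiment
mlflow.openai.autolog()                 # auto-trace every OpenAI call

@mlflow.trace  # capture this function's inputs, outputs, and latency as the root span
def answer_question(question: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer_question("What is MLflow Tracing?")  # produces one trace with nested LLM spans
```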
Learn more in the tracing data model reference, follow the quickstart to log your first trace, or follow the instrument your app guide to implement tracing in your app.
Assessments
Assessments are quality measurements and ground truth labels that are attached to a trace. There are 2 types of assessments:
- Feedback: Judgments about the quality of your app's outputs
- Added by end users, domain experts, or automated scorers
- Used to identify quality issues
- Examples
- End user's thumbs up/down rating
- LLM judge assessment of a response's correctness
- Expectations: Ground truth labels that define the correct output for a given input
- Added by domain experts
- Used as the "gold standard" for evaluating if your app produced the right response
- Examples
- Expected response to a question
- Required facts that must be present in a response
Ground truth labels (expectations) are NOT required to measure quality with MLflow. Most applications will have no ground truth labels, or only a minimal set.
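As an illustrative sketch, feedback and expectations can be attached to a trace programmatically. This assumes MLflow 3's assessment APIs; the trace ID, names, and values are placeholders.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

trace_id = "tr-1234567890"  # placeholder: the trace you want to annotate

# Feedback: an end user's thumbs up/down on the response
mlflow.log_feedback(
    trace_id=trace_id,
    name="user_satisfaction",
    value=True,
    source=AssessmentSource(source_type=AssessmentSourceType.HUMAN, source_id="user-123"),
)

# Expectation: a domain expert's ground-truth label for the same request
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_facts",
    value=["Traces capture inputs, outputs, and intermediate steps"],
    source=AssessmentSource(source_type=AssessmentSourceType.HUMAN, source_id="expert-456"),
)
```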
Learn more about logging assessments, see how to collect user feedback, or explore using scorers to create automated assessments.
Evaluation data
Evaluation Datasets
Evaluation Datasets are curated collections of test cases for systematically testing your application. Evaluation datasets:
- Are typically created by selecting representative traces from production or development
- Include inputs and optionally expectations (ground truth)
- Are versioned over time to track how your test suite evolves
Evaluation datasets are used to:
- Iteratively evaluate and improve your app's quality
- Validate changes to prevent regressions in quality
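A dataset can be as simple as a list of records, each with `inputs` and optional `expectations`. The field names below follow the conventions used by `mlflow.genai.evaluate()`; the content itself is purely illustrative.

```python
# Illustrative evaluation records: "inputs" feed your app, "expectations" hold
# optional ground truth used by scorers such as correctness checks.
eval_records = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "expectations": {"expected_facts": ["captures inputs, outputs, and intermediate steps"]},
    },
    {
        # Ground truth is optional: many records will only have inputs
        "inputs": {"question": "How do I attach feedback to a trace?"},
    },
]
```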
Learn more in the evaluation datasets reference, follow the guide to build evaluation datasets, or see how to use production traces to improve your datasets.
Evaluation Runs
Evaluation Runs are the results of testing an application version against an evaluation dataset using a set of scorers. Evaluation runs:
- Contain the traces (and their assessments) generated by evaluation
- Contain aggregated metrics based on the assessments
Evaluation runs are used to:
- Determine if application changes improved (or regressed) quality
- Compare versions of your application side-by-side
- Track quality evaluations over time
Evaluation Runs are a special type of MLflow Run and can be queried via `mlflow.search_runs()`.
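For example, a quick sketch of listing runs in an experiment (the experiment name is a placeholder):

```python
import mlflow

# Returns a pandas DataFrame of runs, including evaluation runs,
# in the named experiment.
runs = mlflow.search_runs(experiment_names=["my-genai-app"])
print(runs[["run_id", "status", "start_time"]])
```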
Learn more about the evaluation harness, or follow the guide to use evaluation to improve your app.
Human labeling data
Labeling Sessions
Labeling Sessions organize traces for human review by domain experts. Labeling sessions:
- Queue selected traces that need expert review and contain the assessments from that review
- Use labeling schemas to structure the assessments experts will label
Labeling sessions are used to:
- Collect expert feedback on complex or ambiguous cases
- Create ground truth data for evaluation datasets
Labeling Sessions are a special type of MLflow Run and can be queried via `mlflow.search_runs()`.
Learn more about labeling sessions, follow the guide to collect domain expert feedback, or see how to label during development.
Labeling Schemas
Labeling Schemas define the assessments that are collected in a labeling session, ensuring consistent label collection across domain experts. Labeling schemas:
- Specify what questions to ask reviewers (e.g., "Is this response accurate?", etc)
- Define the valid responses to a question (e.g., thumbs up/down, 1-5 scales, free text comments, etc)
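As a rough sketch, a schema and a session might be created as follows. This assumes the `mlflow.genai.label_schemas` and `mlflow.genai.labeling` helpers available on Databricks, and the parameter names shown are assumptions; check the labeling guides for the current signatures.

```python
import mlflow
from mlflow.genai import label_schemas, labeling

# Define what reviewers are asked (parameter names are assumptions)
label_schemas.create_label_schema(
    name="accuracy",
    type="feedback",
    title="Is this response accurate?",
    input=label_schemas.InputCategorical(options=["yes", "no"]),
)

# Queue traces for expert review against that schema
session = labeling.create_labeling_session(
    name="expert-review-week-1",
    label_schemas=["accuracy"],
)
session.add_traces(mlflow.search_traces(max_results=10))
```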
Learn more in the labeling schemas reference or see examples in the Review App guide.
Application versioning data
Prompts
Prompts are version-controlled prompt templates for your LLM calls. Prompts:
- Are tracked with Git-like version history
- Include `{{variables}}` for dynamic generation
- Are linked to evaluation runs to track their quality over time
- Support aliases like "production" for deployment management
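A minimal sketch, assuming MLflow 3's prompt registry APIs (the prompt name and template are placeholders):

```python
import mlflow

# Register a new version of a prompt template; {{question}} is filled in at runtime
mlflow.genai.register_prompt(
    name="support-answer",
    template="Answer the customer's question concisely:\n\n{{question}}",
)

# Later, load a specific version (or an alias such as "production") and render it
prompt = mlflow.genai.load_prompt("prompts:/support-answer/1")
text = prompt.format(question="How do I reset my password?")
```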
Logged Models
Logged Models represent snapshots of your application at specific points in time. Logged models:
- Are linked to the traces they generate and prompts they use
- Are linked to evaluation runs to track their quality
- Track application parameters (e.g., LLM temperature, etc)
A logged model can either:
- Act as a metadata hub, linking a conceptual application version to its specific external code (e.g., a pointer to the Git commit)
- Package your application's code & config as a fully deployable artifact
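Here is a sketch of the metadata-hub pattern, assuming MLflow 3's version-tracking helpers; the model name and parameters are placeholders.

```python
import mlflow

# Declare the app version; traces produced afterwards are linked to this LoggedModel
mlflow.set_active_model(name="email-assistant-v1.2.0")

# Record the configuration that defines this version
mlflow.log_model_params(params={"llm": "gpt-4o-mini", "temperature": "0.1"})
```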
Learn more about version tracking, see how to track application versions, or learn about linking traces to versions.
2. SDKs for evaluating quality
These are the key processes that evaluate the quality of traces and attach assessments containing the evaluation's results to each trace.
Scorers
`mlflow.genai.scorers.*` are functions that evaluate a trace's quality. Scorers:
- Parse a trace for the relevant data fields to be evaluated
- Use that data to evaluate quality with either deterministic code or LLM-judge-based evaluation criteria
- Return one or more feedback entities with the results of that evaluation
Importantly, the same scorer can be used for evaluation in development AND production.
Scorers vs. Judges: If you're familiar with LLM judges, you might wonder how they relate to scorers. In MLflow, a judge is a callable SDK (like `mlflow.genai.judge.is_correct`) that evaluates text based on specific criteria. However, judges can't directly process traces - they only understand text inputs. That's where scorers come in: they extract the relevant data from a trace (e.g., the request, response, and retrieved context) and pass it to the judge for evaluation. Think of scorers as the "adapter" that connects your traces to evaluation logic, whether that's an LLM judge or custom code.
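For example, a custom code-based scorer might look like the following sketch, using the `@scorer` decorator; the conciseness criterion itself is just an illustration.

```python
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def is_concise(outputs) -> Feedback:
    # Extract what we need from the evaluated output and return a Feedback
    word_count = len(str(outputs).split())
    return Feedback(
        value=word_count <= 150,
        rationale=f"Response is {word_count} words long.",
    )
```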
Learn more about scorers, explore predefined LLM judges, or see how to create custom scorers.
Evaluation in development
`mlflow.genai.evaluate()` is MLflow's SDK for systematically evaluating the quality of your application. The evaluation harness takes an evaluation dataset, a set of scorers, and your application's prediction function as input and creates an evaluation run that contains traces with feedback assessments by:
- Running your app for every record in the evaluation dataset, producing traces
- Running each scorer on the resulting traces to assess quality, producing feedbacks
- Attaching each feedback to the appropriate trace
The evaluation harness is used to iteratively evaluate potential improvements to your application, helping you:
- Validate whether a change improved (or regressed) quality
- Identify additional opportunities to further improve quality
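Putting the pieces together, here is a sketch of an evaluation call. It reuses the illustrative `eval_records`, `answer_question`, and `is_concise` from the earlier sketches and assumes the predefined `Safety` scorer is available.

```python
import mlflow
from mlflow.genai.scorers import Safety  # a predefined LLM-judge scorer

results = mlflow.genai.evaluate(
    data=eval_records,               # evaluation dataset (see above)
    predict_fn=answer_question,      # runs the app once per record's inputs
    scorers=[Safety(), is_concise],  # LLM judge + custom code scorer
)
```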
Learn more about the evaluation harness, or follow the guide to evaluate your app.
Evaluating in production
`databricks.agents.create_external_monitor()` allows you to schedule scorers to automatically evaluate traces from your deployed application. Once a scorer is scheduled, the production monitoring service:
- Runs the scorers on production traces, producing feedbacks
- Attaches each feedback to the source trace
Production monitoring is used to detect quality issues quickly and identify problematic queries or use cases to improve in development.
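As a rough sketch of scheduling a scorer on production traces: the function name comes from this page, but the parameter names are assumptions, so consult the production monitoring guide for the exact signature.

```python
import mlflow
from mlflow.genai.scorers import Safety

# Score a sample of production traces with the Safety scorer
# (parameter names below are assumptions)
mlflow.genai.add_scheduled_scorer(
    experiment_id="1234567890",
    scheduled_scorer_name="safety",
    scorer=Safety(),
    sample_rate=0.2,  # score roughly 20% of production traces
)
```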
Learn more about production monitoring concepts, or follow the guide to run scorers in production.
3. User Interfaces
Review App
The Review App is a web UI where domain experts label traces with assessments. It presents traces from labeling sessions and collects assessments based on labeling schemas.
Learn more: Review App guide
MLflow Experiment UI
The MLflow Experiment UI provides screens for:
- Viewing and searching traces
- Reviewing feedback and expectations on traces
- Analyzing evaluation results
- Managing evaluation datasets
- Managing versions and prompts
Next Steps
- Get Started: Follow the quickstart guide to trace your first application
- Deep Dive: Explore detailed guides for tracing, evaluation, or human feedback