Concepts & Data Model
MLflow for GenAI provides a comprehensive data model designed specifically for developing, evaluating, and monitoring generative AI applications. This page explains the core concepts and how they work together.
Overview
At its core, MLflow organizes all GenAI application data within Experiments. Think of an experiment as a project folder that contains every trace, evaluation run, app version, prompt, and quality assessment from throughout your app's lifecycle.
1. MLflow organizes your app's data within an Experiment:
   - Experiment: Container for a single application's data
     - Observability data
       - Traces: App execution logs
       - Assessments: Quality measurements attached to a trace
     - Evaluation data
       - Evaluation Datasets: Inputs for quality evaluation
       - Evaluation Runs: Results of quality evaluation
     - Human labeling data
       - Labeling Sessions: Queues of traces for human labeling
       - Labeling Schemas: Structured questions to ask labelers
     - Application versioning data
       - Logged Models: App version snapshots
       - Prompts: LLM prompt templates
MLflow only requires you to use traces. All other aspects of the data model are optional, but highly recommended!
2. MLflow provides SDKs for interacting with your app's data to evaluate and improve quality:
- `mlflow.genai.scorers.*`: Functions that analyze a trace's quality, creating feedback assessments
- `mlflow.genai.evaluate()`: SDK for evaluating an app's version using evaluation datasets and scorers to identify and fix quality issues
- `mlflow.genai.add_scheduled_scorer()`: SDK for running scorers on production traces to monitor quality
3. MLflow provides UIs for managing and using your app's data:
- Review App: Web UI for collecting domain expert assessments
- MLflow Experiment UI: UIs for viewing and interacting with traces, evaluation results, labeling sessions, app versions, and prompts.
1. Data Model
Below, we provide an overview of each entity in the MLflow data model.
Experiments
An Experiment in MLflow is a named container that organizes and groups together all artifacts related to a single GenAI application. Experiments, akin to projects, ensure that your applications and their data are logically separated.
If you are familiar with MLflow for classic ML, the Experiment container is the same for both classic ML and GenAI.
Observability data
Traces
Traces capture the complete execution of your GenAI application, including inputs, outputs, and every intermediate step (LLM calls, retrievals, tool use). Traces:
- Are created automatically for every execution of your application in development and production
- Are (optionally) linked to the specific application versions that generated them
- Have attached assessments that contain
- Quality feedback from scorers, end users, and domain experts
- Ground truth expectations from domain experts
Traces are used to:
- Observe and debug application behavior and performance (latency, cost, etc)
- Create evaluation datasets based on production logs to use in quality evaluation
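For example, here is a minimal sketch of instrumenting an app with tracing. It assumes the `openai` SDK and MLflow 3's automatic tracing for OpenAI; the experiment name, model, and question are placeholders.

```python
import mlflow
from openai import OpenAI

mlflow.set_experiment("my-genai-app")   # traces are logged to this experiment
mlflow.openai.autolog()                 # auto-trace every OpenAI call

@mlflow.trace  # capture this function's inputs, outputs, and latency as the root span
def answer_question(question: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer_question("What is MLflow Tracing?")  # produces one trace with nested LLM spans
```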
Learn more in the tracing data model reference, follow the quickstart to log your first trace, or follow the instrument your app guide to implement tracing in your app.
Assessments
Assessments are quality measurements and ground truth labels that are attached to a trace. There are 2 types of assessments:
- Feedback: Judgments about the quality of your app's outputs
- Added by end users, domain experts, or automated scorers
- Used to identify quality issues
- Examples
- End user's thumbs up/down rating
- LLM judge assessment of a response's correctness
- Expectations: Ground truth labels that define the correct output for a given input
- Added by domain experts
- Used as the "gold standard" for evaluating if your app produced the right response
- Examples
- Expected response to a question
- Required facts that must be present in a response
Ground truth labels (expectations) are NOT required to measure quality with MLflow. Most applications will have no ground truth labels, or only a minimal set.
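As an illustrative sketch, feedback and expectations can be attached to a trace programmatically. This assumes MLflow 3's assessment APIs; the trace ID, names, and values are placeholders.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

trace_id = "tr-1234567890"  # placeholder: the trace you want to annotate

# Feedback: an end user's thumbs up/down on the response
mlflow.log_feedback(
    trace_id=trace_id,
    name="user_satisfaction",
    value=True,
    source=AssessmentSource(source_type=AssessmentSourceType.HUMAN, source_id="user-123"),
)

# Expectation: a domain expert's ground-truth label for the same request
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_facts",
    value=["Traces capture inputs, outputs, and intermediate steps"],
    source=AssessmentSource(source_type=AssessmentSourceType.HUMAN, source_id="expert-456"),
)
```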
Learn more about logging assessments, see how to collect user feedback, or explore using scorers to create automated assessments.
Evaluation data
Evaluation Datasets
Evaluation Datasets are curated collections of test cases for systematically testing your application. Evaluation datasets:
- Are typically created by selecting representative traces from production or development
- Include inputs and optionally expectations (ground truth)
- Are versioned over time to track how your test suite evolves
Evaluation datasets are used to:
- Iteratively evaluate and improve your app's quality
- Validate changes to prevent regressions in quality
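A dataset can be as simple as a list of records, each with `inputs` and optional `expectations`. The field names below follow the conventions used by `mlflow.genai.evaluate()`; the content itself is purely illustrative.

```python
# Illustrative evaluation records: "inputs" feed your app, "expectations" hold
# optional ground truth used by scorers such as correctness checks.
eval_records = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "expectations": {"expected_facts": ["captures inputs, outputs, and intermediate steps"]},
    },
    {
        # Ground truth is optional: many records will only have inputs
        "inputs": {"question": "How do I attach feedback to a trace?"},
    },
]
```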
Learn more in the evaluation datasets reference, follow the guide to build evaluation datasets, or see how to use production traces to improve your datasets.
Evaluation Runs
Evaluation Runs are the results of testing an application version against an evaluation dataset using a set of scorers. Evaluation runs:
- Contain the traces (and their assessments) generated by evaluation
- Contain aggregated metrics based on the assessments
Evaluation runs are used to:
- Determine if application changes improved (or regressed) quality
- Compare versions of your application side-by-side
- Track quality evaluations over time
Evaluation Runs are a special type of MLflow Run and can be queried via `mlflow.search_runs()`.
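For example, a quick sketch of listing runs in an experiment (the experiment name is a placeholder):

```python
import mlflow

# Returns a pandas DataFrame of runs, including evaluation runs,
# in the named experiment.
runs = mlflow.search_runs(experiment_names=["my-genai-app"])
print(runs[["run_id", "status", "start_time"]])
```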
Learn more about the evaluation harness, or follow the guide to use evaluation to improve your app.
Human labeling data
Labeling Sessions
Labeling Sessions organize traces for human review by domain experts. Labeling sessions:
- Queue selected traces that need expert review and contain the assessments from that review
- Use labeling schemas to structure the assessments experts will label
Labeling sessions are used to:
- Collect expert feedback on complex or ambiguous cases
- Create ground truth data for evaluation datasets
Labeling Sessions are a special type of MLflow Run and can be queried via `mlflow.search_runs()`.
Learn more about labeling sessions, follow the guide to collect domain expert feedback, or see how to label during development.
Labeling Schemas
Labeling Schemas define the assessments that are collected in a labeling session, ensuring consistent label collection across domain experts. Labeling schemas:
- Specify what questions to ask reviewers (e.g., "Is this response accurate?", etc)
- Define the valid responses to a question (e.g., thumbs up/down, 1-5 scales, free text comments, etc)
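As a rough sketch, a schema and a session might be created as follows. This assumes the `mlflow.genai.label_schemas` and `mlflow.genai.labeling` helpers available on Databricks, and the parameter names shown are assumptions; check the labeling guides for the current signatures.

```python
import mlflow
from mlflow.genai import label_schemas, labeling

# Define what reviewers are asked (parameter names are assumptions)
label_schemas.create_label_schema(
    name="accuracy",
    type="feedback",
    title="Is this response accurate?",
    input=label_schemas.InputCategorical(options=["yes", "no"]),
)

# Queue traces for expert review against that schema
session = labeling.create_labeling_session(
    name="expert-review-week-1",
    label_schemas=["accuracy"],
)
session.add_traces(mlflow.search_traces(max_results=10))
```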
Learn more in the labeling schemas reference or see examples in the Review App guide.
Application versioning data
Prompts
Prompts are version-controlled prompt templates for your LLM calls. Prompts:
- Are tracked with Git-like version history
- Include `{{variables}}` for dynamic generation
- Are linked to evaluation runs to track their quality over time
- Support aliases like "production" for deployment management
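A minimal sketch, assuming MLflow 3's prompt registry APIs (the prompt name and template are placeholders):

```python
import mlflow

# Register a new version of a prompt template; {{question}} is filled in at runtime
mlflow.genai.register_prompt(
    name="support-answer",
    template="Answer the customer's question concisely:\n\n{{question}}",
)

# Later, load a specific version (or an alias such as "production") and render it
prompt = mlflow.genai.load_prompt("prompts:/support-answer/1")
text = prompt.format(question="How do I reset my password?")
```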
Logged Models
Logged Models represent snapshots of your application at specific points in time. Logged models:
- Are linked to the traces they generate and prompts they use
- Are linked to evaluation runs to track their quality
- Track application parameters (e.g., LLM temperature, etc)
A logged model can either:
- Act as a metadata hub, linking a conceptual application version to its specific external code (e.g., a pointer to the Git commit)
- Package your application's code & config as a fully deployable artifact
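Here is a sketch of the metadata-hub pattern, assuming MLflow 3's version-tracking helpers; the model name and parameters are placeholders.

```python
import mlflow

# Declare the app version; traces produced afterwards are linked to this LoggedModel
mlflow.set_active_model(name="email-assistant-v1.2.0")

# Record the configuration that defines this version
mlflow.log_model_params(params={"llm": "gpt-4o-mini", "temperature": "0.1"})
```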
Learn more about version tracking, see how to track application versions, or learn about linking traces to versions.
2. SDKs for evaluating quality
These are the key processes that evaluate the quality of traces and attach assessments containing the evaluation's results to each trace.
Scorers
`mlflow.genai.scorers.*` are functions that evaluate a trace's quality. Scorers:
- Parse a trace for the relevant data fields to be evaluated
- Use that data to evaluate quality with either deterministic code or LLM-judge-based evaluation criteria
- Return one or more feedback entities with the results of that evaluation
Importantly, the same scorer can be used for evaluation in development AND production.
Scorers vs. Judges: If you're familiar with LLM judges, you might wonder how they relate to scorers. In MLflow, a judge is a callable SDK (like `mlflow.genai.judge.is_correct`) that evaluates text based on specific criteria. However, judges can't directly process traces - they only understand text inputs. That's where scorers come in: they extract the relevant data from a trace (e.g., the request, response, and retrieved context) and pass it to the judge for evaluation. Think of scorers as the "adapter" that connects your traces to evaluation logic, whether that's an LLM judge or custom code.
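For example, a custom code-based scorer might look like the following sketch, using the `@scorer` decorator; the conciseness criterion itself is just an illustration.

```python
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def is_concise(outputs) -> Feedback:
    # Extract what we need from the evaluated output and return a Feedback
    word_count = len(str(outputs).split())
    return Feedback(
        value=word_count <= 150,
        rationale=f"Response is {word_count} words long.",
    )
```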
Learn more about scorers, explore predefined LLM judges, or see how to create custom scorers.
Evaluation in development
`mlflow.genai.evaluate()` is MLflow's SDK for systematically evaluating the quality of your application. The evaluation harness takes an evaluation dataset, a set of scorers, and your application's prediction function as input and creates an evaluation run that contains traces with feedback assessments by:
- Running your app for every record in the evaluation dataset, producing traces
- Running each scorer on the resulting traces to assess quality, producing feedbacks
- Attaching each feedback to the appropriate trace
The evaluation harness is used to iteratively evaluate potential improvements to your application, helping you:
- Validate whether a change improved (or regressed) quality
- Identify additional opportunities to further improve quality
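Putting the pieces together, here is a sketch of an evaluation call. It reuses the illustrative `eval_records`, `answer_question`, and `is_concise` from the earlier sketches and assumes the predefined `Safety` scorer is available.

```python
import mlflow
from mlflow.genai.scorers import Safety  # a predefined LLM-judge scorer

results = mlflow.genai.evaluate(
    data=eval_records,               # evaluation dataset (see above)
    predict_fn=answer_question,      # runs the app once per record's inputs
    scorers=[Safety(), is_concise],  # LLM judge + custom code scorer
)
```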
Learn more about the evaluation harness, or follow the guide to evaluate your app.
Evaluating in production
`databricks.agents.create_external_monitor()` allows you to schedule scorers to automatically evaluate traces from your deployed application. Once a scorer is scheduled, the production monitoring service:
- Runs the scorers on production traces, producing feedbacks
- Attaches each feedback to the source trace
Production monitoring is used to detect quality issues quickly and identify problematic queries or use cases to improve in development.
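As a rough sketch of scheduling a scorer on production traces: the function name comes from this page, but the parameter names are assumptions, so consult the production monitoring guide for the exact signature.

```python
import mlflow
from mlflow.genai.scorers import Safety

# Score a sample of production traces with the Safety scorer
# (parameter names below are assumptions)
mlflow.genai.add_scheduled_scorer(
    experiment_id="1234567890",
    scheduled_scorer_name="safety",
    scorer=Safety(),
    sample_rate=0.2,  # score roughly 20% of production traces
)
```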
Learn more about production monitoring concepts, or follow the guide to run scorers in production.
3. User Interfaces
Review App
The Review App is a web UI where domain experts label traces with assessments. It presents traces from labeling sessions and collects assessments based on labeling schemas.
Learn more: Review App guide
MLflow Experiment UI
The MLflow Experiment UI provides screens for:
- Viewing and searching traces
- Reviewing feedback and expectations on traces
- Analyzing evaluation results
- Managing evaluation datasets
- Managing versions and prompts
Next Steps
- Get Started: Follow the quickstart guide to trace your first application
- Deep Dive: Explore detailed guides for tracing, evaluation, or human feedback