
Concepts & Data Model

MLflow for GenAI provides a comprehensive data model designed specifically for developing, evaluating, and monitoring generative AI applications. This page explains the core concepts and how they work together.

Overview

At its core, MLflow organizes all GenAI application data within Experiments. Think of an experiment as a project folder that contains every trace, evaluation run, app version, prompt, and quality assessment from throughout your app's lifecycle.

1. Data model

note

MLflow only requires you to use traces. All other aspects of the data model are optional, but highly recommended!

2. MLflow provides SDKs for interacting with your app's data to evaluate and improve quality.

3. MLflow provides UIs for managing and using your app's data.

1. Data Model

Below, we provide an overview of each entity in the MLflow data model.

Experiments

An Experiment in MLflow is a named container that organizes and groups together all artifacts related to a single GenAI application. Akin to a project, an experiment ensures that your applications and their data are logically separated.

If you are familiar with MLflow for classic ML, the Experiment container is the same for both classic ML and GenAI.
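
For example, selecting (or creating) the experiment for your app is a one-line call; the experiment name below is illustrative:

```python
import mlflow

# Creates the experiment if it doesn't exist and makes it the active one;
# traces, evaluation runs, prompts, and versions logged afterwards land here.
mlflow.set_experiment("customer-support-agent")
```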

Observability data

Traces

Traces capture the complete execution of your GenAI application, including inputs, outputs, and every intermediate step (LLM calls, retrievals, tool use). Traces:

  • Are created automatically for every execution of your application in development and production
  • Are (optionally) linked to the specific application versions that generated them
  • Have attached assessments that contain:
    • Quality feedback from scorers, end users, and domain experts
    • Ground truth expectations from domain experts

Traces are used to:

  • Observe and debug application behavior and performance (latency, cost, etc)
  • Create evaluation datasets based on production logs to use in quality evaluation

Learn more in the tracing data model reference, follow the quickstart to log your first trace, or follow the instrument your app guide to implement tracing in your app.
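
As a minimal sketch of how traces are produced, the snippet below instruments a toy two-step app with the @mlflow.trace decorator; the function names and the retrieval and generation logic are illustrative placeholders:

```python
import mlflow

mlflow.set_experiment("customer-support-agent")

@mlflow.trace
def retrieve_docs(question: str) -> list[str]:
    # Placeholder retrieval step; appears as a child span in the trace.
    return ["MLflow organizes all GenAI application data within Experiments."]

@mlflow.trace
def answer_question(question: str) -> str:
    docs = retrieve_docs(question)
    # Placeholder generation step; a real app would call an LLM here.
    return f"Based on {len(docs)} retrieved document(s): {docs[0]}"

# Each call logs one trace capturing the inputs, outputs, and nested spans.
answer_question("What is an MLflow Experiment?")
```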

Assessments

Assessments are quality measurements and ground truth labels that are attached to a trace. There are 2 types of assessments:

  1. Feedback: Judgments about the quality of your app's outputs
    • Added by end users, domain experts, or automated scorers
    • Used to identify quality issues
    • Examples
      • End user's thumbs up/down rating
      • LLM judge assessment of a response's correctness
  2. Expectations: Ground truth labels that define the correct output for a given input
    • Added by domain experts
    • Used as the "gold standard" for evaluating if your app produced the right response
    • Examples
      • Expected response to a question
      • Required facts that must be present in a response
note

Ground truth labels (expectations) are NOT required to measure quality with MLflow. Most applications have either no ground truth labels or only a minimal set of them.

Learn more about logging assessments, see how to collect user feedback, or explore using scorers to create automated assessments.
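
To make the two assessment types concrete, here is a hedged sketch that assumes MLflow 3's assessment APIs (mlflow.log_feedback and mlflow.log_expectation); the trace ID, assessment names, and values are illustrative:

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

trace_id = "<an existing trace ID>"  # hypothetical placeholder

# Feedback: a quality judgment about the app's output (an end user's thumbs up).
mlflow.log_feedback(
    trace_id=trace_id,
    name="user_rating",
    value=True,
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN, source_id="user-123"
    ),
)

# Expectation: the ground truth response a domain expert defined for this input.
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="An Experiment is the container that groups all of a GenAI app's data.",
)
```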

Evaluation data

Evaluation Datasets

Evaluation Datasets are curated collections of test cases for systematically testing your application. Evaluation datasets:

  • Are typically created by selecting representative traces from production or development
  • Include inputs and optionally expectations (ground truth)
  • Are versioned over time to track how your test suite evolves

Evaluation datasets are used to:

  • Iteratively evaluate and improve your app's quality
  • Validate changes to prevent regressions in quality

Learn more in the evaluation datasets reference, follow the guide to build evaluation datasets, or see how to use production traces to improve your datasets.
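
To illustrate the shape of this data, an evaluation dataset can be represented as a list of records with inputs and optional expectations, one of the in-memory formats accepted by mlflow.genai.evaluate(); the content below is made up:

```python
# Each record pairs the inputs sent to the app with optional ground truth.
eval_dataset = [
    {
        "inputs": {"question": "What is an MLflow Experiment?"},
        "expectations": {
            "expected_response": "A container that groups all of a GenAI app's data."
        },
    },
    {
        # Expectations are optional; many records will have inputs only.
        "inputs": {"question": "How do traces relate to assessments?"},
    },
]
```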

Evaluation Runs

Evaluation Runs are the results of testing an application version against an evaluation dataset using a set of scorers. Each evaluation run contains the traces produced during the test, along with the feedback assessments attached by the scorers.

Evaluation runs are used to:

  • Determine if application changes improved (or regressed) quality
  • Compare versions of your application side-by-side
  • Track quality evaluations over time
note

Evaluation Runs are a special type of MLflow Run and can be queried via mlflow.search_runs().

Learn more about the evaluation harness, or follow the guide to use evaluation to improve your app.
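
Because evaluation runs are regular MLflow Runs, they can be listed with the standard search API; a minimal sketch, assuming the illustrative experiment name used above:

```python
import mlflow

# Returns a pandas DataFrame with one row per run in the experiment,
# including evaluation runs created by mlflow.genai.evaluate().
runs = mlflow.search_runs(experiment_names=["customer-support-agent"])
print(runs[["run_id", "status", "start_time"]])
```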

Human labeling data

Labeling Sessions

Labeling Sessions organize traces for human review by domain experts. A labeling session groups the traces to be reviewed and presents them to reviewers in the Review App.

Labeling sessions are used to:

  • Collect expert feedback on complex or ambiguous cases
  • Create ground truth data for evaluation datasets
note

Labeling Sessions are a special type of MLflow Run and can be queried via mlflow.search_runs().

Learn more about labeling sessions, follow the guide to collect domain expert feedback, or see how to label during development.

Labeling Schemas

Labeling Schemas define the assessments that are collected in a labeling session, ensuring consistent label collection across domain experts. Labeling schemas:

  • Specify what questions to ask reviewers (e.g., "Is this response accurate?")
  • Define the valid responses to a question (e.g., thumbs up/down, 1-5 scales, free-text comments)

Learn more in the labeling schemas reference or see examples in the Review App guide.
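
As an illustration only (shown as a plain dictionary rather than the actual labeling schema API), a schema captures the question to ask and the responses a reviewer may give:

```python
# Illustrative sketch of what a labeling schema captures; not the real API.
accuracy_schema = {
    "name": "response_accuracy",
    "question": "Is this response accurate?",
    "response_type": "categorical",  # could instead be a 1-5 scale or free text
    "options": ["yes", "no"],
}
```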

Application versioning data

Prompts

Prompts are version-controlled templates for LLM prompts. Prompts:

  • Are tracked with Git-like version history
  • Include {{variables}} for dynamic generation
  • Are linked to evaluation runs to track their quality over time
  • Support aliases like "production" for deployment management
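
A hedged sketch of registering and using a prompt, assuming MLflow's prompt registry APIs (mlflow.genai.register_prompt and mlflow.genai.load_prompt); the prompt name and template are illustrative:

```python
import mlflow

# Register version 1 of a prompt template that uses {{variables}}.
prompt = mlflow.genai.register_prompt(
    name="support-answer",
    template=(
        "Answer the customer's question using only the provided context.\n\n"
        "Question: {{question}}\nContext: {{context}}"
    ),
    commit_message="Initial version",
)

# Later, load a pinned version and fill in the variables at generation time.
loaded = mlflow.genai.load_prompt("prompts:/support-answer/1")
text = loaded.format(question="How do I reset my password?", context="<retrieved docs>")
```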

Logged Models

Logged Models represent snapshots of your application at specific points in time. Logged models:

  • Are linked to the traces they generate and prompts they use
  • Are linked to evaluation runs to track their quality
  • Track application parameters (e.g., LLM temperature)

A logged model can either:

  • Act as a metadata hub, linking a conceptual application version to its specific external code (e.g., a pointer to the Git commit)
  • Package your application's code & config as a fully deployable artifact

Learn more about version tracking, see how to track application versions, or learn about linking traces to versions.
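
As a hedged sketch of the "metadata hub" pattern, assuming MLflow 3's version-tracking API (mlflow.set_active_model); the version name and Git hash are illustrative:

```python
import mlflow

mlflow.set_experiment("customer-support-agent")

# Declare the application version (named here after a Git commit) that
# subsequent traces should be linked to.
mlflow.set_active_model(name="customer-support-agent-git-abc1234")

# Traces produced after this point are associated with that LoggedModel,
# so quality and performance can be compared across versions.
```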

2. SDKs for evaluating quality

These are the key processes that evaluate the quality of traces and attach assessments containing the evaluation's results to each trace.

Scorers

mlflow.genai.scorers.* are functions that evaluate a trace's quality. Scorers:

  • Parse a trace for the relevant data fields to be evaluated
  • Use that data to evaluate quality using either deterministic code or LLM judge based evaluation criteria
  • Return one or more feedback entities with the results of that evaluation

Importantly, the same scorer can be used for evaluation in development AND production.

note

Scorers vs. Judges: If you're familiar with LLM judges, you might wonder how they relate to scorers. In MLflow, a judge is a callable SDK (like mlflow.genai.judge.is_correct) that evaluates text based on specific criteria. However, judges can't directly process traces; they only understand text inputs. That's where scorers come in: they extract the relevant data from a trace (e.g., the request, response, and retrieved context) and pass it to the judge for evaluation. Think of scorers as the "adapter" that connects your traces to evaluation logic, whether that's an LLM judge or custom code.

Learn more about scorers, explore predefined LLM judges, or see how to create custom scorers.
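
Below is a minimal custom scorer sketch using the @scorer decorator and the Feedback entity; the conciseness criterion and the 500-character threshold are illustrative assumptions:

```python
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def is_concise(inputs, outputs) -> Feedback:
    # Deterministic, code-based criterion: flag overly long responses.
    length = len(str(outputs))
    return Feedback(
        value=length <= 500,  # illustrative threshold
        rationale=f"Response length is {length} characters.",
    )
```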

Evaluation in development

mlflow.genai.evaluate() is MLflow's SDK for systematically evaluating the quality of your application. The evaluation harness takes an evaluation dataset, a set of scorers, and your application's prediction function as input and creates an evaluation run that contains traces with feedback assessments by:

  • Running your app for every record in the evaluation dataset, producing traces
  • Running each scorer on the resulting traces to assess quality, producing feedbacks
  • Attaching each feedback to the appropriate trace

The evaluation harness is used to iteratively evaluate potential improvements to your application, helping you:

  • Validate whether a change improved (or regressed) quality
  • Identify additional changes that could further improve quality
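
Putting the pieces together, here is a hedged end-to-end sketch of calling the evaluation harness; it assumes the illustrative answer_question app, eval_dataset, and is_concise scorer sketched in earlier sections:

```python
import mlflow

# Runs answer_question on every record in eval_dataset, scores the resulting
# traces with is_concise, and logs everything to an evaluation run.
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=answer_question,
    scorers=[is_concise],
)
```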

Learn more about the evaluation harness, or follow the guide to evaluate your app.

Evaluating in production

databricks.agents.create_external_monitor() allows you to schedule scorers to automatically evaluate traces from your deployed application. Once a scorer is scheduled, the production monitoring service:

  • Runs the scorers on production traces, producing feedbacks
  • Attaches each feedback to the source trace

Production monitoring is used to detect quality issues quickly and identify problematic queries or use cases to improve in development.

Learn more about production monitoring concepts, or follow the guide to run scorers in production.

3. User Interfaces

Review App

The Review App is a web UI where domain experts label traces with assessments. It presents traces from labeling sessions and collects assessments based on labeling schemas.

Learn more: Review App guide

MLflow Experiment UI

The MLflow Experiment UI provides screens for viewing and managing the entities described above, including traces, evaluation runs, datasets, labeling sessions, prompts, and application versions.

Next Steps