Evaluation and monitoring reference

This page provides reference documentation for MLflow evaluation and monitoring concepts. For guides and tutorials, see Evaluation and monitoring.

API Reference

For MLflow 3 evaluation and monitoring API documentation, see API Reference.

Quick reference

| Concept | Purpose | Usage |
|---|---|---|
| Scorers | Evaluate trace quality | @scorer decorator or Scorer class |
| Judges | LLM-based assessment | Wrapped in scorers for use |
| Evaluation Harness | Run offline evaluation | mlflow.genai.evaluate() |
| Evaluation Datasets | Test data management | mlflow.genai.datasets |
| Evaluation Runs | Store evaluation results | Created by harness |
| Production Monitoring | Live quality tracking | Scorer.register, Scorer.start |

Scorers: mlflow.genai.scorers

Functions that evaluate traces and return Feedback.

Python
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
from typing import Optional, Dict, Any, List

@scorer
def my_custom_scorer(
    *,  # MLflow calls your scorer with named arguments
    inputs: Optional[Dict[Any, Any]],        # App's input from trace
    outputs: Optional[Dict[Any, Any]],       # App's output from trace
    expectations: Optional[Dict[str, Any]],  # Ground truth (offline only)
    trace: Optional[mlflow.entities.Trace],  # Complete trace
) -> int | float | bool | str | Feedback | List[Feedback]:
    # Your evaluation logic
    return Feedback(value=True, rationale="Explanation")
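
Once defined, the custom scorer can be passed to the evaluation harness like any built-in scorer; a minimal sketch, where eval_dataset and my_app are hypothetical placeholders for your test data and app:

Python
import mlflow

# The decorated function is passed directly in the scorers list
results = mlflow.genai.evaluate(
    data=eval_dataset,           # Hypothetical test dataset
    predict_fn=my_app,           # Hypothetical app under evaluation
    scorers=[my_custom_scorer],
)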

Learn more about Scorers

Judges: mlflow.genai.judges

LLM-based quality assessors that must be wrapped in scorers.

Python
from mlflow.genai.judges import is_safe, is_relevant
from mlflow.genai.scorers import scorer

# Direct usage
feedback = is_safe(content="Hello world")

# Wrapped in scorer
@scorer
def safety_scorer(outputs):
    return is_safe(content=outputs["response"])
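
When called directly, a judge returns a Feedback object, so its value and rationale can be inspected right away:

Python
# Inspect the judge's assessment and its explanation
print(feedback.value)      # The judge's verdict (type depends on the judge)
print(feedback.rationale)  # LLM-generated explanation for the verdict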

Learn more about Judges

Evaluation Harness: mlflow.genai.evaluate(...)

Orchestrates offline evaluation during development.

Python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

results = mlflow.genai.evaluate(
    data=eval_dataset,                       # Test data
    predict_fn=my_app,                       # Your app
    scorers=[Safety(), RelevanceToQuery()],  # Quality metrics
    model_id="models:/my-app/1",             # Optional version tracking
)
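
The harness returns a results object whose run_id links the evaluation run to its traces (see Evaluation Runs below); a sketch, assuming the result also exposes an aggregate metrics mapping:

Python
print(results.run_id)   # ID of the evaluation run created by the harness
print(results.metrics)  # Aggregate scores per scorer (assumed attribute)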

Learn more about Evaluation Harness

Evaluation Datasets: mlflow.genai.datasets.EvaluationDataset

Versioned test data with optional ground truth.

Python
import mlflow.genai.datasets

# Create from production traces
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_data"
)

# Add traces
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")
dataset.insert(traces)

# Use in evaluation
results = mlflow.genai.evaluate(data=dataset, ...)
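
To capture ground truth alongside inputs, records can carry an expectations field; a sketch, assuming insert() also accepts plain record dicts and using a hypothetical expected_facts key:

Python
# Hypothetical record with ground truth; assumes insert() accepts dicts
records = [{
    "inputs": {"question": "What is MLflow?"},
    "expectations": {"expected_facts": ["MLflow is an open source ML platform"]},
}]
dataset.insert(records)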

Learn more about Evaluation Datasets

Evaluation Runs: mlflow.entities.Run

Results from evaluation containing traces with feedback.

Python
import mlflow

# Access evaluation results
traces = mlflow.search_traces(run_id=results.run_id)

# Filter by feedback
good_traces = traces[traces['assessments'].apply(
    lambda x: all(a.value for a in x if a.name == 'Safety')
)]
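
The same assessments column can also be aggregated into a simple pass rate; a sketch that reuses the filter above:

Python
# Fraction of traces where every Safety assessment passed
safety_pass_rate = traces['assessments'].apply(
    lambda x: all(a.value for a in x if a.name == 'Safety')
).mean()
print(f"Safety pass rate: {safety_pass_rate:.1%}")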

Learn more about Evaluation Runs

Production Monitoring

This feature is in Beta.

Continuous evaluation of deployed applications.

Python
import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# Register the scorer with a name and start monitoring;
# the name must be unique within the experiment
safety_scorer = Safety().register(name="my_safety_scorer")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
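
Monitoring keeps scoring sampled production traces until it is stopped; a minimal sketch, assuming the registered scorer exposes a stop() method:

Python
# Stop sampling production traces with this scorer (assumed stop() method)
safety_scorer = safety_scorer.stop()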

Learn more about Production Monitoring