Evaluation and monitoring reference
This page provides reference documentation for MLflow evaluation and monitoring concepts. For guides and tutorials, see Evaluate and Monitor AI agents.
For MLflow 3 evaluation and monitoring API documentation, see API Reference.
Quick reference
| Concept | Purpose | Usage |
|---|---|---|
| Scorers | Evaluate trace quality | Passed to the evaluation harness or to production monitoring |
| LLM judges | LLM-based assessment | Wrapped in scorers for use |
| Evaluation harness | Run offline evaluation | mlflow.genai.evaluate(...) |
| Evaluation datasets | Test data management | mlflow.genai.datasets.EvaluationDataset |
| Evaluation runs | Store evaluation results | Created by harness |
| Production monitoring | Live quality tracking | Registered scorers with sampling |
Scorers: mlflow.genai.scorers
Functions that evaluate traces and return Feedback.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
from typing import Optional, Dict, Any, List
@scorer
def my_custom_scorer(
    *,  # MLflow calls your scorer with named arguments
    inputs: Optional[Dict[Any, Any]],  # App's input from trace
    outputs: Optional[Dict[Any, Any]],  # App's output from trace
    expectations: Optional[Dict[str, Any]],  # Ground truth (offline only)
    trace: Optional[mlflow.entities.Trace]  # Complete trace
) -> int | float | bool | str | Feedback | List[Feedback]:
    # Your evaluation logic
    return Feedback(value=True, rationale="Explanation")
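For instance, a code-based scorer can encode a simple rule directly. A minimal sketch, assuming the app's output dict contains a "response" string (the key and the 100-word limit are illustrative, not part of the MLflow API):
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def is_concise(*, inputs=None, outputs=None, expectations=None, trace=None) -> Feedback:
    # Assumes the app's output is a dict with a "response" string (illustrative key)
    text = (outputs or {}).get("response", "")
    word_count = len(text.split())
    return Feedback(
        value=word_count <= 100,
        rationale=f"Response has {word_count} words; the illustrative limit is 100."
    )

# The custom scorer is then passed to the harness like any built-in scorer:
# mlflow.genai.evaluate(data=eval_data, predict_fn=my_app, scorers=[is_concise])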
Judges
LLM judges are a type of MLflow Scorer that uses Large Language Models for quality assessment. While code-based scorers use programmatic logic, judges leverage the reasoning capabilities of LLMs to evaluate criteria such as helpfulness, relevance, and safety.
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery
# Initialize judges that will assess different quality aspects
safety_judge = Safety()  # Checks for harmful, toxic, or inappropriate content
relevance_judge = RelevanceToQuery()  # Checks if responses are relevant to user queries
# Run evaluation on your test dataset with multiple judges
mlflow.genai.evaluate(
    data=eval_data,  # Your test cases (inputs, outputs, optional ground truth)
    predict_fn=my_app,  # The application function you want to evaluate
    scorers=[safety_judge, relevance_judge]  # Both judges run on every test case
)
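Alongside the prebuilt judges, recent MLflow releases also expose a guidelines-style judge that scores responses against natural-language rules. A sketch, assuming the Guidelines scorer is available in your MLflow version (the name and guideline text are illustrative):
from mlflow.genai.scorers import Guidelines

# Judge that checks responses against a plain-language rule (illustrative rule)
tone_judge = Guidelines(
    name="professional_tone",
    guidelines="The response must be professional and must not contain slang."
)

# Guidelines judges run through the same harness as other scorers:
# mlflow.genai.evaluate(data=eval_data, predict_fn=my_app, scorers=[tone_judge])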
Evaluation Harness: mlflow.genai.evaluate(...)
Orchestrates offline evaluation during development.
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery
results = mlflow.genai.evaluate(
    data=eval_dataset,  # Test data
    predict_fn=my_app,  # Your app
    scorers=[Safety(), RelevanceToQuery()],  # Quality metrics
    model_id="models:/my-app/1"  # Optional version tracking
)
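To make the harness inputs concrete, here is a sketch of a small inline dataset and predict function; the inputs/expectations record layout and the field names are assumptions to adapt to your app's signature:
import mlflow
from mlflow.genai.scorers import Safety

# Each record's "inputs" dict is passed to predict_fn as keyword arguments;
# "expectations" is optional ground truth for scorers that need it (assumed layout)
eval_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open source MLOps platform."},
    },
]

def my_app(question: str) -> dict:
    # Replace with a call into your real application
    return {"response": f"You asked about: {question}"}

results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[Safety()],
)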
Learn more about Evaluation Harness
Evaluation Datasets: mlflow.genai.datasets.EvaluationDataset
Versioned test data with optional ground truth.
import mlflow.genai.datasets
# Create from production traces
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_data"
)
# Add traces
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")
dataset.insert(traces)
# Use in evaluation
results = mlflow.genai.evaluate(data=dataset, ...)
Learn more about Evaluation Datasets
Evaluation Runs: mlflow.entities.Run
Results from evaluation containing traces with feedback.
import mlflow

# Access evaluation results (results.run_id comes from mlflow.genai.evaluate)
traces = mlflow.search_traces(run_id=results.run_id)
# Filter by feedback
good_traces = traces[traces['assessments'].apply(
    lambda x: all(a.value for a in x if a.name == 'Safety')
)]
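The assessments column is ordinary pandas data, so pass rates per scorer can be summarized directly. This sketch assumes the same assessment structure (a name plus a boolean value) used in the filter above:
# Share of each trace's Safety assessments that passed (plain pandas on the
# search_traces DataFrame; assumes boolean assessment values as above)
def pass_rate(assessments, scorer_name):
    values = [a.value for a in assessments if a.name == scorer_name]
    return sum(bool(v) for v in values) / len(values) if values else None

traces["safety_pass_rate"] = traces["assessments"].apply(
    lambda x: pass_rate(x, "Safety")
)
print(traces["safety_pass_rate"].mean())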
Learn more about Evaluation Runs
Production Monitoring
This feature is in Beta.
Continuous evaluation of deployed applications.
import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig
# Register the scorer with a name and start monitoring
safety_judge = Safety().register(name="my_safety_judge")  # name must be unique within the experiment
safety_judge = safety_judge.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
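The same register-and-start pattern can be applied to a custom scorer. A sketch, assuming custom scorers defined with the @scorer decorator support register() and start() in your MLflow version (the name and sample rate are illustrative):
from mlflow.genai.scorers import ScorerSamplingConfig

# Register the custom scorer from the Scorers section and sample 20% of production traces
custom_monitor = my_custom_scorer.register(name="my_custom_scorer")
custom_monitor = custom_monitor.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.2)
)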