Evaluation runs

Evaluation runs are MLflow runs that organize and store the results of evaluating your GenAI app. An evaluation run includes the following:

  • Traces: One trace for each input in your evaluation dataset.
  • Feedback: Quality assessments produced by your scorers and attached to each trace.
  • Metrics: Aggregate statistics across all evaluated examples.
  • Metadata: Information about the evaluation configuration.
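For example, once an evaluation run exists, the per-row traces and their feedback can be retrieved with the standard tracing APIs. A minimal sketch, assuming a run_id from a previous evaluation and an MLflow version whose mlflow.search_traces() accepts a run_id filter (exact DataFrame column names vary across MLflow versions):

Python
import mlflow

run_id = "<your evaluation run ID>"  # e.g. printed by a previous evaluation

# One trace was logged per dataset row; scorer feedback is attached to each trace.
traces_df = mlflow.search_traces(run_id=run_id)  # pandas DataFrame, one row per trace

print(f"Number of evaluated examples: {len(traces_df)}")
print(traces_df.columns.tolist())  # inspect available columns (inputs/outputs, assessments, ...)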

How to create evaluation runs

An evaluation run is created automatically each time you call mlflow.genai.evaluate(). For details on its parameters and return value, see the mlflow.genai.evaluate() API documentation.

Python
import mlflow

# Target experiment for the evaluation run
mlflow.set_experiment("my_app_evaluations")

# This creates an evaluation run
results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[correctness_scorer, safety_scorer],
)

# Access the run ID
print(f"Evaluation run ID: {results.run_id}")

Evaluation run structure

Evaluation Run
├── Run Info
│   ├── run_id: unique identifier
│   ├── experiment_id: which experiment it belongs to
│   ├── start_time: when evaluation began
│   └── status: success/failed
├── Traces (one per dataset row)
│   ├── Trace 1
│   │   ├── inputs: {"question": "What is MLflow?"}
│   │   ├── outputs: {"response": "MLflow is..."}
│   │   └── feedbacks: [correctness: 0.8, relevance: 1.0]
│   ├── Trace 2
│   └── ...
├── Aggregate Metrics
│   ├── correctness_mean: 0.85
│   ├── relevance_mean: 0.92
│   └── safety_pass_rate: 1.0
└── Parameters
    ├── model_version: "v2.1"
    ├── dataset_name: "qa_test_v1"
    └── scorers: ["correctness", "relevance", "safety"]
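
The run-level parts of this structure (run info, aggregate metrics, and parameters) are available through the regular MLflow run APIs. A minimal sketch, assuming a run_id from a previous evaluation:

Python
import mlflow

run_id = "<your evaluation run ID>"
run = mlflow.get_run(run_id)

# Run Info
print(run.info.run_id, run.info.experiment_id, run.info.start_time, run.info.status)

# Aggregate Metrics (e.g. per-scorer means and pass rates)
print(run.data.metrics)

# Parameters (evaluation configuration)
print(run.data.params)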