Migrate to MLflow 3 from Agent Evaluation: Quick reference
This quick reference summarizes key changes for migrating from Agent Evaluation and MLflow 2 to the improved APIs in MLflow 3. See the full guide at Migrate to MLflow 3 from Agent Evaluation.
Import updates
### Old imports ###
from mlflow import evaluate
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from databricks.agents import review_app
### New imports ###
from mlflow.genai import evaluate
from mlflow.genai.scorers import scorer
from mlflow.genai import judges
# For predefined scorers:
from mlflow.genai.scorers import (
    Correctness, Guidelines, ExpectationGuidelines,
    RelevanceToQuery, Safety, RetrievalGroundedness,
    RetrievalRelevance, RetrievalSufficiency
)
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas
Evaluation function
MLflow 2.x | MLflow 3.x |
---|---|
| |
| |
| (not needed) |
| |
| (configuration in scorers) |
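As a rough illustration of the argument changes, the sketch below rewrites a dictionary of MLflow 2-style `mlflow.evaluate()` keyword arguments into the shape expected by `mlflow.genai.evaluate()`. The argument names (`model`, `model_type`, `extra_metrics`, `evaluator_config` on the old side; `predict_fn` and `scorers` on the new side) are assumed from the public MLflow signatures, and `translate_evaluate_kwargs` is a hypothetical helper, not part of MLflow.

```python
from __future__ import annotations

from typing import Any


def translate_evaluate_kwargs(
    old_kwargs: dict[str, Any],
) -> tuple[dict[str, Any], list[str]]:
    """Map mlflow.evaluate() kwargs to mlflow.genai.evaluate() kwargs.

    Hypothetical helper. Returns the new kwargs plus notes about arguments
    that need manual follow-up.
    """
    new_kwargs: dict[str, Any] = {}
    notes: list[str] = []

    for name, value in old_kwargs.items():
        if name == "model":
            # Assumed rename: the app under test is passed as predict_fn.
            new_kwargs["predict_fn"] = value
        elif name == "extra_metrics":
            # Assumed rename: the list is passed as scorers.
            new_kwargs["scorers"] = value
            notes.append("extra_metrics -> scorers: rewrite each @metric as a @scorer")
        elif name == "model_type":
            notes.append("model_type dropped: not needed in MLflow 3")
        elif name == "evaluator_config":
            notes.append("evaluator_config dropped: configuration lives in scorers")
        else:
            # Anything else (for example data=) passes through unchanged.
            new_kwargs[name] = value

    return new_kwargs, notes
```

Calling the helper with `{"data": rows, "model": my_agent, "model_type": "databricks-agent"}` would yield `{"data": rows, "predict_fn": my_agent}` plus one note about the dropped `model_type`. The notes are returned rather than printed so that a migration script can decide how to surface them.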
Judge selection
MLflow 2.x | MLflow 3.x |
---|---|
Automatically runs all applicable judges based on data | Must explicitly specify which scorers to use |
Use | Pass desired scorers in |
| Use |
Judges selected based on available data fields | You control exactly which scorers run |
Data fields
MLflow 2.x Field | MLflow 3.x Field | Description |
---|---|---|
| | Agent input |
| | Agent output |
| | Ground truth |
| Accessed via traces | Context from trace |
| Part of scorer config | Moved to scorer level |
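The field changes can be sketched as a small record converter. The old field names (`request`, `response`, `expected_response`) appear in the migration commands on this page; the target names `inputs`, `outputs`, and `expectations` are assumed from the MLflow 3 evaluation dataset schema, and `to_mlflow3_row` along with its `input_key` parameter is hypothetical.

```python
from __future__ import annotations

from typing import Any


def to_mlflow3_row(old_row: dict[str, Any], input_key: str = "request") -> dict[str, Any]:
    """Convert one Agent Evaluation (MLflow 2) record to an MLflow 3 style record."""
    new_row: dict[str, Any] = {}

    request = old_row.get("request")
    if request is not None:
        # Assumption: inputs is a dict of keyword arguments for the app under
        # test, so a bare string is wrapped under input_key.
        new_row["inputs"] = request if isinstance(request, dict) else {input_key: request}

    if "response" in old_row:
        new_row["outputs"] = old_row["response"]

    # Every expected_* field is consolidated into one expectations dict.
    expectations = {k: v for k, v in old_row.items() if k.startswith("expected_")}
    if expectations:
        new_row["expectations"] = expectations

    # retrieved_context and guidelines are deliberately not copied: context is
    # read from traces, and guidelines move to the scorer level.
    return new_row
```

Wrapping a string request in a dict is a design choice for this sketch; pick an `input_key` that matches the parameter name of the function being evaluated.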
Custom metrics and scorers
MLflow 2.x | MLflow 3.x | Notes |
---|---|---|
| | New name |
| | Simplified |
Multiple expected_* params | Single | Consolidated |
| Part of | Simplified |
| | Consistent naming |
| | Consistent naming |
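One low-risk way to port an existing metric is to keep its body and wrap it in a function with the new parameter names. The sketch below assumes the old metric takes `request`, `response`, and `expected_response`, and that the new scorer receives `inputs`, `outputs`, and a single `expectations` dict; `adapt_metric` and `exact_match` are hypothetical names. It has no MLflow dependency, so the wrapped function would still need to be decorated with `scorer` from `mlflow.genai.scorers` before use.

```python
from __future__ import annotations

from typing import Any, Callable


def adapt_metric(old_metric: Callable[..., Any], input_key: str = "request") -> Callable[..., Any]:
    """Wrap an old-style metric body so it accepts MLflow 3 style arguments."""

    def new_scorer(inputs: dict, outputs: Any, expectations: dict | None = None) -> Any:
        expectations = expectations or {}
        return old_metric(
            request=inputs.get(input_key),
            response=outputs,
            # The separate expected_* parameter is read from the single dict.
            expected_response=expectations.get("expected_response"),
        )

    # Keep the original name so results stay recognizable.
    new_scorer.__name__ = getattr(old_metric, "__name__", "adapted_metric")
    return new_scorer


def exact_match(request, response, expected_response):
    # Old-style metric body: plain Python comparison.
    return response == expected_response


exact_match_scorer = adapt_metric(exact_match)
```

Once all call sites are migrated, the wrapper can be dropped and the metric body rewritten directly against the new parameter names.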
Result access
MLflow 2.x | MLflow 3.x |
---|---|
| |
Direct DataFrame access | Iterate through traces and assessments |
LLM judges
Use Case | MLflow 2.x | MLflow 3.x Recommended |
---|---|---|
Basic correctness check | | |
Safety evaluation | | |
Global guidelines | | |
Per-eval-set-row guidelines | | |
Check for factual support | | |
Check relevance of context | | |
Check relevance of context chunks | | |
Check completeness of context | | |
Complex custom logic | Direct judge calls in | Predefined scorers or |
Human feedback
MLflow 2.x | MLflow 3.x |
---|---|
| |
| |
| |
| |
Common migration commands
# Find old evaluate calls
grep -r "mlflow.evaluate" . --include="*.py"
# Find old metric decorators
grep -r "@metric" . --include="*.py"
# Find old data fields
grep -rE '"request":|"response":|"expected_response":' . --include="*.py"
# Find old imports
grep -r "databricks.agents" . --include="*.py"
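Where `grep` is unavailable, or a single report is easier to work with, the same searches can be approximated in Python. This is a minimal sketch using only the standard library; `find_old_usages` and the labels in `OLD_PATTERNS` are hypothetical names. Unlike the grep commands, the dots in the patterns are escaped, so the matching is slightly stricter.

```python
from __future__ import annotations

import re
from pathlib import Path

# The same searches as the grep commands, as Python regular expressions.
OLD_PATTERNS = {
    "old evaluate call": re.compile(r"mlflow\.evaluate"),
    "old metric decorator": re.compile(r"@metric"),
    "old data field": re.compile(r'"(request|response|expected_response)":'),
    "old import": re.compile(r"databricks\.agents"),
}


def find_old_usages(root: str = ".") -> list[tuple[str, int, str]]:
    """Return (path, line_number, label) for each match in *.py files under root."""
    hits: list[tuple[str, int, str]] = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        except OSError:
            continue  # unreadable entry: skip it rather than abort the scan
        for number, line in enumerate(lines, start=1):
            for label, pattern in OLD_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), number, label))
    return hits
```

Calling `find_old_usages(".")` from the project root and printing each tuple gives a checklist of lines to revisit during the migration.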
Additional resources
- MLflow 3 GenAI Evaluation Guide
- Custom Scorers Documentation
- Human Feedback with Labeling Sessions
- Predefined Judge Scorers
- MLflow Tracing Guide
For additional support during migration, consult the MLflow documentation or reach out to your Databricks support team.