Migrate to MLflow 3 from Agent Evaluation
Agent Evaluation is now integrated with MLflow 3 on Databricks. The Agent Evaluation SDK methods are exposed through the mlflow[databricks]>=3.1 SDK, under the mlflow.genai namespace. MLflow 3 introduces:
- A refreshed UI that mirrors all SDK functionality
- A new SDK, mlflow.genai, with simplified APIs for running evaluation, human labeling, and managing evaluation datasets
- Enhanced tracing with a production-scale trace ingestion backend that provides real-time observability
- Streamlined human feedback collection
- Improved LLM judges as built-in scorers
This guide helps you migrate from Agent Evaluation (MLflow 2.x with databricks-agents<1.0) to MLflow 3. This detailed guide is also available in a quick reference format.
MLflow 3 with Agent Evaluation works only on Managed MLflow, not open source MLflow. See the managed vs. open source MLflow page for a deeper comparison of the two.
Migration checklist
Get started by using this checklist. Each item links to details in sections below.
Evaluation API
- Update imports from databricks.agents to mlflow.genai.*
- Convert @metric decorators to @scorer
- Replace mlflow.evaluate() with mlflow.genai.evaluate()
LLM judges
- Replace direct judge calls with predefined scorers where possible
- Update judges.guideline_adherence() to judges.meets_guidelines() or the Guidelines() scorer
- Update judge function parameter names to match the new API
- Consider using ExpectationGuidelines() for ground-truth based guidelines
Human feedback
- Update Labeling Session and Review App imports to mlflow.genai.labeling
- Update label schema imports to mlflow.genai.label_schemas
- Update logic for syncing feedback to datasets
Common pitfalls to avoid
- Remember to update data field names in your DataFrames
- Remember that model_type="databricks-agent" is no longer needed
- Ensure custom scorers return valid values ("yes"/"no" for pass/fail)
- Use search_traces() instead of accessing result tables directly
- Update any hardcoded namespace references in your code
- Remember to explicitly specify all scorers - MLflow 3 does not automatically run judges
- Convert global_guidelines from config to explicit Guidelines() scorers
Evaluation API migration
Import updates
The following list summarizes the imports to update; details and examples appear in the subsections below.
# Old imports
from mlflow import evaluate
from databricks.agents.evals import metric
from databricks.agents.evals import judges
# New imports
from mlflow.genai import evaluate
from mlflow.genai.scorers import scorer
from mlflow.genai import judges
# For predefined scorers:
from mlflow.genai.scorers import (
Correctness, Guidelines, ExpectationGuidelines,
RelevanceToQuery, Safety, RetrievalGroundedness,
RetrievalRelevance, RetrievalSufficiency
)
From mlflow.evaluate() to mlflow.genai.evaluate()
The core evaluation API has moved to a dedicated GenAI namespace with cleaner parameter names.
MLflow 2.x | MLflow 3.x | Notes
---|---|---
mlflow.evaluate() | mlflow.genai.evaluate() | New namespace
model parameter | predict_fn parameter | More descriptive name
model_type="databricks-agent" | Not needed | Automatically detected
evaluator_config | scorers parameter | Clearer terminology
extra_metrics | Not needed | Part of scorers
MLflow 2.x Field | MLflow 3.x Field | Description
---|---|---
request | inputs | Agent input
response | outputs | Agent output
expected_response | expectations | Ground truth
retrieved_context | Accessed via traces | Context from trace
global_guidelines | Part of scorer config | Moved to scorer level
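If you have existing MLflow 2.x-style rows, a small helper can reshape them into the nested MLflow 3 format. This is a minimal sketch, assuming flat request / response / expected_response keys as in the table above; eval_data_v2 is a hypothetical name for your existing list of dicts:
# Hypothetical existing MLflow 2.x-style rows (flat fields)
eval_data_v2 = [
    {
        "request": "What is MLflow?",
        "response": "MLflow is an open-source platform for managing ML lifecycle.",
        "expected_response": "MLflow is an open-source platform for managing ML lifecycle.",
    },
]

def to_mlflow3_row(row: dict) -> dict:
    """Convert a flat MLflow 2.x row into the nested MLflow 3 format."""
    converted = {"inputs": {"request": row["request"]}}
    if "response" in row:
        converted["outputs"] = {"response": row["response"]}
    if "expected_response" in row:
        converted["expectations"] = {"expected_response": row["expected_response"]}
    return converted

eval_data = [to_mlflow3_row(row) for row in eval_data_v2]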
Example: Basic evaluation
MLflow 2.x:
import mlflow
import pandas as pd
eval_data = [
{
"request": "What is MLflow?",
"response": "MLflow is an open-source platform for managing ML lifecycle.",
"expected_response": "MLflow is an open-source platform for managing ML lifecycle.",
},
{
"request": "What is Databricks?",
"response": "Databricks is a unified analytics platform.",
"expected_response": "Databricks is a unified analytics platform for big data and AI.",
},
]
# Note: By default, MLflow 2.x runs all applicable judges automatically
results = mlflow.evaluate(
data=eval_data,
model=my_agent,
model_type="databricks-agent",
evaluator_config={
"databricks-agent": {
# Optional: limit to specific judges
# "metrics": ["correctness", "safety"],
# Optional: add global guidelines
"global_guidelines": {
"clarity": ["Response must be clear and concise"]
}
}
}
)
# Access results
eval_df = results.tables['eval_results']
MLflow 3.x:
import mlflow
import pandas as pd
from mlflow.genai.scorers import Correctness, Guidelines, RelevanceToQuery
eval_data = [
{
"inputs": {"request": "What is MLflow?"},
"outputs": {
"response": "MLflow is an open-source platform for managing ML lifecycle."
},
"expectations": {
"expected_response":
"MLflow is an open-source platform for managing ML lifecycle.",
},
},
{
"inputs": {"request": "What is Databricks?"},
"outputs": {"response": "Databricks is a unified analytics platform."},
"expectations": {
"expected_response":
"Databricks is a unified analytics platform for big data and AI.",
},
},
]
# Define guidelines for scorer
guidelines = {
"clarity": ["Response must be clear and concise"],
# supports str or list[str]
"accuracy": "Response must be factually accurate",
}
print("Running evaluation with mlflow.genai.evaluate()...")
with mlflow.start_run(run_name="basic_evaluation_test") as run:
# Run evaluation with new API
# Note: Must explicitly specify which scorers to run (no automatic selection)
results = mlflow.genai.evaluate(
data=eval_data,
scorers=[
Correctness(), # Requires expectations.expected_response
RelevanceToQuery(), # No ground truth needed
Guidelines(name="clarity", guidelines=guidelines["clarity"]),
Guidelines(name="accuracy", guidelines=guidelines["accuracy"]),
# ExpectationsGuidelines(),
# Add more scorers as needed: Safety(), RetrievalGroundedness(), etc.
],
)
# Access results using search_traces
traces = mlflow.search_traces(
run_id=results.run_id,
)
Accessing evaluation results
In MLflow 3, evaluation results are stored as traces with assessments. Use mlflow.search_traces()
to access detailed results:
# Access results using search_traces
traces = mlflow.search_traces(
run_id=results.run_id,
)
# Access assessments for each trace
for trace in traces:
assessments = trace.info.assessments
for assessment in assessments:
print(f"Scorer: {assessment.name}")
print(f"Value: {assessment.value}")
print(f"Rationale: {assessment.rationale}")
Evaluating an MLflow LoggedModel
In MLflow 2.x, you could pass a logged MLflow model (such as a PyFunc model or one logged by the Agent Framework) directly to mlflow.evaluate(). In MLflow 3.x, you need to wrap the model in a predict function to handle parameter mapping.
This wrapper is necessary because mlflow.genai.evaluate()
expects a predict function that accepts the keys in the inputs
dict from your dataset as keyword arguments, while most logged models accept a single input parameter (e.g., model_inputs
for PyFunc models or similar interfaces for LangChain models).
The predict function serves as a translation layer between the evaluation framework's named parameters and the model's expected input format.
import mlflow
from mlflow.genai.scorers import Safety
# Make sure to load your logged model outside of the predict_fn so MLflow only loads it once!
model = mlflow.pyfunc.load_model("models:/chatbot/staging")
def evaluate_model(question: str) -> dict:
return model.predict({"question": question})
results = mlflow.genai.evaluate(
data=[{"inputs": {"question": "Tell me about MLflow"}}],
predict_fn=evaluate_model,
scorers=[Safety()]
)
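If the inputs dict in your dataset has more than one key, each key is passed to the predict function as a separate keyword argument. Here is a hedged sketch with hypothetical question and chat_history keys, reusing the model URI from the example above:
import mlflow
from mlflow.genai.scorers import Safety

model = mlflow.pyfunc.load_model("models:/chatbot/staging")

def evaluate_model(question: str, chat_history: list) -> dict:
    # Translate the evaluation framework's named parameters into the model's single payload
    return model.predict({"question": question, "chat_history": chat_history})

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "Tell me about MLflow", "chat_history": []}}],
    predict_fn=evaluate_model,
    scorers=[Safety()],
)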
Custom metrics to scorers migration
Custom evaluation functions (@metric) now use the @scorer decorator with a simplified signature.
MLflow 2.x | MLflow 3.x | Notes
---|---|---
@metric decorator | @scorer decorator | New name
Assessment return type | Feedback return type | Simplified
Multiple expected_* params | Single expectations dict | Consolidated
retrieved_context parameter | Part of traces | Simplified
request parameter | inputs | Consistent naming
response parameter | outputs | Consistent naming
Example: Pass/fail scorer
MLflow 2.x:
from databricks.agents.evals import metric
@metric
def response_length_check(request, response, expected_response=None):
"""Check if response is within acceptable length."""
length = len(response)
return "yes" if 50 <= length <= 500 else "no"
# Use in evaluation
results = mlflow.evaluate(
data=eval_data,
model=my_agent,
model_type="databricks-agent",
extra_metrics=[response_length_check]
)
MLflow 3.x:
import mlflow
from mlflow.genai.scorers import scorer
# Sample agent function
@mlflow.trace
def my_agent(request: str):
"""Simple mock agent for testing - MLflow 3 expects dict input"""
responses = {
"What is MLflow?": "MLflow is an open-source platform for managing ML lifecycle.",
"What is Databricks?": "Databricks is a unified analytics platform.",
}
return {"response": responses.get(request, "I don't have information about that.")}
@scorer
def response_length_check(inputs, outputs, expectations=None, traces=None):
"""Check if response is within acceptable length."""
length = len(outputs["response"])
return "yes" if 50 <= length <= 500 else "no"
# Use in evaluation
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[response_length_check]
)
Example: Numeric scorer with Assessment
MLflow 2.x:
from databricks.agents.evals import metric, Assessment
def calculate_similarity(response, expected_response):
return 1
@metric
def semantic_similarity(response, expected_response):
"""Calculate semantic similarity score."""
# Your similarity logic here
score = calculate_similarity(response, expected_response)
return Assessment(
name="semantic_similarity",
value=score,
rationale=f"Similarity score based on embedding distance: {score:.2f}"
)
MLflow 3.x:
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
# Placeholder similarity logic (same stub as in the MLflow 2.x example above)
def calculate_similarity(outputs, expected_response):
    return 1
@scorer
def semantic_similarity(outputs, expectations):
"""Calculate semantic similarity score."""
# Your similarity logic here
expected = expectations.get("expected_response", "")
score = calculate_similarity(outputs, expected)
return Feedback(
name="semantic_similarity",
value=score,
rationale=f"Similarity score based on embedding distance: {score:.2f}"
)
LLM judges migration
Key differences in judge behavior
MLflow 2.x | MLflow 3.x
---|---
Automatically runs all applicable judges based on data | Must explicitly specify which scorers to use
Use evaluator_config to limit which judges run | Pass desired scorers in the scorers parameter
global_guidelines in config | Use Guidelines() scorers
Judges selected based on available data fields | You control exactly which scorers run
MLflow 2.x automatic judge selection:
- Without ground truth: runs chunk_relevance, groundedness, relevance_to_query, safety, guideline_adherence
- With ground truth: also runs context_sufficiency, correctness
MLflow 3.x explicit scorer selection:
- You must explicitly list scorers you want to run
- More control but requires being explicit about evaluation needs
Migration paths
Use Case | MLflow 2.x | MLflow 3.x Recommended
---|---|---
Basic correctness check | judges.correctness() with @metric | Correctness() scorer
Safety evaluation | judges.safety() with @metric | Safety() scorer
Global guidelines | global_guidelines in evaluator_config | Guidelines() scorer
Per-eval-set-row guidelines | guidelines field in eval data | ExpectationGuidelines() scorer
Check for factual support | judges.groundedness() with @metric | RetrievalGroundedness() scorer
Check relevance of context | judges.relevance_to_query() with @metric | RelevanceToQuery() scorer
Check relevance of context chunks | judges.chunk_relevance() with @metric | RetrievalRelevance() scorer
Check completeness of context | judges.context_sufficiency() with @metric | RetrievalSufficiency() scorer
Complex custom logic | Direct judge calls in @metric | Predefined scorers or direct judge calls in @scorer
MLflow 3 provides two ways to use LLM judges:
- Predefined scorers - Ready-to-use scorers that wrap judges with automatic trace parsing
- Direct judge calls - Call judges directly within custom scorers for more control
Controlling which judges run
Example: Specifying judges to run
MLflow 2.x (limiting default judges):
import mlflow
# By default, runs all applicable judges
# Use evaluator_config to limit which judges run
results = mlflow.evaluate(
data=eval_data,
model=my_agent,
model_type="databricks-agent",
evaluator_config={
"databricks-agent": {
# Only run these specific judges
"metrics": ["groundedness", "relevance_to_query", "safety"]
}
}
)
MLflow 3.x (explicit scorer selection):
from mlflow.genai.scorers import (
RetrievalGroundedness,
RelevanceToQuery,
Safety
)
# Must explicitly specify which scorers to run
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[
RetrievalGroundedness(),
RelevanceToQuery(),
Safety()
]
)
Comprehensive migration example
This example shows migrating an evaluation that uses multiple judges with custom configuration:
MLflow 2.x:
from databricks.agents.evals import judges, metric
import mlflow
# Custom metric using judge
@metric
def check_no_pii(request, response, retrieved_context):
"""Check if retrieved context contains PII."""
context_text = '\n'.join([c['content'] for c in retrieved_context])
return judges.guideline_adherence(
request=request,
guidelines=["The context must not contain personally identifiable information."],
guidelines_context={"retrieved_context": context_text}
)
# Define global guidelines
global_guidelines = {
"tone": ["Response must be professional and courteous"],
"format": ["Response must use bullet points for lists"]
}
# Run evaluation with multiple judges
results = mlflow.evaluate(
data=eval_data,
model=my_agent,
model_type="databricks-agent",
evaluator_config={
"databricks-agent": {
# Specify subset of built-in judges
"metrics": ["correctness", "groundedness", "safety"],
# Add global guidelines
"global_guidelines": global_guidelines
}
},
# Add custom judge
extra_metrics=[check_no_pii]
)
MLflow 3.x:
from mlflow.genai.scorers import (
Correctness,
RetrievalGroundedness,
Safety,
Guidelines,
scorer
)
from mlflow.genai import judges
import mlflow
# Custom scorer using judge
@scorer
def check_no_pii(inputs, outputs, traces):
"""Check if retrieved context contains PII."""
# Extract retrieved context from trace
retrieved_context = traces.data.spans[0].attributes.get("retrieved_context", [])
context_text = '\n'.join([c['content'] for c in retrieved_context])
return judges.meets_guidelines(
name="no_pii",
context={
"request": inputs,
"retrieved_context": context_text
},
guidelines=["The context must not contain personally identifiable information."]
)
# Run evaluation with explicit scorers
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[
# Built-in scorers (explicitly specified)
Correctness(),
RetrievalGroundedness(),
Safety(),
# Global guidelines as scorers
Guidelines(name="tone", guidelines="Response must be professional and courteous"),
Guidelines(name="format", guidelines="Response must use bullet points for lists"),
# Custom scorer
check_no_pii
]
)
Migrating to predefined judge scorers
MLflow 3 provides predefined scorers that wrap the LLM judges, making them easier to use with mlflow.genai.evaluate().
Example: Correctness judge
MLflow 2.x:
from databricks.agents.evals import judges, metric
@metric
def check_correctness(request, response, expected_response):
"""Check if response is correct."""
return judges.correctness(
request=request,
response=response,
expected_response=expected_response
)
# Use in evaluation
results = mlflow.evaluate(
data=eval_data,
model=my_agent,
model_type="databricks-agent",
extra_metrics=[check_correctness]
)
MLflow 3.x (Option 1: Using predefined scorer):
from mlflow.genai.scorers import Correctness
# Use predefined scorer directly
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[Correctness()]
)
MLflow 3.x (Option 2: Custom scorer with judge):
from mlflow.genai.scorers import scorer
from mlflow.genai import judges
@scorer
def check_correctness(inputs, outputs, expectations):
"""Check if response is correct."""
return judges.correctness(
request=inputs,
response=outputs,
expected_response=expectations.get("expected_response", "")
)
# Use in evaluation
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[check_correctness]
)
Example: Safety judge
MLflow 2.x:
from databricks.agents.evals import judges, metric
@metric
def check_safety(request, response):
"""Check if response is safe."""
return judges.safety(
request=request,
response=response
)
MLflow 3.x:
from mlflow.genai.scorers import Safety
# Use predefined scorer
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[Safety()]
)
Example: Relevance judge
MLflow 2.x:
from databricks.agents.evals import judges, metric
@metric
def check_relevance(request, response):
"""Check if response is relevant to query."""
return judges.relevance_to_query(
request=request,
response=response
)
MLflow 3.x:
from mlflow.genai.scorers import RelevanceToQuery
# Use predefined scorer
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[RelevanceToQuery()]
)
Example: Groundedness judge
MLflow 2.x:
from databricks.agents.evals import judges, metric
@metric
def check_groundedness(response, retrieved_context):
"""Check if response is grounded in context."""
context_text = '\n'.join([c['content'] for c in retrieved_context])
return judges.groundedness(
response=response,
context=context_text
)
MLflow 3.x:
from mlflow.genai.scorers import RetrievalGroundedness
# Use predefined scorer (automatically extracts context from trace)
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[RetrievalGroundedness()]
)
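RetrievalGroundedness() locates the retrieved documents by reading your agent's trace, so the retrieval step needs to be recorded as a retriever span. The following is a minimal sketch, assuming a hypothetical fetch_docs helper and documents shaped as dicts with a page_content field:
import mlflow
from mlflow.entities import SpanType

def fetch_docs(query: str) -> list[str]:
    # Hypothetical stand-in for your retrieval backend
    return ["MLflow is an open-source platform for managing the ML lifecycle."]

@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_context(query: str) -> list[dict]:
    # Recorded as a RETRIEVER span so the scorer can read the documents from the trace
    return [{"page_content": doc} for doc in fetch_docs(query)]

@mlflow.trace
def my_agent(request: str) -> dict:
    docs = retrieve_context(request)
    # Hypothetical generation step; a real agent would call an LLM here
    return {"response": f"Based on {len(docs)} retrieved document(s): {docs[0]['page_content']}"}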
Migrating guideline adherence to meets_guidelines
The guideline_adherence judge has been renamed to meets_guidelines with a cleaner API.
MLflow 2.x:
from databricks.agents.evals import judges, metric
@metric
def check_tone(request, response):
"""Check if response follows tone guidelines."""
return judges.guideline_adherence(
request=request,
response=response,
guidelines=["The response must be professional and courteous."]
)
@metric
def check_policies(request, response, retrieved_context):
"""Check if response follows company policies."""
context_text = '\n'.join([c['content'] for c in retrieved_context])
return judges.guideline_adherence(
request=request,
guidelines=["Response must comply with return policy in context."],
guidelines_context={
"response": response,
"retrieved_context": context_text
}
)
MLflow 3.x (Option 1: Using predefined Guidelines scorer):
from mlflow.genai.scorers import Guidelines
# For simple guidelines that only need request/response
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[
Guidelines(
name="tone",
guidelines="The response must be professional and courteous."
)
]
)
MLflow 3.x (Option 2: Custom scorer with meets_guidelines):
from mlflow.genai.scorers import scorer
from mlflow.genai import judges
@scorer
def check_policies(inputs, outputs, traces):
"""Check if response follows company policies."""
# Extract retrieved context from trace
retrieved_context = traces.data.spans[0].attributes.get("retrieved_context", [])
context_text = '\n'.join([c['content'] for c in retrieved_context])
return judges.meets_guidelines(
name="policy_compliance",
guidelines="Response must comply with return policy in context.",
context={
"request": inputs,
"response": outputs,
"retrieved_context": context_text
}
)
Example: Migrating ExpectationGuidelines
For guidelines contained in an evaluation set, use ExpectationGuidelines:
MLflow 2.x:
from databricks.agents.evals import judges, metric
# Define `guidelines` as a key in your evaluation set
@metric
def check_completeness(request, response, expected_facts):
"""Check if response includes all expected facts."""
facts_text = '\n'.join(expected_facts)
return judges.guideline_adherence(
request=request,
guidelines=[f"Response must include all of these facts: {facts_text}"],
guidelines_context={"response": response}
)
MLflow 3.x:
from mlflow.genai.scorers import ExpectationGuidelines
# Define `guidelines` as a key within `expectations` in your evaluation set
# Use predefined scorer
results = mlflow.genai.evaluate(
data=eval_data, # each row must include guidelines under expectations
predict_fn=my_agent,
scorers=[
ExpectationGuidelines()
]
)
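The scorer reads the per-row guidelines from the expectations dict. This is a hedged sketch of what a row in eval_data could look like under that assumption; the guideline text is illustrative:
eval_data = [
    {
        "inputs": {"request": "How do I return a product?"},
        "expectations": {
            # Per-row guidelines consumed by the ExpectationGuidelines scorer
            "guidelines": [
                "Response must mention the return window",
                "Response must be polite and professional",
            ],
        },
    },
]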
Replicating MLflow 2.x automatic judge behavior
To replicate MLflow 2.x behavior of running all applicable judges, explicitly include all scorers:
MLflow 2.x (automatic):
# Automatically runs all applicable judges based on data
results = mlflow.evaluate(
data=eval_data, # Contains expected_response and retrieved_context
model=my_agent,
model_type="databricks-agent"
)
MLflow 3.x (explicit):
from mlflow.genai.scorers import (
Correctness, RetrievalSufficiency, # Require ground truth
RelevanceToQuery, Safety, RetrievalGroundedness, RetrievalRelevance # No ground truth
)
# Manually specify all judges you want to run
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[
# With ground truth judges
Correctness(),
RetrievalSufficiency(),
# Without ground truth judges
RelevanceToQuery(),
Safety(),
RetrievalGroundedness(),
RetrievalRelevance(),
]
)
Direct judge usage
You can still call judges directly for testing:
from mlflow.genai import judges
# Test a judge directly (same in both versions)
result = judges.correctness(
request="What is MLflow?",
response="MLflow is an open-source platform for ML lifecycle.",
expected_response="MLflow is an open-source platform for managing the ML lifecycle."
)
print(f"Judge result: {result.value}")
print(f"Rationale: {result.rationale}")
Human feedback migration
Labeling sessions and schemas
The Review App functionality has moved from databricks.agents to mlflow.genai.labeling.
Namespace changes:
MLflow 2.x | MLflow 3.x
---|---
databricks.agents.review_app | mlflow.genai.labeling
review_app.label_schemas | mlflow.genai.label_schemas
my_app.create_labeling_session() | labeling.create_labeling_session()
my_app.create_label_schema() | label_schemas.create_label_schema()
Example: Creating a labeling session
MLflow 2.x:
from databricks.agents import review_app
import mlflow
# Get review app
my_app = review_app.get_review_app()
# Create custom label schema
quality_schema = my_app.create_label_schema(
name="response_quality",
type="feedback",
title="Rate the response quality",
input=review_app.label_schemas.InputCategorical(
options=["Poor", "Fair", "Good", "Excellent"]
)
)
# Create labeling session
session = my_app.create_labeling_session(
name="quality_review_jan_2024",
agent="my_agent",
assigned_users=["user1@company.com", "user2@company.com"],
label_schemas=[
review_app.label_schemas.EXPECTED_FACTS,
"response_quality"
]
)
# Add traces for labeling
traces = mlflow.search_traces(run_id=run_id)
session.add_traces(traces)
MLflow 3.x:
import mlflow
import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas
# Create custom label schema
quality_schema = schemas.create_label_schema(
name="response_quality",
type=schemas.LabelSchemaType.FEEDBACK,
title="Rate the response quality",
input=schemas.InputCategorical(
options=["Poor", "Fair", "Good", "Excellent"]
),
overwrite=True
)
# Previously built-in schemas must be created before use.
# However, constants for their names are provided so that your schemas work with built-in scorers.
expected_facts_schema = schemas.create_label_schema(
name=schemas.EXPECTED_FACTS,
type=schemas.LabelSchemaType.EXPECTATION,
title="Expected facts",
input=schemas.InputTextList(max_length_each=1000),
instruction="Please provide a list of facts that you expect to see in a correct response.",
overwrite=True
)
# Create labeling session
session = labeling.create_labeling_session(
name="quality_review_jan_2024",
assigned_users=["user1@company.com", "user2@company.com"],
label_schemas=[
schemas.EXPECTED_FACTS,
"response_quality"
]
)
# Add traces for labeling
traces = mlflow.search_traces(
run_id=session.mlflow_run_id
)
session.add_traces(traces)
# Get review app URL
app = labeling.get_review_app()
print(f"Review app URL: {app.url}")
Syncing feedback to datasets
MLflow 2.x:
# Sync expectations back to dataset
session.sync(to_dataset="catalog.schema.eval_dataset")
# Use dataset for evaluation
dataset = spark.read.table("catalog.schema.eval_dataset")
results = mlflow.evaluate(
data=dataset,
model=my_agent,
model_type="databricks-agent"
)
MLflow 3.x:
from mlflow.genai import datasets
from mlflow.genai.scorers import Correctness, Safety
import mlflow
# Sample agent function
@mlflow.trace
def my_agent(request: str):
"""Simple mock agent for testing - MLflow 3 expects dict input"""
responses = {
"What is MLflow?": "MLflow is an open-source platform for managing ML lifecycle.",
"What is Databricks?": "Databricks is a unified analytics platform.",
}
return {"response": responses.get(request, "I don't have information about that.")}
# Sync expectations back to dataset
session.sync(to_dataset="catalog.schema.eval_dataset")
# Use dataset for evaluation
dataset = datasets.get_dataset("catalog.schema.eval_dataset")
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_agent,
    # MLflow 3 does not run judges automatically; list scorers explicitly
    scorers=[Correctness(), Safety()]
)
Additional resources
- MLflow 3 GenAI Evaluation Guide
- Custom Scorers Documentation
- Human Feedback with Labeling Sessions
- Predefined Judge Scorers
- MLflow Tracing Guide
For additional support during migration, consult the MLflow documentation or reach out to your Databricks support team.