Align judges with humans
Judge alignment teaches LLM judges to match human evaluation standards through systematic feedback. This process transforms generic evaluators into domain-specific experts that understand your unique quality criteria, improving agreement with human assessments by 30 to 50 percent compared to baseline judges.
The same alignment workflow applies to both built-in judges (such as RelevanceToQuery, Safety, or Correctness) and custom judges created with make_judge(). Use alignment with built-in judges to adapt their generic criteria to your domain, or with custom judges to refine specialized evaluation logic.
Judge alignment follows a three-step workflow:
- Generate initial assessments: Use a built-in or custom judge to evaluate traces and establish a baseline.
- Collect human feedback: Domain experts review and correct judge assessments.
- Align and deploy: Invoke the judge's align() method to create a new judge that is more aligned with human feedback.
Alignment works with the optimizers available in the mlflow.genai.judges.optimizers package.
Requirements
- MLflow 3.4.0 or above to use judge alignment features:
  %pip install --upgrade "mlflow[databricks]>=3.4.0" databricks_openai dspy
  dbutils.library.restartPython()
- A judge to align. This can be a built-in judge (for example, RelevanceToQuery or Correctness) or a custom judge created with make_judge().
- The human feedback assessment name must exactly match the judge's name attribute. For built-in judges, this is the default snake_case name (for example, relevance_to_query for RelevanceToQuery) unless you override it by passing name= when instantiating the class. For custom judges, it's the name you passed to make_judge() (for example, product_quality).
- Alignment is not supported for session-level (multi-turn) judges such as ConversationCompleteness.
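Because the feedback name must match the judge's name exactly, a quick pre-check can catch near-misses such as different casing or stray whitespace. The following is a minimal sketch in plain Python; the helper find_mismatched_feedback_names is hypothetical and not part of MLflow, and the sample names are made up for illustration.

```python
def find_mismatched_feedback_names(
    judge_name: str, feedback_names: list[str]
) -> list[str]:
    """Return the distinct feedback names that do not exactly match judge_name.

    Hypothetical helper for illustration; it is not an MLflow API.
    """
    return sorted({name for name in feedback_names if name != judge_name})


# Made-up feedback names, as they might be typed by different reviewers.
collected_names = [
    "relevance_to_query",   # exact match
    "RelevanceToQuery",     # class name instead of the snake_case name
    "relevance_to_query ",  # trailing space
    "relevance-to-query",   # hyphens instead of underscores
]

mismatches = find_mismatched_feedback_names("relevance_to_query", collected_names)
for name in mismatches:
    print(f"Feedback name {name!r} will not be matched to the judge")
```

In practice you would pass the judge's name attribute as judge_name and the names found on your traces as feedback_names.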
Step 1: Set up the judge and generate traces
Set up your initial judge and generate traces with assessments. You can achieve reasonable alignment with at least 10 traces, but 50-100 traces yield better results.
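Choose one of the following starting points. Both examples assign the judge to initial_judge, so run only one of them:
- Built-in judge
- Custom judge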
Instantiate a built-in judge directly. Built-in judges expose a name attribute (the default is a snake_case string such as relevance_to_query) that you'll use when logging human feedback in Step 2.
from mlflow.genai.scorers import RelevanceToQuery
import mlflow
# Create or set an MLflow experiment for alignment.
# Use a workspace path such as /Shared/<name> or /Users/<your-email>/<name>.
experiment = mlflow.set_experiment("/Shared/relevance-alignment")
experiment_id = experiment.experiment_id
# Use a built-in judge
initial_judge = RelevanceToQuery()
Create a custom judge with make_judge(). The name argument is the same name you'll use when logging human feedback in Step 2.
from mlflow.genai.judges import make_judge
import mlflow
# Create or set an MLflow experiment for alignment.
# Use a workspace path such as /Shared/<name> or /Users/<your-email>/<name>.
experiment = mlflow.set_experiment("/Shared/product-quality-alignment")
experiment_id = experiment.experiment_id
# Create initial judge with template-based evaluation
initial_judge = make_judge(
    name="product_quality",
    instructions=(
        "Evaluate if the product description in {{ outputs }} "
        "is accurate and helpful for the query in {{ inputs }}. "
        "Rate as: excellent, good, fair, or poor"
    ),
    model="databricks:/databricks-gpt-oss-120b",
)
Define your application logic. The following example uses a Databricks-hosted foundation model to generate a product description from a query. Replace this with your own application code:
import mlflow
from databricks_openai import DatabricksOpenAI
# Enable automatic tracing of OpenAI calls
mlflow.openai.autolog()
# Create an OpenAI client connected to Databricks-hosted LLMs
client = DatabricksOpenAI()
model_name = "databricks-claude-sonnet-4"
def generate_product_description(query: str) -> str:
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "system",
                "content": "You write concise, accurate product descriptions.",
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
Generate traces and run the judge. Use the judge's name attribute (for example, relevance_to_query for the built-in judge above, or product_quality for the custom judge above) as the feedback name:
# Generate traces for alignment (minimum 10, recommended 50+)
for i in range(50):
    query = f"Tell me about product {i}"
    description = generate_product_description(query)
    # Retrieve the ID of the most recent finished trace
    trace_id = mlflow.get_last_active_trace_id()
    trace = mlflow.get_trace(trace_id)
    # Generate judge assessment
    judge_result = initial_judge(trace=trace)
    # Log judge feedback to the trace using the judge's name
    mlflow.log_feedback(
        trace_id=trace_id,
        name=initial_judge.name,
        value=judge_result.value,
        rationale=judge_result.rationale,
    )
Step 2: Collect human feedback
Collect human feedback to teach the judge your quality standards. Choose from the following approaches:
- Databricks UI review
- Programmatic feedback
Use Databricks UI review when:
- You need domain experts to review outputs.
- You want to iteratively refine feedback criteria.
- You're working with a smaller dataset (< 100 examples).
Use the MLflow UI to manually review and provide feedback:
- Navigate to your MLflow experiment in the Databricks workspace.
- Click on the Traces tab to see traces.
- Review each trace and its judge assessment.
- Add human feedback using the UI's feedback interface.
- Ensure the feedback name matches your judge's name attribute exactly (for example, relevance_to_query for a built-in RelevanceToQuery instance or product_quality for the custom judge above).
Use programmatic feedback when:
- You have pre-existing ground truth labels.
- You're working with large datasets (100+ examples).
- You need reproducible feedback collection.
If you have existing ground truth labels, log them programmatically:
from mlflow.entities import AssessmentSource, AssessmentSourceType
# Your ground truth data
ground_truth_data = [
    {"trace_id": "<trace_id_1>", "label": "excellent", "rationale": "Comprehensive and accurate description"},
    {"trace_id": "<trace_id_2>", "label": "poor", "rationale": "Missing key product features"},
    {"trace_id": "<trace_id_3>", "label": "good", "rationale": "Accurate but could be more detailed"},
    # ... more ground truth labels
]
# Log human feedback for each trace
for item in ground_truth_data:
    mlflow.log_feedback(
        trace_id=item["trace_id"],
        name=initial_judge.name,  # Must match judge name (built-in or custom)
        value=item["label"],
        rationale=item.get("rationale", ""),
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="ground_truth_dataset",
        ),
    )
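If your ground truth labels live in a spreadsheet export rather than in code, you can build the same list of dictionaries from a CSV file. This is a minimal sketch using only the standard library; the column names (trace_id, label, rationale), the sample rows, and the load_ground_truth helper are assumptions for illustration, not a format that MLflow requires.

```python
import csv
import io

# Hypothetical CSV export. The column names and trace IDs are assumptions.
SAMPLE_CSV = """trace_id,label,rationale
tr-001,excellent,Comprehensive and accurate description
tr-002,poor,
,good,Row without a trace ID is skipped
"""


def load_ground_truth(csv_text: str) -> list[dict]:
    """Parse CSV text into the ground_truth_data structure used above."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        trace_id = (row.get("trace_id") or "").strip()
        label = (row.get("label") or "").strip()
        if not trace_id or not label:
            # Skip rows that cannot be attached to a trace or carry no label.
            continue
        rows.append(
            {
                "trace_id": trace_id,
                "label": label,
                "rationale": (row.get("rationale") or "").strip(),
            }
        )
    return rows


ground_truth_data = load_ground_truth(SAMPLE_CSV)
print(f"Loaded {len(ground_truth_data)} labeled rows")
```

The resulting list can be passed to the same logging loop shown above. To read from a file instead of a string, pass the file's text content to load_ground_truth.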
Best practices for feedback collection
- Diverse reviewers: Include multiple domain experts to capture varied perspectives
- Balanced examples: Include at least 30% negative examples (poor/fair ratings)
- Clear rationales: Provide detailed explanations for ratings
- Representative samples: Cover edge cases and common scenarios
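To check the balance guideline before aligning, you can compute the share of negative labels in your feedback. The sketch below is plain Python and assumes the four-level scale from the custom judge example, with poor and fair treated as negative; the negative_share helper and the sample labels are hypothetical.

```python
from collections import Counter

# Assumption: "poor" and "fair" count as negative, matching the guideline above.
NEGATIVE_LABELS = {"poor", "fair"}


def negative_share(labels: list[str]) -> float:
    """Return the fraction of labels that are negative (0.0 for an empty list)."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    negatives = sum(counts[label] for label in NEGATIVE_LABELS)
    return negatives / len(labels)


# Made-up human labels for ten traces.
labels = [
    "excellent", "good", "poor", "good", "fair",
    "excellent", "good", "poor", "good", "excellent",
]

share = negative_share(labels)
if share < 0.30:
    print(f"Only {share:.0%} negative examples; consider labeling more poor/fair cases")
else:
    print(f"{share:.0%} negative examples")
```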
Step 3: Align and register the judge
Once you have sufficient human feedback, align the judge. The same align() method is used for both built-in and custom judges.
- Default optimizer (recommended)
- Explicit optimizer
When you call align() without specifying an optimizer, the MemAlign optimizer is used automatically:
# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id],
    max_results=100,
    return_type="list",
)
if len(traces_for_alignment) >= 10:
    # Align the judge based on human feedback using the default optimizer
    aligned_judge = initial_judge.align(traces_for_alignment)
    # Register the aligned judge for production use.
    # Use a new name to distinguish it from the original judge.
    aligned_judge.register(
        experiment_id=experiment_id,
        name=f"{initial_judge.name}_aligned",
        tags={"alignment_date": "2025-10-23", "num_traces": str(len(traces_for_alignment))},
    )
    print(f"Successfully aligned judge using {len(traces_for_alignment)} traces")
else:
    print(f"Insufficient traces for alignment. Found {len(traces_for_alignment)}, need at least 10")
To choose the optimizer explicitly, create it and pass it as the second argument to align():
from mlflow.genai.judges.optimizers import MemAlignOptimizer
# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=15, return_type="list"
)
# Align the judge using human corrections (minimum 10 traces recommended)
if len(traces_for_alignment) >= 10:
    # Explicitly specify optimizer with custom model configuration
    optimizer = MemAlignOptimizer(model="databricks:/databricks-gpt-oss-120b")
    aligned_judge = initial_judge.align(traces_for_alignment, optimizer)
    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("Judge aligned successfully with human feedback")
else:
    print(f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}")
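The snippets above count every trace returned by search_traces, including traces that no one has reviewed yet. One option is to keep only the traces that carry human feedback under the judge's name before checking the minimum. The sketch below assumes that each feedback object exposes name and source.source_type, and it uses the same search_assessments(type="feedback") call as the validation example in this article. The helpers has_human_feedback and select_labeled_traces are hypothetical, and the stub traces exist only so the example runs without MLflow.

```python
from types import SimpleNamespace  # used only to build stand-in traces


def has_human_feedback(trace, judge_name: str) -> bool:
    """True if the trace has human feedback whose name matches the judge."""
    feedbacks = trace.search_assessments(type="feedback")
    return any(
        f.name == judge_name and f.source.source_type == "HUMAN"
        for f in feedbacks
    )


def select_labeled_traces(traces: list, judge_name: str) -> list:
    """Keep only the traces that a human has labeled for this judge."""
    return [t for t in traces if has_human_feedback(t, judge_name)]


def make_stub_trace(label: str, assessments: list):
    """Stand-in for a real trace; returns the same assessments for any query."""
    return SimpleNamespace(
        label=label, search_assessments=lambda **kwargs: assessments
    )


def make_stub_feedback(name: str, source_type: str):
    return SimpleNamespace(
        name=name, source=SimpleNamespace(source_type=source_type)
    )


stub_traces = [
    make_stub_trace("a", [make_stub_feedback("product_quality", "HUMAN")]),
    make_stub_trace("b", [make_stub_feedback("product_quality", "LLM_JUDGE")]),
    make_stub_trace("c", [make_stub_feedback("Product Quality", "HUMAN")]),
]

labeled = select_labeled_traces(stub_traces, "product_quality")
print(f"{len(labeled)} of {len(stub_traces)} traces have matching human feedback")
```

With real traces, you could pass select_labeled_traces(traces_for_alignment, initial_judge.name) to align() and apply the minimum of 10 to the filtered list.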
Enable detailed logging
To monitor the alignment process, enable debug logging for the optimizer:
import logging
# Enable detailed logging
logging.getLogger("mlflow.genai.judges.optimizers.memalign").setLevel(logging.DEBUG)
# Run alignment with verbose output
aligned_judge = initial_judge.align(traces_for_alignment)
Validate alignment
Validate that alignment improved the judge:
def test_alignment_improvement(
    original_judge, aligned_judge, test_traces: list
) -> dict:
    """Compare judge performance before and after alignment."""
    original_correct = 0
    aligned_correct = 0
    evaluated = 0  # Number of traces that actually carry human feedback
    for trace in test_traces:
        # Get human ground truth from trace assessments
        feedbacks = trace.search_assessments(type="feedback")
        human_feedback = next(
            (f for f in feedbacks if f.source.source_type == "HUMAN"), None
        )
        if not human_feedback:
            continue
        evaluated += 1
        # Get judge evaluations
        # Judges can evaluate entire traces instead of individual inputs/outputs
        original_eval = original_judge(trace=trace)
        aligned_eval = aligned_judge(trace=trace)
        # Check agreement with human
        if original_eval.value == human_feedback.value:
            original_correct += 1
        if aligned_eval.value == human_feedback.value:
            aligned_correct += 1
    # Divide by the traces that were compared, not by all traces passed in
    if evaluated == 0:
        raise ValueError("No traces with human feedback to compare against")
    return {
        "original_accuracy": original_correct / evaluated,
        "aligned_accuracy": aligned_correct / evaluated,
        "improvement": (aligned_correct - original_correct) / evaluated,
    }
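Measuring agreement on the same traces that were used for alignment tends to overstate the improvement. A common remedy is to hold out part of the labeled traces for validation. The following is a minimal sketch in plain Python; the split_traces helper, the 20 percent default, and the fixed seed are assumptions for illustration, and integers stand in for trace objects.

```python
import random


def split_traces(traces: list, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle a copy of the traces and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(traces)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Integers stand in for trace objects in this example.
all_traces = list(range(50))
train_traces, held_out_traces = split_traces(all_traces)
print(f"{len(train_traces)} traces for alignment, {len(held_out_traces)} held out")
```

You would then align on train_traces and pass held_out_traces as test_traces to test_alignment_improvement.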
Create custom alignment optimizers
For specialized alignment strategies, extend the AlignmentOptimizer base class:
from mlflow.genai.judges.base import AlignmentOptimizer, Judge
from mlflow.entities.trace import Trace
class MyCustomOptimizer(AlignmentOptimizer):
    """Custom optimizer implementation for judge alignment."""

    def __init__(self, model: str = None, **kwargs):
        """Initialize your optimizer with custom parameters."""
        self.model = model
        # Add any custom initialization logic

    def align(self, judge: Judge, traces: list[Trace]) -> Judge:
        """
        Implement your alignment algorithm.

        Args:
            judge: The judge to be optimized
            traces: List of traces containing human feedback

        Returns:
            A new Judge instance with improved alignment
        """
        # Your custom alignment logic here
        # 1. Extract feedback from traces
        # 2. Analyze disagreements between judge and human
        # 3. Generate improved instructions
        # 4. Return new judge with better alignment
        # Example: Return judge with modified instructions
        from mlflow.genai.judges import make_judge

        improved_instructions = self._optimize_instructions(judge.instructions, traces)
        return make_judge(
            name=judge.name,
            instructions=improved_instructions,
            model=judge.model,
        )

    def _optimize_instructions(self, instructions: str, traces: list[Trace]) -> str:
        """Your custom optimization logic."""
        # Implement your optimization strategy and return the new instructions.
        # Raising here avoids silently passing None to make_judge().
        raise NotImplementedError("Implement your optimization strategy")
# Create your custom optimizer
custom_optimizer = MyCustomOptimizer(model="your-model")
# Use it for alignment
aligned_judge = initial_judge.align(traces_for_alignment, custom_optimizer)
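As one illustration of what an instruction-optimization step could look like, the sketch below appends reviewer corrections to the judge's instructions as extra guidance. This is a deliberately naive strategy and is not how the MemAlign optimizer works. The append_corrections helper and the correction dictionaries (judge_value, human_value, rationale) are hypothetical, and extracting those values from traces is left out.

```python
def append_corrections(
    instructions: str, corrections: list[dict], max_examples: int = 5
) -> str:
    """Append cases where reviewers disagreed with the judge as guidance.

    Each correction is a dict with the hypothetical keys
    "judge_value", "human_value", and "rationale".
    """
    disagreements = [
        c for c in corrections if c["judge_value"] != c["human_value"]
    ]
    if not disagreements:
        return instructions  # nothing to learn from; keep the instructions as-is
    lines = [instructions.rstrip(), "", "Guidance from human reviewers:"]
    for c in disagreements[:max_examples]:
        lines.append(
            f"- The judge rated {c['judge_value']!r} but reviewers rated "
            f"{c['human_value']!r}: {c['rationale']}"
        )
    return "\n".join(lines)


# Made-up corrections for illustration.
corrections = [
    {"judge_value": "good", "human_value": "poor", "rationale": "Missing key product features"},
    {"judge_value": "excellent", "human_value": "excellent", "rationale": "Agreed"},
]

improved = append_corrections("Rate as: excellent, good, fair, or poor", corrections)
print(improved)
```

A helper like this could back the _optimize_instructions method of a custom optimizer. Capping the number of examples keeps the instructions from growing without bound.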
Limitations
- Judge alignment does not support agent-based or expectation-based evaluation.
Next steps
- Learn about production monitoring to deploy aligned judges at scale.
- See code-based scorers for complementary deterministic metrics.
- Learn more about building customized judges in this Databricks blog.