Align judges with humans

Judge alignment teaches LLM judges to match human evaluation standards through systematic feedback. The process turns generic evaluators into domain-specific experts that understand your particular quality criteria, improving agreement with human assessments by 30 to 50% compared with baseline judges.

Judge alignment follows a three-step workflow:

Generate initial assessments: Create a judge and evaluate traces to establish a baseline.
Collect human feedback: Domain experts review and correct the judge's assessments.
Align and deploy: Use the SIMBA optimizer to improve the judge based on the human feedback.

By default, the system uses the Simplified Multi-Bootstrap Aggregation (SIMBA) optimization strategy, relying on the DSPy implementation to iteratively refine the evaluation instructions.

Requirements

MLflow 3.4.0 or later is required to use the judge alignment feature.

Python
%pip install --upgrade "mlflow[databricks]>=3.4.0"
dbutils.library.restartPython()

A judge created with make_judge().
The name of the human feedback assessment must exactly match the judge name. For example, if your judge is named product_quality, your human feedback must use the same name, product_quality.
Alignment works with judges created using make_judge() with template-based evaluation.
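Because the name match is exact, small differences in case or punctuation are easy to miss. The following is a minimal sketch of a pre-flight check, assuming you have already collected the feedback names as plain strings; `find_mismatched_feedback_names` is a hypothetical helper, not part of MLflow.

```python
def find_mismatched_feedback_names(judge_name: str, feedback_names: list[str]) -> list[str]:
    """Return the distinct feedback names that differ from the judge name.

    The comparison is exact and case-sensitive.
    """
    return sorted({name for name in feedback_names if name != judge_name})


# Hypothetical feedback names gathered from reviewers
mismatches = find_mismatched_feedback_names(
    "product_quality",
    ["product_quality", "Product_Quality", "product-quality", "product_quality"],
)
print(mismatches)
```

An empty list means every feedback name matches the judge name.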

Step 1: Create a judge and generate traces

Create an initial judge and generate traces with assessments. You need at least 10 traces, but 50 to 100 traces give better alignment results.

Python
from mlflow.genai.judges import make_judge
import mlflow

# Create an MLflow experiment for alignment
experiment_id = mlflow.create_experiment("product-quality-alignment")
mlflow.set_experiment(experiment_id=experiment_id)

# Create initial judge with template-based evaluation
initial_judge = make_judge(
    name="product_quality",
    instructions=(
        "Evaluate if the product description in {{ outputs }} "
        "is accurate and helpful for the query in {{ inputs }}. "
        "Rate as: excellent, good, fair, or poor"
    ),
    model="databricks:/databricks-gpt-oss-120b",
)

Generate traces and run the judge:

Python
# Generate traces for alignment (minimum 10, recommended 50+)
traces = []
for i in range(50):
    with mlflow.start_span(f"product_description_{i}") as span:
        # Your application logic here
        query = f"Tell me about product {i}"
        description = generate_product_description(query)  # Replace with your application logic

        # Log inputs and outputs
        span.set_inputs({"query": query})
        span.set_outputs({"description": description})
        traces.append(span.trace_id)

# Run initial judge on all traces
for trace_id in traces:
    trace = mlflow.get_trace(trace_id)
    inputs = trace.data.spans[0].inputs
    outputs = trace.data.spans[0].outputs

    # Generate judge assessment
    judge_result = initial_judge(inputs=inputs, outputs=outputs)

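    # Log judge feedback to the trace, passing along the judge's own source
    # so the assessment is attributed to the LLM judge (step 3 filters on it)
    mlflow.log_feedback(
        trace_id=trace_id,
        name="product_quality",
        value=judge_result.value,
        rationale=judge_result.rationale,
        source=judge_result.source,
    )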
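The loop above calls generate_product_description, which this guide leaves to your application. To try the workflow end to end before wiring in real logic, a hypothetical stand-in such as the following could be defined before running the loop. The function body is an assumption for illustration only.

```python
def generate_product_description(query: str) -> str:
    """Hypothetical stand-in for real application logic (for example, an LLM call)."""
    # Take the last word of queries shaped like "Tell me about product 7"
    product = query.rsplit(" ", 1)[-1]
    return (
        f"Product {product} is a sample catalog item. "
        f"This placeholder description was generated for the query: {query!r}."
    )


print(generate_product_description("Tell me about product 7"))
```

Placeholder output like this is only useful for checking that tracing and logging work; alignment quality depends on realistic application outputs.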

Step 2: Collect human feedback

Collect human feedback to teach the judge your quality standards. Choose one of the following approaches:

Databricks UI review
Programmatic feedback

Use UI review when:

You need domain experts to review the outputs.
You want to iteratively refine the feedback criteria.
You are working with a smaller dataset (fewer than 100 examples).

Use the MLflow UI to manually review traces and provide feedback:

Navigate to your MLflow experiment in the Databricks workspace.
Click the Evaluation tab to view the results.
Review each trace and its judge assessment.
Add human feedback using the feedback interface in the UI.
Make sure the feedback name exactly matches your judge name ("product_quality").

Use programmatic feedback when:

You have pre-existing ground truth data.
You are working with a large dataset (more than 100 examples).
You need reproducible feedback collection.

If you already have ground truth labels, log them programmatically:

Python
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Your ground truth data
ground_truth_data = [
    {"trace_id": traces[0], "label": "excellent", "rationale": "Comprehensive and accurate description"},
    {"trace_id": traces[1], "label": "poor", "rationale": "Missing key product features"},
    {"trace_id": traces[2], "label": "good", "rationale": "Accurate but could be more detailed"},
    # ... more ground truth labels
]

# Log human feedback for each trace
for item in ground_truth_data:
    mlflow.log_feedback(
        trace_id=item["trace_id"],
        name="product_quality",  # Must match judge name
        value=item["label"],
        rationale=item.get("rationale", ""),
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="ground_truth_dataset"
        ),
    )
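Ground truth labels often live in a file rather than in code. As a sketch, the following parses CSV text into the same list-of-dicts shape as ground_truth_data and rejects labels outside the judge's rating scale. The column names (trace_id, label, rationale), the sample trace IDs, and the `load_ground_truth` helper are assumptions for illustration.

```python
import csv
import io

# Rating scale taken from the judge instructions in step 1
ALLOWED_LABELS = {"excellent", "good", "fair", "poor"}


def load_ground_truth(csv_text: str) -> list[dict]:
    """Parse CSV rows into the dict shape used by ground_truth_data, validating labels."""
    rows = []
    # Data rows start on line 2 because line 1 is the header
    for line_number, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        label = row["label"].strip().lower()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"Line {line_number}: unexpected label {row['label']!r}")
        rows.append(
            {
                "trace_id": row["trace_id"].strip(),
                "label": label,
                "rationale": (row.get("rationale") or "").strip(),
            }
        )
    return rows


# Hypothetical CSV content; real trace IDs come from your experiment
sample_csv = """trace_id,label,rationale
tr-001,Excellent,Comprehensive and accurate description
tr-002,poor,Missing key product features
"""
ground_truth_data = load_ground_truth(sample_csv)
print(ground_truth_data[0]["label"])
```

Normalizing labels to lowercase keeps human values comparable with the judge's values, which matters when agreement is later checked with an equality comparison.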

Best practices for feedback collection

Diverse reviewers: Include multiple domain experts to capture varied perspectives.
Balanced examples: Include at least 30% negative examples (poor or fair ratings).
Clear rationales: Provide detailed explanations for the ratings.
Representative examples: Cover both edge cases and common scenarios.
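The balance guideline above can be checked before alignment. This is a small sketch, assuming the human labels are available as a list of strings and that "poor" and "fair" count as negative; `negative_share` is a hypothetical helper.

```python
from collections import Counter

# Labels treated as negative, following the guideline above
NEGATIVE_LABELS = {"poor", "fair"}


def negative_share(labels: list[str]) -> float:
    """Return the fraction of labels that are negative, or 0.0 for an empty list."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    negatives = sum(counts[label] for label in NEGATIVE_LABELS)
    return negatives / len(labels)


# Hypothetical human labels
labels = ["excellent", "good", "poor", "fair", "good",
          "excellent", "poor", "good", "good", "excellent"]
share = negative_share(labels)
print(f"Negative share: {share:.0%} (target: at least 30%)")
```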

Step 3: Align and register the judge

Once you have enough human feedback, align the judge:

Default optimizer (recommended)
Explicit optimizer

MLflow provides a default alignment optimizer based on the DSPy implementation of SIMBA (Simplified Multi-Bootstrap Aggregation). When you call align() without specifying an optimizer, the SIMBA optimizer is used automatically:

Python
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id],
    max_results=100,
    return_type="list"
)

# Filter for traces with both judge and human feedback
# Only traces with both assessments can be used for alignment
valid_traces = []
for trace in traces_for_alignment:
    feedbacks = trace.search_assessments(name="product_quality")
    has_judge = any(f.source.source_type == "LLM_JUDGE" for f in feedbacks)
    has_human = any(f.source.source_type == "HUMAN" for f in feedbacks)
    if has_judge and has_human:
        valid_traces.append(trace)

if len(valid_traces) >= 10:
    # Create SIMBA optimizer with Databricks model
    optimizer = SIMBAAlignmentOptimizer(
        model="databricks:/databricks-gpt-oss-120b"
    )

    # Align the judge based on human feedback
    # (keyword arguments avoid depending on the positional order of align())
    aligned_judge = initial_judge.align(traces=valid_traces, optimizer=optimizer)

    # Register the aligned judge for production use
    aligned_judge.register(
        experiment_id=experiment_id,
        name="product_quality_aligned",
        tags={"alignment_date": "2025-10-23", "num_traces": str(len(valid_traces))}
    )

    print(f"Successfully aligned judge using {len(valid_traces)} traces")
else:
    print(f"Insufficient traces for alignment. Found {len(valid_traces)}, need at least 10")

Python
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=15, return_type="list"
)

# Align the judge using human corrections (minimum 10 traces recommended)
if len(traces_for_alignment) >= 10:
    # Explicitly specify SIMBA with custom model configuration
    optimizer = SIMBAAlignmentOptimizer(model="databricks:/databricks-gpt-oss-120b")
    # Keyword arguments avoid depending on the positional order of align()
    aligned_judge = initial_judge.align(
        traces=traces_for_alignment, optimizer=optimizer
    )

    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("Judge aligned successfully with human feedback")
else:
    print(f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}")
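The first registration snippet above hardcodes the alignment_date tag. As a sketch, the tags could instead be built at run time so the date reflects when alignment actually ran. `build_alignment_tags` is a hypothetical helper; the tag keys are the ones used above.

```python
from datetime import date
from typing import Optional


def build_alignment_tags(num_traces: int, on: Optional[date] = None) -> dict[str, str]:
    """Build string-valued tags recording when alignment ran and on how many traces."""
    run_date = on or date.today()
    return {"alignment_date": run_date.isoformat(), "num_traces": str(num_traces)}


# A fixed date is passed here only to make the printed result reproducible
print(build_alignment_tags(42, on=date(2025, 10, 23)))
```

The resulting dictionary can be passed as the tags argument in the registration call shown earlier.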

Enable detailed logging

To monitor the alignment process, enable debug logging for the SIMBA optimizer:

Python
import logging

# Enable detailed SIMBA logging
logging.getLogger("mlflow.genai.judges.optimizers.simba").setLevel(logging.DEBUG)

# Run alignment with verbose output
aligned_judge = initial_judge.align(traces=valid_traces, optimizer=optimizer)
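Depending on how logging is configured in your environment, DEBUG records may not be displayed unless some handler accepts them. The following sketch attaches a dedicated stream handler to the same logger name used above. The handler name "simba_debug" is an arbitrary choice, and if debug messages already appear, this step is unnecessary and may print them twice.

```python
import logging
import sys

# Logger name taken from the snippet above
simba_logger = logging.getLogger("mlflow.genai.judges.optimizers.simba")
simba_logger.setLevel(logging.DEBUG)

# Attach the handler only once so re-running the cell does not duplicate output
if not any(h.get_name() == "simba_debug" for h in simba_logger.handlers):
    handler = logging.StreamHandler(sys.stdout)
    handler.set_name("simba_debug")
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    simba_logger.addHandler(handler)
```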

Validate alignment

Confirm that alignment improved the judge:

Python

def test_alignment_improvement(
    original_judge, aligned_judge, test_traces: list
) -> dict:
    """Compare judge performance before and after alignment."""

    original_correct = 0
    aligned_correct = 0
    evaluated = 0  # Traces that have human feedback to compare against

    for trace in test_traces:
        # Get human ground truth from trace assessments
        feedbacks = trace.search_assessments(type="feedback")
        human_feedback = next(
            (f for f in feedbacks if f.source.source_type == "HUMAN"), None
        )

        # Skip traces without human feedback; they are excluded from the totals
        if not human_feedback:
            continue
        evaluated += 1

        # Get judge evaluations
        # Judges can evaluate entire traces instead of individual inputs/outputs
        original_eval = original_judge(trace=trace)
        aligned_eval = aligned_judge(trace=trace)

        # Check agreement with human
        if original_eval.value == human_feedback.value:
            original_correct += 1
        if aligned_eval.value == human_feedback.value:
            aligned_correct += 1

    # Avoid dividing by zero when no trace carries human feedback
    if evaluated == 0:
        raise ValueError("No traces with human feedback were found")

    return {
        "original_accuracy": original_correct / evaluated,
        "aligned_accuracy": aligned_correct / evaluated,
        "improvement": (aligned_correct - original_correct) / evaluated,
    }
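Accuracy measured on the same traces that were used for alignment can look better than it really is. One option, which this guide does not prescribe, is to hold out part of the labeled traces before alignment and validate only on those. The sketch below shows such a split on trace IDs; `split_for_validation`, the 20% fraction, and the sample IDs are assumptions.

```python
import random


def split_for_validation(
    items: list, holdout_fraction: float = 0.2, seed: int = 0
) -> tuple[list, list]:
    """Shuffle a copy of items and split it into (alignment_set, holdout_set)."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    holdout_size = max(1, round(len(shuffled) * holdout_fraction)) if shuffled else 0
    return shuffled[holdout_size:], shuffled[:holdout_size]


# Hypothetical trace IDs standing in for labeled traces
trace_ids = [f"tr-{i:03d}" for i in range(50)]
alignment_ids, holdout_ids = split_for_validation(trace_ids)
print(len(alignment_ids), len(holdout_ids))
```

The same function works on a list of trace objects, since it only shuffles and slices. Keep in mind that the alignment set still needs at least 10 traces after the split.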

Create custom alignment optimizers

For specialized alignment strategies, extend the AlignmentOptimizer base class:

Python
from mlflow.genai.judges.base import AlignmentOptimizer, Judge
from mlflow.entities.trace import Trace

class MyCustomOptimizer(AlignmentOptimizer):
    """Custom optimizer implementation for judge alignment."""

    def __init__(self, model: str = None, **kwargs):
        """Initialize your optimizer with custom parameters."""
        self.model = model
        # Add any custom initialization logic

    def align(self, judge: Judge, traces: list[Trace]) -> Judge:
        """
        Implement your alignment algorithm.

        Args:
            judge: The judge to be optimized
            traces: List of traces containing human feedback

        Returns:
            A new Judge instance with improved alignment
        """
        # Your custom alignment logic here
        # 1. Extract feedback from traces
        # 2. Analyze disagreements between judge and human
        # 3. Generate improved instructions
        # 4. Return new judge with better alignment

        # Example: Return judge with modified instructions
        from mlflow.genai.judges import make_judge

        improved_instructions = self._optimize_instructions(judge.instructions, traces)

        return make_judge(
            name=judge.name,
            instructions=improved_instructions,
            model=judge.model,
        )

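    def _optimize_instructions(self, instructions: str, traces: list[Trace]) -> str:
        """Your custom optimization logic."""
        # Placeholder: return the instructions unchanged until you implement
        # your optimization strategy (returning None would break make_judge)
        return instructions

# Create your custom optimizer
custom_optimizer = MyCustomOptimizer(model="your-model")

# Use it for alignment. traces_with_feedback is a list of traces that carry
# human feedback, such as the filtered list built in step 3.
aligned_judge = initial_judge.align(
    traces=traces_with_feedback, optimizer=custom_optimizer
)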
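Steps 2 and 3 in the comments of align() (analyze disagreements, then generate improved instructions) can be prototyped on plain data before touching traces. The sketch below is one naive strategy, not how SIMBA works: it groups human rationales by disagreement pattern and appends them to the instructions as guidance. The record shape and both helper functions are hypothetical.

```python
from collections import defaultdict


def summarize_disagreements(records: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group human rationales by (judge_label, human_label) where the two differ."""
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for record in records:
        if record["judge"] != record["human"]:
            grouped[(record["judge"], record["human"])].append(record["rationale"])
    return dict(grouped)


def append_guidance(instructions: str, disagreements: dict[tuple[str, str], list[str]]) -> str:
    """Append one guidance line per disagreement pattern to the instructions."""
    if not disagreements:
        return instructions
    lines = [instructions.rstrip(), "", "Reviewer guidance:"]
    for (judge_label, human_label), rationales in sorted(disagreements.items()):
        lines.append(
            f"- Cases rated '{judge_label}' were corrected to '{human_label}': "
            + "; ".join(rationales)
        )
    return "\n".join(lines)


# Hypothetical judge and human labels for three traces
records = [
    {"judge": "good", "human": "poor", "rationale": "Missing key product features"},
    {"judge": "excellent", "human": "excellent", "rationale": "Comprehensive and accurate description"},
    {"judge": "good", "human": "poor", "rationale": "Wrong dimensions listed"},
]
disagreements = summarize_disagreements(records)
improved = append_guidance("Rate as: excellent, good, fair, or poor", disagreements)
print(improved)
```

A helper like append_guidance could serve as the body of _optimize_instructions once the records are extracted from the traces' assessments.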

Limitations

Judge alignment does not support agent-based or expectation-based evaluation.

Next steps

Learn about production monitoring to deploy aligned judges at scale.
See code-based scorers for complementary deterministic metrics.
