
Prompt-based judges

Overview

Prompt-based judges enable multi-level quality assessment with customizable choice categories (for example, excellent/good/poor) and optional numeric scoring. Unlike guidelines-based judges, which provide binary pass/fail evaluation, prompt-based judges offer:

  • Graduated scoring levels with numeric mapping for tracking improvement
  • Full prompt control for complex, multi-dimensional evaluation criteria
  • Domain-specific categories tailored to your use case
  • Aggregatable metrics for measuring quality trends over time

When to use

Choose prompt-based judges when you need:

  • Multi-level quality assessment beyond pass/fail
  • Numeric scores for quantitative analysis and version comparison
  • Complex evaluation criteria that require custom categories
  • Aggregated metrics across your dataset

Choose guidelines-based judges when you need:

  • Simple pass/fail compliance evaluation
  • Business stakeholders to write and update criteria without writing code
  • Rapid iteration on evaluation rules
important

Although prompt-based judges can be used as a standalone API/SDK, they must be wrapped in a Scorer to be used by the Evaluation Harness and the production monitoring service.
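
For example, here is a minimal sketch of wrapping a prompt-based judge in a scorer; the tone judge and the inputs/outputs keys are hypothetical and follow the conventions used in the examples later on this page.

Python
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer

# Hypothetical judge; any judge created with custom_prompt_judge works the same way
tone_judge = custom_prompt_judge(
    name="tone",
    prompt_template="""Rate the tone of this response:

<response>{{response}}</response>

[[professional]]: Courteous and appropriate for customers
[[unprofessional]]: Rude, dismissive, or overly casual""",
    numeric_values={"professional": 1.0, "unprofessional": 0.0}
)

@scorer
def tone_scorer(inputs, outputs):
    # The scorer adapts your app's inputs/outputs to the judge's template variables
    return tone_judge(response=outputs.get("response", ""))

# tone_scorer can now be passed to mlflow.genai.evaluate(..., scorers=[tone_scorer])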

Prerequisites for running the examples

  1. Install MLflow and the required package

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0"
  2. Create an MLflow experiment by following the environment setup quickstart.

Example

Here is a simple example that demonstrates the power of prompt-based judges:

Python
from mlflow.genai.judges import custom_prompt_judge

# Create a multi-level quality judge
response_quality_judge = custom_prompt_judge(
    name="response_quality",
    prompt_template="""Evaluate the quality of this customer service response:

<request>{{request}}</request>
<response>{{response}}</response>

Choose the most appropriate rating:

[[excellent]]: Empathetic, complete solution, proactive help offered
[[good]]: Addresses the issue adequately with professional tone
[[poor]]: Incomplete, unprofessional, or misses key concerns""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "poor": 0.0
    }
)

# Direct usage
feedback = response_quality_judge(
    request="My order arrived damaged!",
    response="I'm so sorry to hear that. I've initiated a replacement order that will arrive tomorrow, and issued a full refund. Is there anything else I can help with?"
)

print(feedback.value)  # 1.0
print(feedback.metadata)  # {"string_value": "excellent"}
print(feedback.rationale)  # Detailed explanation of the rating

Core concepts

SDK overview

The custom_prompt_judge function creates a custom LLM judge that evaluates inputs based on your prompt template:

Python
from mlflow.genai.judges import custom_prompt_judge

judge = custom_prompt_judge(
    name="formality",
    prompt_template="...",  # Your custom prompt with {{variables}} and [[choices]]
    numeric_values={"formal": 1.0, "informal": 0.0}  # Optional numeric mapping
)

# Returns an mlflow.entities.Feedback object
feedback = judge(request="Hello", response="Hey there!")

Parameters

  • name (str, required): Name of the assessment, displayed in the MLflow UI and used to identify the judge's result.
  • prompt_template (str, required): Template string containing {{variables}} (placeholders for dynamic content) and [[choices]] (required choice definitions the judge must select from).
  • numeric_values (dict[str, float] | None, optional): Maps choice names to numeric scores (a 0-1 scale is recommended). Without it, the judge returns the string choice value; with it, the judge returns the numeric score and stores the string choice in metadata.
  • model (str | None, optional): Specific judge model to use (defaults to MLflow's optimized judge model).

Why use numeric mapping?

When you have multiple-choice labels (for example, "excellent", "good", "poor"), string values make it hard to track quality improvements across evaluation runs.

Numeric mapping enables:

  • Quantitative comparison: see whether average quality improved from 0.6 to 0.8
  • Metric aggregation: calculate average scores across your entire dataset
  • Version comparison: track whether changes improved or degraded quality
  • Threshold-based monitoring: set alerts when quality drops below acceptable levels

Without numeric values, you can only see label distributions (for example, 40% "good", 60% "poor"), which makes it harder to measure overall improvement.
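
As a minimal illustration, numeric values let you reduce a batch of judge calls to a single average that can be compared across runs; the snippet below reuses the response_quality_judge from the example above with a small, hypothetical set of request/response pairs.

Python
# Hypothetical batch of (request, response) pairs to evaluate
eval_pairs = [
    ("Where is my order?", "It shipped yesterday and arrives tomorrow morning."),
    ("My order arrived damaged!", "Sorry about that."),
]

# Each call returns a Feedback whose .value is the mapped numeric score
scores = [
    response_quality_judge(request=req, response=resp).value
    for req, resp in eval_pairs
]

average_quality = sum(scores) / len(scores)
print(f"Average quality: {average_quality:.2f}")  # compare this number across app versions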

Return values

The function returns a callable that:

  • Accepts keyword arguments matching the {{variables}} in your prompt template
  • Returns an mlflow.entities.Feedback object containing:
    • value: The selected choice (string), or the numeric score if numeric_values is provided (see the sketch after this list)
    • rationale: The LLM's explanation for its choice
    • metadata: Additional information, including the string choice when numeric values are used
    • name: The name you provided
    • error: Error details if the evaluation fails
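
A minimal sketch of the two return modes, using a hypothetical formality judge similar to the one in the SDK overview above:

Python
from mlflow.genai.judges import custom_prompt_judge

formality_template = """Rate the formality of this response:

<response>{{response}}</response>

[[formal]]: Professional language, proper grammar, no contractions
[[informal]]: Casual language, contractions, colloquialisms"""

# Without numeric_values: feedback.value is the choice string itself
label_judge = custom_prompt_judge(name="formality", prompt_template=formality_template)
feedback = label_judge(response="Hey there!")
print(feedback.value)  # "informal"

# With numeric_values: feedback.value is the mapped score and the
# original choice string is stored in feedback.metadata
score_judge = custom_prompt_judge(
    name="formality_score",
    prompt_template=formality_template,
    numeric_values={"formal": 1.0, "informal": 0.0}
)
feedback = score_judge(response="Hey there!")
print(feedback.value)     # 0.0
print(feedback.metadata)  # {"string_value": "informal"}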

Prompt template requirements

Choice definition format

Choices must be defined using double square brackets [[choice_name]]:

Python
prompt_template = """Evaluate the response formality:

<request>{{request}}</request>
<response>{{response}}</response>

Select one category:

[[formal]]: Professional language, proper grammar, no contractions
[[semi_formal]]: Mix of professional and conversational elements
[[informal]]: Casual language, contractions, colloquialisms"""

Variable placeholders

Use double curly braces {{variable}} for dynamic content:

Python
prompt_template = """Assess if the response uses appropriate sources:

Question: {{question}}
Response: {{response}}
Available Sources: {{retrieved_documents}}
Citation Policy: {{citation_policy}}

Choose one:

[[well_cited]]: All claims properly cite available sources
[[partially_cited]]: Some claims cite sources, others do not
[[poorly_cited]]: Claims lack proper citations"""

Common evaluation patterns

Likert scale pattern

Create a standard 5- or 7-point satisfaction scale:

Python
satisfaction_judge = custom_prompt_judge(
    name="customer_satisfaction",
    prompt_template="""Based on this interaction, rate the likely customer satisfaction:

Customer Request: {{request}}
Agent Response: {{response}}

Select satisfaction level:

[[very_satisfied]]: Response exceeds expectations with exceptional service
[[satisfied]]: Response meets expectations adequately
[[neutral]]: Response is acceptable but unremarkable
[[dissatisfied]]: Response fails to meet basic expectations
[[very_dissatisfied]]: Response is unhelpful or problematic""",
    numeric_values={
        "very_satisfied": 1.0,
        "satisfied": 0.75,
        "neutral": 0.5,
        "dissatisfied": 0.25,
        "very_dissatisfied": 0.0
    }
)

Rubric-based scoring

Implement detailed scoring rubrics with clear criteria:

Python
code_review_rubric = custom_prompt_judge(
    name="code_review_rubric",
    prompt_template="""Evaluate this code review using our quality rubric:

Original Code: {{original_code}}
Review Comments: {{review_comments}}
Code Type: {{code_type}}

Score the review quality:

[[comprehensive]]: Identifies all issues including edge cases, security concerns, performance implications, and suggests specific improvements with examples
[[thorough]]: Catches major issues and most minor ones, provides good suggestions but may miss some edge cases
[[adequate]]: Identifies obvious issues and provides basic feedback, misses subtle problems
[[superficial]]: Only catches surface-level issues, feedback is vague or generic
[[inadequate]]: Misses critical issues or provides incorrect feedback""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.8,
        "adequate": 0.6,
        "superficial": 0.3,
        "inadequate": 0.0
    }
)

Real-world examples

Customer service quality

Python
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer
import mlflow

# Issue resolution status judge
resolution_judge = custom_prompt_judge(
    name="issue_resolution",
    prompt_template="""Evaluate if the customer's issue was resolved:

Customer Message: {{customer_message}}
Agent Response: {{agent_response}}
Issue Type: {{issue_type}}

Rate the resolution status:

[[fully_resolved]]: Issue completely addressed with clear solution provided
[[partially_resolved]]: Some progress made but follow-up needed
[[unresolved]]: Issue not addressed or solution unclear
[[escalated]]: Appropriately escalated to higher support tier""",
    numeric_values={
        "fully_resolved": 1.0,
        "partially_resolved": 0.5,
        "unresolved": 0.0,
        "escalated": 0.7  # Positive score for appropriate escalation
    }
)

# Empathy and tone judge
empathy_judge = custom_prompt_judge(
    name="empathy_score",
    prompt_template="""Assess the emotional intelligence of the response:

Customer Emotion: {{customer_emotion}}
Agent Response: {{agent_response}}

Rate the empathy shown:

[[exceptional]]: Acknowledges emotions, validates concerns, shows genuine care
[[good]]: Shows understanding and appropriate concern
[[adequate]]: Professional but somewhat impersonal
[[poor]]: Cold, dismissive, or inappropriate emotional response""",
    numeric_values={
        "exceptional": 1.0,
        "good": 0.75,
        "adequate": 0.5,
        "poor": 0.0
    }
)

# Create a comprehensive customer service scorer
@scorer
def customer_service_quality(inputs, outputs, trace):
    """Comprehensive customer service evaluation"""
    feedbacks = []

    # Evaluate resolution status
    feedbacks.append(resolution_judge(
        customer_message=inputs.get("message", ""),
        agent_response=outputs.get("response", ""),
        issue_type=inputs.get("issue_type", "general")
    ))

    # Evaluate empathy if customer shows emotion
    customer_emotion = inputs.get("detected_emotion", "neutral")
    if customer_emotion in ["frustrated", "angry", "upset", "worried"]:
        feedbacks.append(empathy_judge(
            customer_emotion=customer_emotion,
            agent_response=outputs.get("response", "")
        ))

    return feedbacks

# Example evaluation
eval_data = [
    {
        "inputs": {
            "message": "I've been waiting 3 weeks for my refund! This is unacceptable!",
            "issue_type": "refund",
            "detected_emotion": "angry"
        },
        "outputs": {
            "response": "I completely understand your frustration - 3 weeks is far too long to wait for a refund. I'm escalating this to our finance team immediately. You'll receive your refund within 24 hours, plus a $50 credit for the inconvenience. I'm also sending you my direct email so you can reach me if there are any other delays."
        }
    }
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[customer_service_quality]
)

Content quality evaluation

Python
# Technical documentation quality judge
doc_quality_judge = custom_prompt_judge(
    name="documentation_quality",
    prompt_template="""Evaluate this technical documentation:

Content: {{content}}
Target Audience: {{audience}}
Documentation Type: {{doc_type}}

Rate the documentation quality:

[[excellent]]: Clear, complete, well-structured with examples, appropriate depth
[[good]]: Covers topic well, mostly clear, could use minor improvements
[[fair]]: Basic coverage, some unclear sections, missing important details
[[poor]]: Confusing, incomplete, or significantly flawed""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "fair": 0.4,
        "poor": 0.0
    }
)

# Marketing copy effectiveness
marketing_judge = custom_prompt_judge(
    name="marketing_effectiveness",
    prompt_template="""Rate this marketing copy's effectiveness:

Copy: {{copy}}
Product: {{product}}
Target Demographic: {{target_demographic}}
Call to Action: {{cta}}

Evaluate effectiveness:

[[highly_effective]]: Compelling, clear value prop, strong CTA, perfect for audience
[[effective]]: Good messaging, decent CTA, reasonably targeted
[[moderately_effective]]: Some good elements but lacks impact or clarity
[[ineffective]]: Weak messaging, unclear value, poor audience fit""",
    numeric_values={
        "highly_effective": 1.0,
        "effective": 0.7,
        "moderately_effective": 0.4,
        "ineffective": 0.0
    }
)

Code review quality

Python
# Security review judge
security_review_judge = custom_prompt_judge(
    name="security_review_quality",
    prompt_template="""Evaluate the security aspects of this code review:

Original Code: {{code}}
Review Comments: {{review_comments}}
Security Vulnerabilities Found: {{vulnerabilities_mentioned}}

Rate the security review quality:

[[comprehensive]]: Identifies all security issues, explains risks, suggests secure alternatives
[[thorough]]: Catches major security flaws, good explanations
[[basic]]: Identifies obvious security issues only
[[insufficient]]: Misses critical security vulnerabilities""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.75,
        "basic": 0.4,
        "insufficient": 0.0
    }
)

# Code clarity feedback judge
code_clarity_judge = custom_prompt_judge(
    name="code_clarity_feedback",
    prompt_template="""Assess the code review's feedback on readability:

Original Code Complexity: {{complexity_score}}
Review Feedback: {{review_comments}}
Readability Issues Identified: {{readability_issues}}

Rate the clarity feedback:

[[excellent]]: Identifies all clarity issues, suggests specific improvements, considers maintainability
[[good]]: Points out main clarity problems with helpful suggestions
[[adequate]]: Basic feedback on obvious readability issues
[[minimal]]: Superficial or missing important clarity feedback""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "adequate": 0.4,
        "minimal": 0.0
    }
)

Healthcare communication

Python
# Patient communication appropriateness
patient_comm_judge = custom_prompt_judge(
    name="patient_communication",
    prompt_template="""Evaluate this healthcare provider's response to a patient:

Patient Question: {{patient_question}}
Provider Response: {{provider_response}}
Patient Health Literacy Level: {{health_literacy}}
Sensitive Topics: {{sensitive_topics}}

Rate communication appropriateness:

[[excellent]]: Clear, compassionate, appropriate language level, addresses concerns fully
[[good]]: Generally clear and caring, minor room for improvement
[[acceptable]]: Adequate but could be clearer or more empathetic
[[poor]]: Unclear, uses too much jargon, or lacks appropriate empathy""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "acceptable": 0.5,
        "poor": 0.0
    }
)

# Clinical note quality
clinical_note_judge = custom_prompt_judge(
    name="clinical_note_quality",
    prompt_template="""Assess this clinical note's quality:

Note Content: {{note_content}}
Note Type: {{note_type}}
Required Elements: {{required_elements}}

Rate the clinical documentation:

[[comprehensive]]: All required elements present, clear, follows standards, actionable
[[complete]]: Most elements present, generally clear, minor gaps
[[incomplete]]: Missing important elements or lacks clarity
[[deficient]]: Significant gaps, unclear, or doesn't meet documentation standards""",
    numeric_values={
        "comprehensive": 1.0,
        "complete": 0.7,
        "incomplete": 0.3,
        "deficient": 0.0
    }
)

Pairwise response comparison

Use prompt-based judges to compare two responses and determine which is better. This is useful for A/B testing, model comparison, or preference learning.

note

Pairwise comparison judges cannot be used with mlflow.evaluate() or as scorers, because they evaluate two responses simultaneously rather than a single response. Use them directly for comparative analysis.

Python
from mlflow.genai.judges import custom_prompt_judge

# Response preference judge
preference_judge = custom_prompt_judge(
    name="response_preference",
    prompt_template="""Compare these two responses to the same question and determine which is better:

Question: {{question}}

Response A: {{response_a}}

Response B: {{response_b}}

Evaluation Criteria:
1. Accuracy and completeness of information
2. Clarity and ease of understanding
3. Helpfulness and actionability
4. Appropriate tone for the context

Choose your preference:

[[strongly_prefer_a]]: Response A is significantly better across most criteria
[[slightly_prefer_a]]: Response A is marginally better overall
[[equal]]: Both responses are equally good (or equally poor)
[[slightly_prefer_b]]: Response B is marginally better overall
[[strongly_prefer_b]]: Response B is significantly better across most criteria""",
    numeric_values={
        "strongly_prefer_a": -1.0,
        "slightly_prefer_a": -0.5,
        "equal": 0.0,
        "slightly_prefer_b": 0.5,
        "strongly_prefer_b": 1.0
    }
)

# Example usage for model comparison
question = "How do I improve my GenAI app's response quality?"

response_model_v1 = """To improve response quality, you should:
1. Add more training data
2. Fine-tune your model
3. Use better prompts"""

response_model_v2 = """To improve your GenAI app's response quality, consider these strategies:

1. **Enhance your prompts**: Use clear, specific instructions with examples
2. **Implement evaluation**: Use MLflow's LLM judges to measure quality systematically
3. **Collect feedback**: Gather user feedback to identify improvement areas
4. **Iterate on weak areas**: Focus on responses that score poorly
5. **A/B test changes**: Compare versions to ensure improvements

Start with evaluation to establish a baseline, then iterate based on data."""

# Compare responses
feedback = preference_judge(
    question=question,
    response_a=response_model_v1,
    response_b=response_model_v2
)

print(f"Preference: {feedback.metadata['string_value']}")  # "strongly_prefer_b"
print(f"Score: {feedback.value}")  # 1.0
print(f"Rationale: {feedback.rationale}")

Specialized comparison judges

Python
# Technical accuracy comparison for documentation
tech_comparison_judge = custom_prompt_judge(
    name="technical_comparison",
    prompt_template="""Compare these two technical explanations:

Topic: {{topic}}
Target Audience: {{audience}}

Explanation A: {{explanation_a}}

Explanation B: {{explanation_b}}

Focus on:
- Technical accuracy and precision
- Appropriate depth for the audience
- Use of examples and analogies
- Completeness without overwhelming detail

Which explanation is better?

[[a_much_better]]: A is significantly more accurate and appropriate
[[a_slightly_better]]: A is marginally better in accuracy or clarity
[[equivalent]]: Both are equally good technically
[[b_slightly_better]]: B is marginally better in accuracy or clarity
[[b_much_better]]: B is significantly more accurate and appropriate""",
    numeric_values={
        "a_much_better": -1.0,
        "a_slightly_better": -0.5,
        "equivalent": 0.0,
        "b_slightly_better": 0.5,
        "b_much_better": 1.0
    }
)

# Empathy comparison for customer service
empathy_comparison_judge = custom_prompt_judge(
    name="empathy_comparison",
    prompt_template="""Compare the emotional intelligence of these customer service responses:

Customer Situation: {{situation}}
Customer Emotion: {{emotion}}

Agent Response A: {{response_a}}

Agent Response B: {{response_b}}

Evaluate which response better:
- Acknowledges the customer's emotions
- Shows genuine understanding and care
- Offers appropriate emotional support
- Maintains professional boundaries

Which response shows better emotional intelligence?

[[a_far_superior]]: A shows much better emotional intelligence
[[a_better]]: A is somewhat more empathetic
[[both_good]]: Both show good emotional intelligence
[[b_better]]: B is somewhat more empathetic
[[b_far_superior]]: B shows much better emotional intelligence""",
    numeric_values={
        "a_far_superior": -1.0,
        "a_better": -0.5,
        "both_good": 0.0,
        "b_better": 0.5,
        "b_far_superior": 1.0
    }
)

Practical comparison workflow

Python
# Compare outputs from different prompt versions
def compare_prompt_versions(test_cases, prompt_v1, prompt_v2, model_client):
    """Compare two prompt versions across multiple test cases"""
    results = []

    for test_case in test_cases:
        # Generate responses with each prompt
        response_v1 = model_client.generate(prompt_v1.format(**test_case))
        response_v2 = model_client.generate(prompt_v2.format(**test_case))

        # Compare responses
        feedback = preference_judge(
            question=test_case["question"],
            response_a=response_v1,
            response_b=response_v2
        )

        results.append({
            "question": test_case["question"],
            "preference": feedback.metadata["string_value"],
            "score": feedback.value,
            "rationale": feedback.rationale
        })

    # Analyze results
    avg_score = sum(r["score"] for r in results) / len(results)

    if avg_score < -0.2:
        print(f"Prompt V1 is preferred (avg score: {avg_score:.2f})")
    elif avg_score > 0.2:
        print(f"Prompt V2 is preferred (avg score: {avg_score:.2f})")
    else:
        print(f"Prompts perform similarly (avg score: {avg_score:.2f})")

    return results

# Compare different model outputs
def compare_models(questions, model_a, model_b, comparison_judge):
    """Compare two models across a set of questions"""
    win_counts = {"model_a": 0, "model_b": 0, "tie": 0}

    for question in questions:
        response_a = model_a.generate(question)
        response_b = model_b.generate(question)

        feedback = comparison_judge(
            question=question,
            response_a=response_a,
            response_b=response_b
        )

        # Count wins based on preference strength
        if feedback.value <= -0.5:
            win_counts["model_a"] += 1
        elif feedback.value >= 0.5:
            win_counts["model_b"] += 1
        else:
            win_counts["tie"] += 1

    print(f"Model comparison results: {win_counts}")
    return win_counts

Advanced usage patterns

Conditional scoring

Implement different evaluation criteria based on context:

Python
@scorer
def adaptive_quality_scorer(inputs, outputs, trace):
    """Applies different judges based on context"""

    # Determine which judge to use based on input characteristics
    query_type = inputs.get("query_type", "general")

    if query_type == "technical":
        judge = custom_prompt_judge(
            name="technical_response",
            prompt_template="""Rate this technical response:

Question: {{question}}
Response: {{response}}
Required Depth: {{depth_level}}

[[expert]]: Demonstrates deep expertise, includes advanced concepts
[[proficient]]: Good technical accuracy, appropriate depth
[[basic]]: Correct but lacks depth or nuance
[[incorrect]]: Contains technical errors or misconceptions""",
            numeric_values={
                "expert": 1.0,
                "proficient": 0.75,
                "basic": 0.5,
                "incorrect": 0.0
            }
        )

        return judge(
            question=inputs["question"],
            response=outputs["response"],
            depth_level=inputs.get("required_depth", "intermediate")
        )

    elif query_type == "support":
        judge = custom_prompt_judge(
            name="support_response",
            prompt_template="""Rate this support response:

Issue: {{issue}}
Response: {{response}}
Customer Status: {{customer_status}}

[[excellent]]: Solves issue completely, proactive, appropriate for customer status
[[good]]: Addresses issue well, professional
[[fair]]: Partially helpful but incomplete
[[poor]]: Unhelpful or inappropriate""",
            numeric_values={
                "excellent": 1.0,
                "good": 0.7,
                "fair": 0.4,
                "poor": 0.0
            }
        )

        return judge(
            issue=inputs["question"],
            response=outputs["response"],
            customer_status=inputs.get("customer_status", "standard")
        )

Score aggregation strategies

Combine scores from multiple judges intelligently:

Python
@scorer
def weighted_quality_scorer(inputs, outputs, trace):
    """Combines multiple judges with weighted scoring"""

    # Define judges and their weights
    judges_config = [
        {
            "judge": custom_prompt_judge(
                name="accuracy",
                prompt_template="...",  # Your accuracy template
                numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0}
            ),
            "weight": 0.4,
            "args": {"question": inputs["question"], "response": outputs["response"]}
        },
        {
            "judge": custom_prompt_judge(
                name="completeness",
                prompt_template="...",  # Your completeness template
                numeric_values={"complete": 1.0, "partial": 0.5, "incomplete": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"], "requirements": inputs.get("requirements", [])}
        },
        {
            "judge": custom_prompt_judge(
                name="clarity",
                prompt_template="...",  # Your clarity template
                numeric_values={"clear": 1.0, "adequate": 0.6, "unclear": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"]}
        }
    ]

    # Collect all feedbacks
    feedbacks = []
    weighted_score = 0.0

    for config in judges_config:
        feedback = config["judge"](**config["args"])
        feedbacks.append(feedback)

        # Add to weighted score if numeric
        if isinstance(feedback.value, (int, float)):
            weighted_score += feedback.value * config["weight"]

    # Add composite score as additional feedback
    from mlflow.entities import Feedback
    composite_feedback = Feedback(
        name="weighted_quality_score",
        value=weighted_score,
        rationale=f"Weighted combination of {len(judges_config)} quality dimensions"
    )
    feedbacks.append(composite_feedback)

    return feedbacks

Best practices

Designing effective choices

1. Make choices mutually exclusive and exhaustive

Python
# Good - clear distinctions, covers all cases
"""[[approved]]: Meets all requirements, ready for production
[[needs_revision]]: Has issues that must be fixed before approval
[[rejected]]: Fundamental flaws, requires complete rework"""

# Bad - overlapping and ambiguous
"""[[good]]: The response is good
[[okay]]: The response is okay
[[fine]]: The response is fine"""

2. Provide specific criteria for each choice

Python
# Good - specific, measurable criteria
"""[[secure]]: No vulnerabilities, follows all security best practices, includes input validation
[[mostly_secure]]: Minor security concerns that should be addressed but aren't critical
[[insecure]]: Contains vulnerabilities that could be exploited"""

# Bad - vague criteria
"""[[secure]]: Looks secure
[[not_secure]]: Has problems"""

3. Order choices logically (best to worst)

Python
# Good - clear progression
numeric_values = {
    "exceptional": 1.0,
    "good": 0.75,
    "satisfactory": 0.5,
    "needs_improvement": 0.25,
    "unacceptable": 0.0
}

Numeric scale design

1. Use a consistent scale across judges

Python
# All judges use 0-1 scale
quality_judge = custom_prompt_judge(..., numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0})
accuracy_judge = custom_prompt_judge(..., numeric_values={"accurate": 1.0, "partial": 0.5, "wrong": 0.0})

2. Leave gaps for future refinement

Python
# Allows adding intermediate levels later
numeric_values = {
    "excellent": 1.0,
    "good": 0.7,  # Gap allows for "very_good" at 0.85
    "fair": 0.4,  # Gap allows for "satisfactory" at 0.55
    "poor": 0.0
}

3. Consider domain-specific scales

Python
# Academic grading scale
academic_scale = {
    "A": 4.0,
    "B": 3.0,
    "C": 2.0,
    "D": 1.0,
    "F": 0.0
}

# Net Promoter Score scale
nps_scale = {
    "promoter": 1.0,   # 9-10
    "passive": 0.0,    # 7-8
    "detractor": -1.0  # 0-6
}

Prompt engineering tips

1. Structure prompts clearly

Python
prompt_template = """[Clear Task Description]
Evaluate the technical accuracy of this response.

[Context Section]
Question: {{question}}
Response: {{response}}
Technical Domain: {{domain}}

[Evaluation Criteria]
Consider: factual accuracy, appropriate depth, correct terminology

[Choice Definitions]
[[accurate]]: All technical facts correct, appropriate level of detail
[[mostly_accurate]]: Minor inaccuracies that don't affect core understanding
[[inaccurate]]: Contains significant errors or misconceptions"""

2. Include examples when helpful

Python
prompt_template = """Assess the urgency level of this support ticket.

Ticket: {{ticket_content}}

Examples of each level:
- Critical: System down, data loss, security breach
- High: Major feature broken, blocking work
- Medium: Performance issues, non-critical bugs
- Low: Feature requests, minor UI issues

Choose urgency level:
[[critical]]: Immediate attention required, business impact
[[high]]: Urgent, significant user impact
[[medium]]: Important but not urgent
[[low]]: Can be addressed in normal workflow"""

Comparison with guidelines-based judges

  • Evaluation type: guidelines-based judges return a binary pass/fail; prompt-based judges return multi-level categories.
  • Scoring: "yes" or "no" versus custom choices with optional numeric values.
  • Best for: compliance and policy adherence versus quality assessment and satisfaction ratings.
  • Iteration speed: very fast (just update the guideline text) versus moderate (you may need to adjust the choices).
  • Business-user friendly: ✅ highly natural-language rules versus ⚠️ medium (requires understanding the choices and the full prompt).
  • Aggregation: count pass/fail rates versus calculate averages and track trends.

Validation and error handling

Choice validation

The judge validates that:

  • Choices are defined correctly using the [[choice_name]] format
  • Choice names are alphanumeric (underscores allowed)
  • At least one choice is defined in the template
Python
# This will raise an error - no choices defined
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="Rate the response: {{response}}"
)
# ValueError: Prompt template must include choices denoted with [[CHOICE_NAME]]

Numeric value validation

When using numeric_values, every choice must be mapped:

Python
# This will raise an error - missing choice in numeric_values
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="""Choose:
[[option_a]]: First option
[[option_b]]: Second option""",
    numeric_values={"option_a": 1.0}  # Missing option_b
)
# ValueError: numeric_values keys must match the choices

Template variable validation

Missing template variables raise an error at call time:

Python
judge = custom_prompt_judge(
    name="test",
    prompt_template="{{request}} {{response}} [[good]]: Good"
)

# This will raise an error - missing 'response' variable
judge(request="Hello")
# KeyError: Template variable 'response' not found

Next steps