
Prompt-based judges

Overview

Prompt-based judges enable multi-level quality assessment with customizable choice categories (e.g., excellent/good/poor) and optional numeric scoring. Unlike guidelines-based judges, which provide binary pass/fail evaluation, prompt-based judges offer:

  • Graduated scoring levels with numeric mappings for better tracking
  • Full prompt control for complex, multi-dimensional evaluation criteria
  • Domain-specific categories tailored to your use case
  • Aggregatable metrics to measure quality trends over time

When to use

Choose prompt-based judges when you need:

  • Multi-level quality assessment beyond pass/fail
  • Numeric scores for quantitative analysis and version comparison
  • Complex evaluation criteria requiring custom categories
  • Metrics you can aggregate across datasets

Choose guidelines-based judges when you need:

  • Simple pass/fail compliance evaluation
  • Business stakeholders to write and update criteria without coding
  • Rapid iteration on evaluation rules
important

Prompt-based judges can be used as a standalone API/SDK, but to use them with the Evaluation Harness and the production monitoring service, they must be wrapped in a Scorer.

Prerequisites for running the examples

  1. Install MLflow and required packages

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0"
  2. Follow the environment setup quickstart to create an MLflow experiment.

Here is a quick example that demonstrates the power of prompt-based judges:

Python
from mlflow.genai.judges import custom_prompt_judge

# Create a multi-level quality judge
response_quality_judge = custom_prompt_judge(
    name="response_quality",
    prompt_template="""Evaluate the quality of this customer service response:

<request>{{request}}</request>
<response>{{response}}</response>

Choose the most appropriate rating:

[[excellent]]: Empathetic, complete solution, proactive help offered
[[good]]: Addresses the issue adequately with professional tone
[[poor]]: Incomplete, unprofessional, or misses key concerns""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "poor": 0.0
    }
)

# Direct usage
feedback = response_quality_judge(
    request="My order arrived damaged!",
    response="I'm so sorry to hear that. I've initiated a replacement order that will arrive tomorrow, and issued a full refund. Is there anything else I can help with?"
)

print(feedback.value)  # 1.0
print(feedback.metadata)  # {"string_value": "excellent"}
print(feedback.rationale)  # Detailed explanation of the rating

Core concepts

SDK overview

The custom_prompt_judge function creates a custom LLM judge that evaluates inputs based on a prompt template.

Python
from mlflow.genai.judges import custom_prompt_judge

judge = custom_prompt_judge(
    name="formality",
    prompt_template="...",  # Your custom prompt with {{variables}} and [[choices]]
    numeric_values={"formal": 1.0, "informal": 0.0}  # Optional numeric mapping
)

# Returns an mlflow.entities.Feedback object
feedback = judge(request="Hello", response="Hey there!")

Parameters

  • name (str, required): Name of the assessment, displayed in the MLflow UI and used to identify the judge's outputs.
  • prompt_template (str, required): Template string containing {{variables}} (placeholders for dynamic content) and [[choices]] (required choice definitions the judge must select from).
  • numeric_values (dict[str, float] | None, optional): Maps choice names to numeric scores (a 0-1 scale is recommended). If omitted, the string choice value is returned; if provided, the numeric score is returned and the string choice is stored in metadata.
  • model (str | None, optional): A specific judge model to use (defaults to MLflow's optimized judge model).

Why use numeric mappings

When you have multiple choice labels (e.g., "excellent", "good", "poor"), string values make it hard to track quality improvements across evaluation runs.

Numeric mappings enable:

  • Quantitative comparison: see whether average quality improved from 0.6 to 0.8
  • Metric aggregation: compute mean scores across entire datasets
  • Version comparison: track whether a change improved or degraded quality
  • Threshold-based monitoring: set alerts when quality drops below acceptable levels

Without numeric values, you can only see label distributions (e.g., 40% "good", 60% "poor"), which makes overall improvement hard to measure.
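To make the contrast concrete, here is a small standalone sketch (plain Python, no MLflow calls; the label lists and the mapping are invented for illustration) of how a numeric mapping turns label outputs into averages you can compare across runs:

```python
# Hypothetical judge labels from two evaluation runs of the same dataset
run_v1 = ["good", "poor", "good", "poor", "poor"]
run_v2 = ["excellent", "good", "good", "excellent", "poor"]

# The same mapping you would pass as numeric_values
numeric = {"excellent": 1.0, "good": 0.7, "poor": 0.0}

def average_quality(labels):
    """Map each label to its score and average them."""
    return sum(numeric[label] for label in labels) / len(labels)

print(round(average_quality(run_v1), 2))  # 0.28
print(round(average_quality(run_v2), 2))  # 0.68
```

The label distributions alone ("good" vs. "poor" percentages) would not directly tell you that v2 roughly doubled average quality; the numeric averages do.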

Return value

The function returns a callable that:

  • Accepts keyword arguments matching the {{variables}} in your prompt template
  • Returns an mlflow.entities.Feedback object containing:
    • value: the selected choice (string), or its numeric score if numeric_values is provided
    • rationale: the LLM's explanation of its choice
    • metadata: additional information, including the string choice when numeric values are used
    • name: the name you specified
    • error: details of the error if the evaluation failed
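Because every call returns an object with these fields, downstream code can handle successes and failures uniformly. The helper below is a hypothetical sketch (summarize_feedback is not an MLflow API, and the stand-in object merely mimics the fields listed above):

```python
from types import SimpleNamespace

def summarize_feedback(feedback):
    """Reduce a Feedback-like object to a compact dict, tolerating failed evaluations."""
    if getattr(feedback, "error", None):
        # Evaluation failed: surface the error instead of a score
        return {"name": feedback.name, "error": str(feedback.error)}
    summary = {"name": feedback.name, "value": feedback.value}
    # With numeric_values, the original string choice is kept in metadata
    label = (feedback.metadata or {}).get("string_value")
    if label is not None:
        summary["label"] = label
    return summary

# Stand-in for the object a judge would return (real calls yield mlflow.entities.Feedback)
fb = SimpleNamespace(name="response_quality", value=0.7,
                     metadata={"string_value": "good"},
                     rationale="Professional tone, adequate solution", error=None)
print(summarize_feedback(fb))  # {'name': 'response_quality', 'value': 0.7, 'label': 'good'}
```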

Prompt template requirements

Choice definition format

Choices must be defined using double square brackets: [[choice_name]]

Python
prompt_template = """Evaluate the response formality:

<request>{{request}}</request>
<response>{{response}}</response>

Select one category:

[[formal]]: Professional language, proper grammar, no contractions
[[semi_formal]]: Mix of professional and conversational elements
[[informal]]: Casual language, contractions, colloquialisms"""

Variable placeholders

Use double curly braces, {{variable}}, for dynamic content:

Python
prompt_template = """Assess if the response uses appropriate sources:

Question: {{question}}
Response: {{response}}
Available Sources: {{retrieved_documents}}
Citation Policy: {{citation_policy}}

Choose one:

[[well_cited]]: All claims properly cite available sources
[[partially_cited]]: Some claims cite sources, others do not
[[poorly_cited]]: Claims lack proper citations"""

Common evaluation patterns

Likert scale pattern

Create a standard 5-point or 7-point satisfaction scale:

Python
satisfaction_judge = custom_prompt_judge(
    name="customer_satisfaction",
    prompt_template="""Based on this interaction, rate the likely customer satisfaction:

Customer Request: {{request}}
Agent Response: {{response}}

Select satisfaction level:

[[very_satisfied]]: Response exceeds expectations with exceptional service
[[satisfied]]: Response meets expectations adequately
[[neutral]]: Response is acceptable but unremarkable
[[dissatisfied]]: Response fails to meet basic expectations
[[very_dissatisfied]]: Response is unhelpful or problematic""",
    numeric_values={
        "very_satisfied": 1.0,
        "satisfied": 0.75,
        "neutral": 0.5,
        "dissatisfied": 0.25,
        "very_dissatisfied": 0.0
    }
)

Rubric-based grading

Implement a detailed grading rubric with explicit criteria:

Python
code_review_rubric = custom_prompt_judge(
    name="code_review_rubric",
    prompt_template="""Evaluate this code review using our quality rubric:

Original Code: {{original_code}}
Review Comments: {{review_comments}}
Code Type: {{code_type}}

Score the review quality:

[[comprehensive]]: Identifies all issues including edge cases, security concerns, performance implications, and suggests specific improvements with examples
[[thorough]]: Catches major issues and most minor ones, provides good suggestions but may miss some edge cases
[[adequate]]: Identifies obvious issues and provides basic feedback, misses subtle problems
[[superficial]]: Only catches surface-level issues, feedback is vague or generic
[[inadequate]]: Misses critical issues or provides incorrect feedback""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.8,
        "adequate": 0.6,
        "superficial": 0.3,
        "inadequate": 0.0
    }
)

Real-world examples

Customer service quality

Python
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer
import mlflow

# Issue resolution status judge
resolution_judge = custom_prompt_judge(
    name="issue_resolution",
    prompt_template="""Evaluate if the customer's issue was resolved:

Customer Message: {{customer_message}}
Agent Response: {{agent_response}}
Issue Type: {{issue_type}}

Rate the resolution status:

[[fully_resolved]]: Issue completely addressed with clear solution provided
[[partially_resolved]]: Some progress made but follow-up needed
[[unresolved]]: Issue not addressed or solution unclear
[[escalated]]: Appropriately escalated to higher support tier""",
    numeric_values={
        "fully_resolved": 1.0,
        "partially_resolved": 0.5,
        "unresolved": 0.0,
        "escalated": 0.7  # Positive score for appropriate escalation
    }
)

# Empathy and tone judge
empathy_judge = custom_prompt_judge(
    name="empathy_score",
    prompt_template="""Assess the emotional intelligence of the response:

Customer Emotion: {{customer_emotion}}
Agent Response: {{agent_response}}

Rate the empathy shown:

[[exceptional]]: Acknowledges emotions, validates concerns, shows genuine care
[[good]]: Shows understanding and appropriate concern
[[adequate]]: Professional but somewhat impersonal
[[poor]]: Cold, dismissive, or inappropriate emotional response""",
    numeric_values={
        "exceptional": 1.0,
        "good": 0.75,
        "adequate": 0.5,
        "poor": 0.0
    }
)

# Create a comprehensive customer service scorer
@scorer
def customer_service_quality(inputs, outputs, trace):
    """Comprehensive customer service evaluation"""
    feedbacks = []

    # Evaluate resolution status
    feedbacks.append(resolution_judge(
        customer_message=inputs.get("message", ""),
        agent_response=outputs.get("response", ""),
        issue_type=inputs.get("issue_type", "general")
    ))

    # Evaluate empathy if customer shows emotion
    customer_emotion = inputs.get("detected_emotion", "neutral")
    if customer_emotion in ["frustrated", "angry", "upset", "worried"]:
        feedbacks.append(empathy_judge(
            customer_emotion=customer_emotion,
            agent_response=outputs.get("response", "")
        ))

    return feedbacks

# Example evaluation
eval_data = [
    {
        "inputs": {
            "message": "I've been waiting 3 weeks for my refund! This is unacceptable!",
            "issue_type": "refund",
            "detected_emotion": "angry"
        },
        "outputs": {
            "response": "I completely understand your frustration - 3 weeks is far too long to wait for a refund. I'm escalating this to our finance team immediately. You'll receive your refund within 24 hours, plus a $50 credit for the inconvenience. I'm also sending you my direct email so you can reach me if there are any other delays."
        }
    }
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[customer_service_quality]
)

Content quality assessment

Python
# Technical documentation quality judge
doc_quality_judge = custom_prompt_judge(
    name="documentation_quality",
    prompt_template="""Evaluate this technical documentation:

Content: {{content}}
Target Audience: {{audience}}
Documentation Type: {{doc_type}}

Rate the documentation quality:

[[excellent]]: Clear, complete, well-structured with examples, appropriate depth
[[good]]: Covers topic well, mostly clear, could use minor improvements
[[fair]]: Basic coverage, some unclear sections, missing important details
[[poor]]: Confusing, incomplete, or significantly flawed""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "fair": 0.4,
        "poor": 0.0
    }
)

# Marketing copy effectiveness
marketing_judge = custom_prompt_judge(
    name="marketing_effectiveness",
    prompt_template="""Rate this marketing copy's effectiveness:

Copy: {{copy}}
Product: {{product}}
Target Demographic: {{target_demographic}}
Call to Action: {{cta}}

Evaluate effectiveness:

[[highly_effective]]: Compelling, clear value prop, strong CTA, perfect for audience
[[effective]]: Good messaging, decent CTA, reasonably targeted
[[moderately_effective]]: Some good elements but lacks impact or clarity
[[ineffective]]: Weak messaging, unclear value, poor audience fit""",
    numeric_values={
        "highly_effective": 1.0,
        "effective": 0.7,
        "moderately_effective": 0.4,
        "ineffective": 0.0
    }
)

Code review quality

Python
# Security review judge
security_review_judge = custom_prompt_judge(
    name="security_review_quality",
    prompt_template="""Evaluate the security aspects of this code review:

Original Code: {{code}}
Review Comments: {{review_comments}}
Security Vulnerabilities Found: {{vulnerabilities_mentioned}}

Rate the security review quality:

[[comprehensive]]: Identifies all security issues, explains risks, suggests secure alternatives
[[thorough]]: Catches major security flaws, good explanations
[[basic]]: Identifies obvious security issues only
[[insufficient]]: Misses critical security vulnerabilities""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.75,
        "basic": 0.4,
        "insufficient": 0.0
    }
)

# Code clarity feedback judge
code_clarity_judge = custom_prompt_judge(
    name="code_clarity_feedback",
    prompt_template="""Assess the code review's feedback on readability:

Original Code Complexity: {{complexity_score}}
Review Feedback: {{review_comments}}
Readability Issues Identified: {{readability_issues}}

Rate the clarity feedback:

[[excellent]]: Identifies all clarity issues, suggests specific improvements, considers maintainability
[[good]]: Points out main clarity problems with helpful suggestions
[[adequate]]: Basic feedback on obvious readability issues
[[minimal]]: Superficial or missing important clarity feedback""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "adequate": 0.4,
        "minimal": 0.0
    }
)

Healthcare communication

Python
# Patient communication appropriateness
patient_comm_judge = custom_prompt_judge(
    name="patient_communication",
    prompt_template="""Evaluate this healthcare provider's response to a patient:

Patient Question: {{patient_question}}
Provider Response: {{provider_response}}
Patient Health Literacy Level: {{health_literacy}}
Sensitive Topics: {{sensitive_topics}}

Rate communication appropriateness:

[[excellent]]: Clear, compassionate, appropriate language level, addresses concerns fully
[[good]]: Generally clear and caring, minor room for improvement
[[acceptable]]: Adequate but could be clearer or more empathetic
[[poor]]: Unclear, uses too much jargon, or lacks appropriate empathy""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "acceptable": 0.5,
        "poor": 0.0
    }
)

# Clinical note quality
clinical_note_judge = custom_prompt_judge(
    name="clinical_note_quality",
    prompt_template="""Assess this clinical note's quality:

Note Content: {{note_content}}
Note Type: {{note_type}}
Required Elements: {{required_elements}}

Rate the clinical documentation:

[[comprehensive]]: All required elements present, clear, follows standards, actionable
[[complete]]: Most elements present, generally clear, minor gaps
[[incomplete]]: Missing important elements or lacks clarity
[[deficient]]: Significant gaps, unclear, or doesn't meet documentation standards""",
    numeric_values={
        "comprehensive": 1.0,
        "complete": 0.7,
        "incomplete": 0.3,
        "deficient": 0.0
    }
)

Pairwise response comparison

Use prompt-based judges to compare two responses and determine which is better. This is useful for A/B testing, model comparison, and preference learning.

note

Pairwise comparison judges evaluate two responses simultaneously rather than a single response, so they cannot be used with mlflow.evaluate() or as scorers. Use them directly for comparative analysis.

Python
from mlflow.genai.judges import custom_prompt_judge

# Response preference judge
preference_judge = custom_prompt_judge(
    name="response_preference",
    prompt_template="""Compare these two responses to the same question and determine which is better:

Question: {{question}}

Response A: {{response_a}}

Response B: {{response_b}}

Evaluation Criteria:
1. Accuracy and completeness of information
2. Clarity and ease of understanding
3. Helpfulness and actionability
4. Appropriate tone for the context

Choose your preference:

[[strongly_prefer_a]]: Response A is significantly better across most criteria
[[slightly_prefer_a]]: Response A is marginally better overall
[[equal]]: Both responses are equally good (or equally poor)
[[slightly_prefer_b]]: Response B is marginally better overall
[[strongly_prefer_b]]: Response B is significantly better across most criteria""",
    numeric_values={
        "strongly_prefer_a": -1.0,
        "slightly_prefer_a": -0.5,
        "equal": 0.0,
        "slightly_prefer_b": 0.5,
        "strongly_prefer_b": 1.0
    }
)

# Example usage for model comparison
question = "How do I improve my GenAI app's response quality?"

response_model_v1 = """To improve response quality, you should:
1. Add more training data
2. Fine-tune your model
3. Use better prompts"""

response_model_v2 = """To improve your GenAI app's response quality, consider these strategies:

1. **Enhance your prompts**: Use clear, specific instructions with examples
2. **Implement evaluation**: Use MLflow's LLM judges to measure quality systematically
3. **Collect feedback**: Gather user feedback to identify improvement areas
4. **Iterate on weak areas**: Focus on responses that score poorly
5. **A/B test changes**: Compare versions to ensure improvements

Start with evaluation to establish a baseline, then iterate based on data."""

# Compare responses
feedback = preference_judge(
    question=question,
    response_a=response_model_v1,
    response_b=response_model_v2
)

print(f"Preference: {feedback.metadata['string_value']}")  # "strongly_prefer_b"
print(f"Score: {feedback.value}")  # 1.0
print(f"Rationale: {feedback.rationale}")

Specialized comparison judges

Python
# Technical accuracy comparison for documentation
tech_comparison_judge = custom_prompt_judge(
    name="technical_comparison",
    prompt_template="""Compare these two technical explanations:

Topic: {{topic}}
Target Audience: {{audience}}

Explanation A: {{explanation_a}}

Explanation B: {{explanation_b}}

Focus on:
- Technical accuracy and precision
- Appropriate depth for the audience
- Use of examples and analogies
- Completeness without overwhelming detail

Which explanation is better?

[[a_much_better]]: A is significantly more accurate and appropriate
[[a_slightly_better]]: A is marginally better in accuracy or clarity
[[equivalent]]: Both are equally good technically
[[b_slightly_better]]: B is marginally better in accuracy or clarity
[[b_much_better]]: B is significantly more accurate and appropriate""",
    numeric_values={
        "a_much_better": -1.0,
        "a_slightly_better": -0.5,
        "equivalent": 0.0,
        "b_slightly_better": 0.5,
        "b_much_better": 1.0
    }
)

# Empathy comparison for customer service
empathy_comparison_judge = custom_prompt_judge(
    name="empathy_comparison",
    prompt_template="""Compare the emotional intelligence of these customer service responses:

Customer Situation: {{situation}}
Customer Emotion: {{emotion}}

Agent Response A: {{response_a}}

Agent Response B: {{response_b}}

Evaluate which response better:
- Acknowledges the customer's emotions
- Shows genuine understanding and care
- Offers appropriate emotional support
- Maintains professional boundaries

Which response shows better emotional intelligence?

[[a_far_superior]]: A shows much better emotional intelligence
[[a_better]]: A is somewhat more empathetic
[[both_good]]: Both show good emotional intelligence
[[b_better]]: B is somewhat more empathetic
[[b_far_superior]]: B shows much better emotional intelligence""",
    numeric_values={
        "a_far_superior": -1.0,
        "a_better": -0.5,
        "both_good": 0.0,
        "b_better": 0.5,
        "b_far_superior": 1.0
    }
)

Practical comparison workflows

Python
# Compare outputs from different prompt versions
def compare_prompt_versions(test_cases, prompt_v1, prompt_v2, model_client):
    """Compare two prompt versions across multiple test cases"""
    results = []

    for test_case in test_cases:
        # Generate responses with each prompt
        response_v1 = model_client.generate(prompt_v1.format(**test_case))
        response_v2 = model_client.generate(prompt_v2.format(**test_case))

        # Compare responses
        feedback = preference_judge(
            question=test_case["question"],
            response_a=response_v1,
            response_b=response_v2
        )

        results.append({
            "question": test_case["question"],
            "preference": feedback.metadata["string_value"],
            "score": feedback.value,
            "rationale": feedback.rationale
        })

    # Analyze results
    avg_score = sum(r["score"] for r in results) / len(results)

    if avg_score < -0.2:
        print(f"Prompt V1 is preferred (avg score: {avg_score:.2f})")
    elif avg_score > 0.2:
        print(f"Prompt V2 is preferred (avg score: {avg_score:.2f})")
    else:
        print(f"Prompts perform similarly (avg score: {avg_score:.2f})")

    return results

# Compare different model outputs
def compare_models(questions, model_a, model_b, comparison_judge):
    """Compare two models across a set of questions"""
    win_counts = {"model_a": 0, "model_b": 0, "tie": 0}

    for question in questions:
        response_a = model_a.generate(question)
        response_b = model_b.generate(question)

        feedback = comparison_judge(
            question=question,
            response_a=response_a,
            response_b=response_b
        )

        # Count wins based on preference strength
        if feedback.value <= -0.5:
            win_counts["model_a"] += 1
        elif feedback.value >= 0.5:
            win_counts["model_b"] += 1
        else:
            win_counts["tie"] += 1

    print(f"Model comparison results: {win_counts}")
    return win_counts

Advanced usage patterns

Conditional scoring

Apply different evaluation criteria based on context:

Python
@scorer
def adaptive_quality_scorer(inputs, outputs, trace):
    """Applies different judges based on context"""

    # Determine which judge to use based on input characteristics
    query_type = inputs.get("query_type", "general")

    if query_type == "technical":
        judge = custom_prompt_judge(
            name="technical_response",
            prompt_template="""Rate this technical response:

Question: {{question}}
Response: {{response}}
Required Depth: {{depth_level}}

[[expert]]: Demonstrates deep expertise, includes advanced concepts
[[proficient]]: Good technical accuracy, appropriate depth
[[basic]]: Correct but lacks depth or nuance
[[incorrect]]: Contains technical errors or misconceptions""",
            numeric_values={
                "expert": 1.0,
                "proficient": 0.75,
                "basic": 0.5,
                "incorrect": 0.0
            }
        )

        return judge(
            question=inputs["question"],
            response=outputs["response"],
            depth_level=inputs.get("required_depth", "intermediate")
        )

    elif query_type == "support":
        judge = custom_prompt_judge(
            name="support_response",
            prompt_template="""Rate this support response:

Issue: {{issue}}
Response: {{response}}
Customer Status: {{customer_status}}

[[excellent]]: Solves issue completely, proactive, appropriate for customer status
[[good]]: Addresses issue well, professional
[[fair]]: Partially helpful but incomplete
[[poor]]: Unhelpful or inappropriate""",
            numeric_values={
                "excellent": 1.0,
                "good": 0.7,
                "fair": 0.4,
                "poor": 0.0
            }
        )

        return judge(
            issue=inputs["question"],
            response=outputs["response"],
            customer_status=inputs.get("customer_status", "standard")
        )

Score aggregation strategies

Combine scores from multiple judges intelligently:

Python
@scorer
def weighted_quality_scorer(inputs, outputs, trace):
    """Combines multiple judges with weighted scoring"""

    # Define judges and their weights
    judges_config = [
        {
            "judge": custom_prompt_judge(
                name="accuracy",
                prompt_template="...",  # Your accuracy template
                numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0}
            ),
            "weight": 0.4,
            "args": {"question": inputs["question"], "response": outputs["response"]}
        },
        {
            "judge": custom_prompt_judge(
                name="completeness",
                prompt_template="...",  # Your completeness template
                numeric_values={"complete": 1.0, "partial": 0.5, "incomplete": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"], "requirements": inputs.get("requirements", [])}
        },
        {
            "judge": custom_prompt_judge(
                name="clarity",
                prompt_template="...",  # Your clarity template
                numeric_values={"clear": 1.0, "adequate": 0.6, "unclear": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"]}
        }
    ]

    # Collect all feedbacks
    feedbacks = []
    weighted_score = 0.0

    for config in judges_config:
        feedback = config["judge"](**config["args"])
        feedbacks.append(feedback)

        # Add to weighted score if numeric
        if isinstance(feedback.value, (int, float)):
            weighted_score += feedback.value * config["weight"]

    # Add composite score as additional feedback
    from mlflow.entities import Feedback
    composite_feedback = Feedback(
        name="weighted_quality_score",
        value=weighted_score,
        rationale=f"Weighted combination of {len(judges_config)} quality dimensions"
    )
    feedbacks.append(composite_feedback)

    return feedbacks

Best practices

Designing effective choices

1. Make choices mutually exclusive and collectively exhaustive

Python
# Good - clear distinctions, covers all cases
"""[[approved]]: Meets all requirements, ready for production
[[needs_revision]]: Has issues that must be fixed before approval
[[rejected]]: Fundamental flaws, requires complete rework"""

# Bad - overlapping and ambiguous
"""[[good]]: The response is good
[[okay]]: The response is okay
[[fine]]: The response is fine"""

2. Give each choice specific criteria

Python
# Good - specific, measurable criteria
"""[[secure]]: No vulnerabilities, follows all security best practices, includes input validation
[[mostly_secure]]: Minor security concerns that should be addressed but aren't critical
[[insecure]]: Contains vulnerabilities that could be exploited"""

# Bad - vague criteria
"""[[secure]]: Looks secure
[[not_secure]]: Has problems"""

3. Order choices logically (best to worst)

Python
# Good - clear progression
numeric_values = {
    "exceptional": 1.0,
    "good": 0.75,
    "satisfactory": 0.5,
    "needs_improvement": 0.25,
    "unacceptable": 0.0
}

Numeric scale design

1. Use consistent scales across judges

Python
# All judges use 0-1 scale
quality_judge = custom_prompt_judge(..., numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0})
accuracy_judge = custom_prompt_judge(..., numeric_values={"accurate": 1.0, "partial": 0.5, "wrong": 0.0})

2. Leave gaps for future refinement

Python
# Allows adding intermediate levels later
numeric_values = {
    "excellent": 1.0,
    "good": 0.7,  # Gap allows for "very_good" at 0.85
    "fair": 0.4,  # Gap allows for "satisfactory" at 0.55
    "poor": 0.0
}

3. Consider domain-specific scales

Python
# Academic grading scale
academic_scale = {
    "A": 4.0,
    "B": 3.0,
    "C": 2.0,
    "D": 1.0,
    "F": 0.0
}

# Net Promoter Score scale
nps_scale = {
    "promoter": 1.0,   # 9-10
    "passive": 0.0,    # 7-8
    "detractor": -1.0  # 0-6
}

Prompt engineering tips

1. Structure prompts clearly

Python
prompt_template = """[Clear Task Description]
Evaluate the technical accuracy of this response.

[Context Section]
Question: {{question}}
Response: {{response}}
Technical Domain: {{domain}}

[Evaluation Criteria]
Consider: factual accuracy, appropriate depth, correct terminology

[Choice Definitions]
[[accurate]]: All technical facts correct, appropriate level of detail
[[mostly_accurate]]: Minor inaccuracies that don't affect core understanding
[[inaccurate]]: Contains significant errors or misconceptions"""

2. Include examples when helpful

Python
prompt_template = """Assess the urgency level of this support ticket.

Ticket: {{ticket_content}}

Examples of each level:
- Critical: System down, data loss, security breach
- High: Major feature broken, blocking work
- Medium: Performance issues, non-critical bugs
- Low: Feature requests, minor UI issues

Choose urgency level:
[[critical]]: Immediate attention required, business impact
[[high]]: Urgent, significant user impact
[[medium]]: Important but not urgent
[[low]]: Can be addressed in normal workflow"""

Comparison with guidelines-based judges

  • Evaluation type: guidelines-based judges give a binary pass/fail; prompt-based judges give multi-level categories.
  • Scoring: guidelines-based judges return "yes" or "no"; prompt-based judges return custom choices with optional numeric values.
  • Best for: guidelines-based judges suit compliance and policy adherence; prompt-based judges suit quality assessment and satisfaction ratings.
  • Iteration speed: guidelines-based judges are very fast to iterate on (just update the guideline text); prompt-based judges are moderate (choices may need adjustment).
  • Business-user friendliness: guidelines-based judges are ✅ high (natural-language rules); prompt-based judges are ⚠️ medium (users must understand the choices and the full prompt).
  • Aggregation: guidelines-based judges count pass/fail rates; prompt-based judges compute averages and track trends.

Validation and error handling

Choice validation

The judge validates that:

  • Choices are properly defined in [[choice_name]] format
  • Choice names are alphanumeric (underscores allowed)
  • At least one choice is defined in the template
Python
# This will raise an error - no choices defined
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="Rate the response: {{response}}"
)
# ValueError: Prompt template must include choices denoted with [[CHOICE_NAME]]

Numeric value validation

When using numeric_values, every choice must be mapped:

Python
# This will raise an error - missing choice in numeric_values
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="""Choose:
[[option_a]]: First option
[[option_b]]: Second option""",
    numeric_values={"option_a": 1.0}  # Missing option_b
)
# ValueError: numeric_values keys must match the choices
# ValueError: numeric_values keys must match the choices

Template variable validation

Missing template variables cause errors at call time:

Python
judge = custom_prompt_judge(
    name="test",
    prompt_template="{{request}} {{response}} [[good]]: Good"
)

# This will raise an error - missing 'response' variable
judge(request="Hello")
# KeyError: Template variable 'response' not found

Next steps