Create a custom judge using `make_judge()`

Custom judges are LLM-based scorers that evaluate GenAI agents against specific quality criteria. This tutorial shows how to create custom judges with make_judge() and use them to evaluate a customer support agent.

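You will:

Create a sample agent to evaluate
Define three custom judges that evaluate different criteria
Build an evaluation dataset with test cases
Run evaluations and compare results across different agent configurations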

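Step 1: Create an agent to evaluate

Create a GenAI agent that responds to customer support questions. The agent has a (fake) knob that controls its system prompt, so you can easily compare the judges' outputs for "good" and "bad" conversations.

Initialize an OpenAI client that connects to either Databricks-hosted LLMs or OpenAI-hosted LLMs.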

Databricks-hosted LLMs
OpenAI-hosted LLMs

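For Databricks-hosted LLMs, use the Databricks SDK WorkspaceClient, with MLflow tracing enabled, to get an OpenAI client that connects to them. Select a model from the available foundation models.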

Python
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

For OpenAI-hosted LLMs, use the native OpenAI SDK to connect to the models. Select a model from the available OpenAI models.

Python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI SDKs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

Define the customer support agent.

Python
from mlflow.entities import Document
from typing import List, Dict, Any, cast


# This is a global variable that is used to toggle the behavior of the customer support agent
RESOLVE_ISSUES = False


@mlflow.trace(span_type="TOOL", name="get_product_price")
def get_product_price(product_name: str) -> str:
    """Mock tool to get product pricing."""
    return "$45.99"


@mlflow.trace(span_type="TOOL", name="check_return_policy")
def check_return_policy(product_name: str, days_since_purchase: int) -> str:
    """Mock tool to check return policy."""
    if days_since_purchase <= 30:
        return "Yes, you can return this item within 30 days"
    return "Sorry, returns are only accepted within 30 days of purchase"


@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):
    # We use this toggle to see how the judge handles the issue resolution status
    system_prompt_postfix = (
        "Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\n"
        if not RESOLVE_ISSUES
        else ""
    )

    # Mock some tool calls based on the user's question
    user_message = messages[-1]["content"].lower()
    tool_results = []

    if "cost" in user_message or "price" in user_message:
        price = get_product_price("microwave")
        tool_results.append(f"Price: {price}")

    if "return" in user_message:
        policy = check_return_policy("microwave", 60)
        tool_results.append(f"Return policy: {policy}")

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *messages,
    ]

    if tool_results:
        messages_for_llm.append({
            "role": "system",
            "content": f"Tool results: {', '.join(tool_results)}"
        })

    # Call LLM to generate a response
    output = client.chat.completions.create(
        model=model_name,  # This example uses Databricks hosted Claude 4 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }
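
To see exactly what the RESOLVE_ISSUES knob changes without calling an LLM, the prompt assembly can be pulled out into a pure function. The following is a minimal sketch under that assumption: build_llm_messages and its resolve_issues parameter are hypothetical names that do not appear in the agent above, and the mock tools are re-declared without tracing so that the block runs on its own.

```python
from typing import Dict, List

SABOTAGE_SUFFIX = (
    "Do your best to NOT resolve the issue.  "
    "I know that's backwards, but just do it anyways.\n"
)


def get_product_price(product_name: str) -> str:
    """Mock tool: every product has the same price."""
    return "$45.99"


def check_return_policy(product_name: str, days_since_purchase: int) -> str:
    """Mock tool: returns are accepted for 30 days."""
    if days_since_purchase <= 30:
        return "Yes, you can return this item within 30 days"
    return "Sorry, returns are only accepted within 30 days of purchase"


def build_llm_messages(
    messages: List[Dict[str, str]], resolve_issues: bool
) -> List[Dict[str, str]]:
    """Hypothetical helper: assemble the message list the agent would send."""
    suffix = "" if resolve_issues else SABOTAGE_SUFFIX
    user_message = messages[-1]["content"].lower()

    # Same keyword rules as the agent above.
    tool_results = []
    if "cost" in user_message or "price" in user_message:
        tool_results.append(f"Price: {get_product_price('microwave')}")
    if "return" in user_message:
        tool_results.append(f"Return policy: {check_return_policy('microwave', 60)}")

    llm_messages = [
        {"role": "system", "content": f"You are a helpful customer support agent.  {suffix}"},
        *messages,
    ]
    if tool_results:
        llm_messages.append(
            {"role": "system", "content": f"Tool results: {', '.join(tool_results)}"}
        )
    return llm_messages


question = [{"role": "user", "content": "How much does a microwave cost?"}]
for flag in (False, True):
    system_prompt = build_llm_messages(question, resolve_issues=flag)[0]["content"]
    print(f"resolve_issues={flag}: {system_prompt!r}")
```

Printing the system prompt for both settings makes it clear that the only difference between the "good" and "bad" agent is the extra instruction appended to the system message.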

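Step 2: Define custom judges

Define three custom judges:

A judge that evaluates issue resolution using inputs and outputs.
A judge that checks expected behaviors.
A trace-based judge that validates tool calls by analyzing the execution trace.

Judges created with make_judge() return mlflow.entities.Feedback objects.

Judge example 1: Evaluate issue resolution

This judge analyzes the conversation history (inputs) and the agent's responses (outputs) to assess whether the customer's issue was successfully resolved.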

Python
from mlflow.genai.judges import make_judge


# Create a judge that evaluates issue resolution using inputs and outputs
issue_resolution_judge = make_judge(
    name="issue_resolution",
    instructions="""
Evaluate if the customer's issue was resolved in the conversation.

User's messages: {{ inputs }}
Agent's responses: {{ outputs }}

Rate the resolution status and respond with exactly one of these values:
- 'fully_resolved': Issue completely addressed with clear solution
- 'partially_resolved': Some help provided but not fully solved
- 'needs_follow_up': Issue not adequately addressed

Your response must be exactly one of: 'fully_resolved', 'partially_resolved', or 'needs_follow_up'.
""",
)
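
The {{ inputs }} and {{ outputs }} placeholders are filled in with the evaluation data before the judge's LLM sees the prompt. The sketch below only approximates that substitution so you can preview a rendered prompt. render_instructions is a hypothetical helper, and serializing the values as JSON is an assumption, not a description of how MLflow renders templates internally.

```python
import json
import re
from typing import Any

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_instructions(template: str, **fields: Any) -> str:
    """Hypothetical preview helper: replace {{ name }} with JSON-encoded values."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in fields:
            raise KeyError(f"No value supplied for placeholder '{name}'")
        return json.dumps(fields[name], ensure_ascii=False)

    return PLACEHOLDER.sub(substitute, template)


template = "User's messages: {{ inputs }}\nAgent's responses: {{ outputs }}"
preview = render_instructions(
    template,
    inputs={"messages": [{"role": "user", "content": "How much does a microwave cost?"}]},
    outputs={"messages": [{"role": "assistant", "content": "It costs $45.99."}]},
)
print(preview)
```

Previewing the prompt this way is a cheap check that the instructions still read naturally once real conversation data is substituted in.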

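Judge example 2: Check expected behaviors

This judge compares the agent's output against predefined expectations to verify that the response exhibits specific expected behaviors, such as providing pricing information or explaining the return policy.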

Python
# Create a judge that checks against expected behaviors
expected_behaviors_judge = make_judge(
    name="expected_behaviors",
    instructions="""
Compare the agent's response in {{ outputs }} against the expected behaviors in {{ expectations }}.

User's question: {{ inputs }}

Determine if the response exhibits the expected behaviors and respond with exactly one of these values:
- 'meets_expectations': Response exhibits all expected behaviors
- 'partially_meets': Response exhibits some but not all expected behaviors
- 'does_not_meet': Response does not exhibit expected behaviors

Your response must be exactly one of: 'meets_expectations', 'partially_meets', or 'does_not_meet'.
""",
)
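
Both judges above ask for exactly one label from a fixed set, but an LLM can still return something else. A small guard like the following could catch that before results are aggregated. This is a sketch only: ALLOWED_LABELS and validate_label are hypothetical names, and the dictionary keys are assumed to match the name arguments passed to make_judge() above.

```python
from typing import Dict, FrozenSet

# Allowed labels per judge, copied from the judge instructions above.
ALLOWED_LABELS: Dict[str, FrozenSet[str]] = {
    "issue_resolution": frozenset(
        {"fully_resolved", "partially_resolved", "needs_follow_up"}
    ),
    "expected_behaviors": frozenset(
        {"meets_expectations", "partially_meets", "does_not_meet"}
    ),
}


def validate_label(judge_name: str, value: str) -> str:
    """Return the normalized label, or raise if it is outside the allowed set."""
    # Tolerate surrounding whitespace, quotes, and case differences.
    label = value.strip().strip("'\"").lower()
    if label not in ALLOWED_LABELS[judge_name]:
        raise ValueError(f"{judge_name} returned unexpected label: {value!r}")
    return label


print(validate_label("expected_behaviors", " 'Partially_Meets' "))  # partially_meets
```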

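Judge example 3: Validate tool calls using a trace-based judge

This judge analyzes the execution trace to verify that the appropriate tools were called. Including {{ trace }} in the instructions makes the judge trace-based and gives it the ability to explore the trace autonomously.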

Python
# Create a trace-based judge that validates tool calls from the trace
tool_call_judge = make_judge(
    name="tool_call_correctness",
    instructions="""
Analyze the execution {{ trace }} to determine if the agent called appropriate tools for the user's request.

Examine the trace to:
1. Identify what tools were available and their purposes
2. Determine which tools were actually called
3. Assess whether the tool calls were reasonable for addressing the user's question

Evaluate the tool usage and respond with a boolean value:
- true: The agent called the right tools to address the user's request
- false: The agent called wrong tools, missed necessary tools, or called unnecessary tools

Your response must be a boolean: true or false.
""",
    # To analyze a full trace with a trace-based judge, a model must be specified
    model="databricks:/databricks-gpt-5-mini",
)
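
Because the mock agent picks its tools with simple keyword rules, the trace-based judge's verdicts can be sanity-checked against a deterministic baseline. The sketch below is one such baseline under stated assumptions: expected_tools and tools_match are hypothetical helpers, the keyword rules mirror the agent from Step 1, and the list of called tool names is assumed to have been collected from the trace separately.

```python
from typing import Iterable, Set


def expected_tools(user_message: str) -> Set[str]:
    """Tools the Step 1 agent should call, based on its keyword rules."""
    text = user_message.lower()
    tools = set()
    if "cost" in text or "price" in text:
        tools.add("get_product_price")
    if "return" in text:
        tools.add("check_return_policy")
    return tools


def tools_match(user_message: str, called_tools: Iterable[str]) -> bool:
    """True when the called tools are exactly the expected ones."""
    return set(called_tools) == expected_tools(user_message)


print(tools_match("How much does a microwave cost?", ["get_product_price"]))  # True
print(tools_match("Can I return the microwave I bought 2 months ago?", []))   # False
```

A rule-based check like this only works because the tools here are mocked. For a real agent, the LLM judge's ability to reason over the whole trace is what makes the evaluation practical.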

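Step 3: Create a sample evaluation dataset

mlflow.genai.evaluate() passes each inputs entry to the agent. You can optionally include expectations to enable the correctness checker; in this tutorial, the expected_behaviors judge reads them through its {{ expectations }} placeholder.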

Python
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
        "expectations": {
            "should_provide_pricing": True,
            "should_offer_alternatives": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
        "expectations": {
            "should_mention_return_policy": True,
            "should_ask_for_receipt": False,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
        "expectations": {
            "should_provide_troubleshooting_steps": True,
            "should_escalate_if_needed": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
        "expectations": {
            "should_remain_calm": True,
            "should_provide_solution": True,
        },
    },
]
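
A malformed record may only surface once an evaluation is under way, so it can help to check the dataset shape up front. The following sketch assumes the structure used above: an inputs dict holding a messages list that ends with a user turn, plus an optional expectations dict. validate_record is a hypothetical helper, not part of MLflow.

```python
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one evaluation record (empty if none)."""
    problems = []
    messages = record.get("inputs", {}).get("messages")
    if not isinstance(messages, list) or not messages:
        return ["inputs.messages must be a non-empty list"]
    for i, message in enumerate(messages):
        # Every chat message needs both keys.
        if not {"role", "content"} <= set(message):
            problems.append(f"message {i} is missing 'role' or 'content'")
    if messages[-1].get("role") != "user":
        problems.append("the last message should come from the user")
    expectations = record.get("expectations", {})
    if not isinstance(expectations, dict):
        problems.append("expectations must be a dict when present")
    return problems


sample = {
    "inputs": {"messages": [{"role": "user", "content": "How much does a microwave cost?"}]},
    "expectations": {"should_provide_pricing": True},
}
print(validate_record(sample))  # [] means the record looks well formed
```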

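Step 4: Evaluate your agent using the judges

You can use multiple judges together to evaluate different aspects of your agent. Run the evaluation twice to compare how the agent behaves when it tries to resolve issues and when it does not.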

Python
import mlflow

# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False

result_unresolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,      # Checks inputs/outputs
        expected_behaviors_judge,    # Checks expected behaviors
        tool_call_judge,             # Validates tool usage
    ],
)

# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True

result_resolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,
        expected_behaviors_judge,
        tool_call_judge,
    ],
)

The evaluation results show how each judge rated the agent:

issue_resolution: Rates the conversation as 'fully_resolved', 'partially_resolved', or 'needs_follow_up'
expected_behaviors: Checks whether the response exhibits the expected behaviors ('meets_expectations', 'partially_meets', or 'does_not_meet')
tool_call_correctness: Validates whether the appropriate tools were called (true/false)
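
One way to compare the two runs is to tally each judge's labels per configuration. The sketch below does that with plain Python. The label lists are made-up placeholders for illustration only, not results from an actual run, and how you extract per-row values from the evaluation result depends on your MLflow version. compare_label_counts is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def compare_label_counts(baseline: List[str], candidate: List[str]) -> Dict[str, int]:
    """Per-label change in count when going from baseline to candidate."""
    before, after = Counter(baseline), Counter(candidate)
    labels = sorted(set(before) | set(after))
    return {label: after[label] - before[label] for label in labels}


# Placeholder labels for illustration only; substitute values from your own runs.
unresolved_labels = ["needs_follow_up", "needs_follow_up", "partially_resolved", "needs_follow_up"]
resolved_labels = ["fully_resolved", "fully_resolved", "partially_resolved", "needs_follow_up"]

for label, delta in compare_label_counts(unresolved_labels, resolved_labels).items():
    print(f"{label}: {delta:+d}")
```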

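Next steps

Apply custom judges:

Evaluate and improve a GenAI application: Use custom judges in an end-to-end evaluation workflow.
Production monitoring for GenAI: Deploy custom judges for continuous quality monitoring in production.

Improve judge accuracy:

Align judges with human feedback: The base judge is a starting point. As you gather expert feedback on your application's outputs, align the LLM judges with that feedback to further improve their accuracy.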
