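Create a custom judge using `make_judge()`

Custom judges are LLM-based scorers that evaluate GenAI agents against specific quality criteria. This tutorial shows how to create custom judges with make_judge() and use them to evaluate a customer support agent. For API details, see the MLflow documentation.

This tutorial walks through the following steps. For a notebook that contains the code on this page, see "Example notebook".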

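Create a sample agent to evaluate.
Define three custom judges that evaluate different criteria.
Build an evaluation dataset with test cases.
Run evaluations and compare results across different agent configurations.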

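Step 1: Create an agent to evaluate

Create a GenAI agent that responds to customer support questions. The code includes a global variable, RESOLVE_ISSUES, that toggles the system prompt so that you can compare the judges' outputs on "good" and "bad" conversations.

Install the required packages.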

Python
%pip install --upgrade mlflow databricks-sdk databricks_openai databricks-agents
dbutils.library.restartPython()

Initialize an OpenAI client to connect to either Databricks-hosted LLMs or OpenAI-hosted LLMs.

Databricks-hosted LLMs

Use databricks-openai to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

Python
import mlflow
from databricks_openai import DatabricksOpenAI

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
client = DatabricksOpenAI()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

OpenAI-hosted LLMs

Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

Python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that connects to OpenAI-hosted LLMs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

Define the customer support agent.

Python
from mlflow.entities import Document
from typing import List, Dict, Any, cast

# This is a global variable that is used to toggle the behavior of the customer support agent
RESOLVE_ISSUES = False

@mlflow.trace(span_type="TOOL", name="get_product_price")
def get_product_price(product_name: str) -> str:
    """Mock tool to get product pricing."""
    return f"${45.99}"

@mlflow.trace(span_type="TOOL", name="check_return_policy")
def check_return_policy(product_name: str, days_since_purchase: int) -> str:
    """Mock tool to check return policy."""
    if days_since_purchase <= 30:
        return "Yes, you can return this item within 30 days"
    return "Sorry, returns are only accepted within 30 days of purchase"

@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):
    # We use this toggle to see how the judge handles the issue resolution status
    system_prompt_postfix = (
        "Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\n"
        if not RESOLVE_ISSUES
        else ""
    )

    # Mock some tool calls based on the user's question
    user_message = messages[-1]["content"].lower()
    tool_results = []

    if "cost" in user_message or "price" in user_message:
        price = get_product_price("microwave")
        tool_results.append(f"Price: {price}")

    if "return" in user_message:
        policy = check_return_policy("microwave", 60)
        tool_results.append(f"Return policy: {policy}")

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *messages,
    ]

    if tool_results:
        messages_for_llm.append({
            "role": "system",
            "content": f"Tool results: {', '.join(tool_results)}"
        })

    # Call LLM to generate a response
    output = client.chat.completions.create(
        model=model_name,  # This example uses Databricks hosted Claude 4 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }
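
Because the tool routing in customer_support_agent is plain keyword matching, it can be checked without calling an LLM. The following is a minimal sketch that mirrors that logic. select_tools is a hypothetical helper that is not part of the tutorial code, and this copy of check_return_policy omits the tracing decorator and the unused product_name argument.

```python
from typing import List


def check_return_policy(days_since_purchase: int) -> str:
    """Mock return-policy tool with the same 30-day rule as the agent's version."""
    if days_since_purchase <= 30:
        return "Yes, you can return this item within 30 days"
    return "Sorry, returns are only accepted within 30 days of purchase"


def select_tools(user_message: str) -> List[str]:
    """Hypothetical helper: names of the mock tools the agent would call."""
    text = user_message.lower()
    tools = []
    if "cost" in text or "price" in text:
        tools.append("get_product_price")
    if "return" in text:
        tools.append("check_return_policy")
    return tools


print(select_tools("How much does a microwave cost?"))
print(select_tools("Can I return the microwave I bought 2 months ago?"))
print(select_tools("I can't log in."))
# The agent always passes 60 days, so the mock tool always refuses the return.
print(check_return_policy(60))
```

Questions that match neither keyword, such as the login examples later in the dataset, trigger no tool calls at all, which is worth keeping in mind when reading the tool-call judge's verdicts.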

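Step 2: Define custom judges

Define three custom judges:

A judge that evaluates issue resolution using inputs and outputs.
A judge that checks expected behaviors.
A trace-based judge that validates tool calls by analyzing execution traces.

Judges created with make_judge() return mlflow.entities.Feedback objects.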
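
As a rough illustration of consuming such a result, the sketch below formats a judge result for logging. It assumes the Feedback object exposes name, value, and rationale attributes. summarize_feedback is a hypothetical helper, and the example uses a stand-in object with placeholder values so that it runs without MLflow or an LLM call.

```python
from types import SimpleNamespace


def summarize_feedback(feedback) -> str:
    """Format a judge result by reading its name, value, and rationale attributes."""
    return f"{feedback.name}: {feedback.value} ({feedback.rationale})"


# Stand-in object with placeholder values (not real judge output).
# In practice the object would be the Feedback returned by a judge.
example = SimpleNamespace(
    name="issue_resolution",
    value="needs_follow_up",
    rationale="placeholder rationale",
)
print(summarize_feedback(example))
```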

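Judge example 1: Evaluate issue resolution

This judge analyzes the conversation history (inputs) and the agent's responses (outputs) to assess whether the customer's issue was successfully resolved.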

Python
from mlflow.genai.judges import make_judge
from typing import Literal

# Create a judge that evaluates issue resolution using inputs and outputs
issue_resolution_judge = make_judge(
    name="issue_resolution",
    instructions=(
        "Evaluate if the customer's issue was resolved in the conversation.\n\n"
        "User's messages: {{ inputs }}\n"
        "Agent's responses: {{ outputs }}"
    ),
    feedback_value_type=Literal["fully_resolved", "partially_resolved", "needs_follow_up"],
)
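
The Literal type passed as feedback_value_type lists the only labels the judge is meant to return. A small sketch, using only the standard library, shows how the same type could be reused to validate labels in downstream code. IssueResolution and is_allowed are illustrative names that are not part of the tutorial.

```python
from typing import Literal, get_args

# Same labels as the feedback_value_type of issue_resolution_judge
IssueResolution = Literal["fully_resolved", "partially_resolved", "needs_follow_up"]


def is_allowed(value: str) -> bool:
    """Return True if value is one of the labels declared in the Literal type."""
    return value in get_args(IssueResolution)


print(get_args(IssueResolution))
print(is_allowed("fully_resolved"))
print(is_allowed("resolved"))
```

Keeping the label set in one named type avoids retyping the strings when filtering or aggregating results later.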

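Judge example 2: Check expected behaviors

This judge compares the agent's output against predefined expectations to verify that the response exhibits specific expected behaviors, such as providing pricing information or explaining the return policy.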

Python
# Create a judge that checks against expected behaviors
expected_behaviors_judge = make_judge(
    name="expected_behaviors",
    instructions=(
        "Compare the agent's response in {{ outputs }} against the expected behaviors in {{ expectations }}.\n\n"
        "User's question: {{ inputs }}"
    ),
    feedback_value_type=Literal["meets_expectations", "partially_meets", "does_not_meet"],
)

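Judge example 3: Validate tool calls using a trace-based judge

This judge analyzes execution traces to validate that appropriate tools were called. Including {{ trace }} in the instructions makes the judge trace-based and gives it autonomous trace exploration capabilities.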

Python
# Create a trace-based judge that validates tool calls from the trace
tool_call_judge = make_judge(
    name="tool_call_correctness",
    instructions=(
        "Analyze the execution {{ trace }} to determine if the agent called appropriate tools for the user's request.\n\n"
        "Examine the trace to:\n"
        "1. Identify what tools were available and their purposes\n"
        "2. Determine which tools were actually called\n"
        "3. Assess whether the tool calls were reasonable for addressing the user's question"
    ),
    feedback_value_type=bool,
    # To analyze a full trace with a trace-based judge, a model must be specified
    model="databricks:/databricks-gpt-5-mini",
)
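
Since the template variables in the instructions determine what a judge sees, it can help to lint instruction strings before creating judges. The sketch below extracts the {{ ... }} placeholders with a regular expression. It is an illustration only, not how MLflow parses instructions, and template_variables and is_trace_based are hypothetical helpers.

```python
import re
from typing import Set

# Matches placeholders such as {{ inputs }} or {{trace}}
TEMPLATE_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(instructions: str) -> Set[str]:
    """Return the names that appear inside {{ ... }} placeholders."""
    return set(TEMPLATE_VAR.findall(instructions))


def is_trace_based(instructions: str) -> bool:
    """Hypothetical check: treat a judge as trace-based if it references {{ trace }}."""
    return "trace" in template_variables(instructions)


io_instructions = "User's messages: {{ inputs }}\nAgent's responses: {{ outputs }}"
trace_instructions = "Analyze the execution {{ trace }} to determine if the agent called appropriate tools."

print(sorted(template_variables(io_instructions)))
print(is_trace_based(io_instructions))
print(is_trace_based(trace_instructions))
```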

Step 3: Create a sample evaluation dataset

Each inputs value is passed to your agent by mlflow.genai.evaluate(). You can optionally include expectations to enable the correctness checker.

Python
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
        "expectations": {
            "should_provide_pricing": True,
            "should_offer_alternatives": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
        "expectations": {
            "should_mention_return_policy": True,
            "should_ask_for_receipt": False,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
        "expectations": {
            "should_provide_troubleshooting_steps": True,
            "should_escalate_if_needed": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
        "expectations": {
            "should_remain_calm": True,
            "should_provide_solution": True,
        },
    },
]
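
Before running an evaluation, a quick structural check of the dataset can catch malformed records early. The following is a minimal sketch. validate_record is a hypothetical helper, and the rule that the last message must come from the user is an assumption based on this tutorial's agent, which reads messages[-1]["content"].

```python
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one evaluation record (empty if none)."""
    problems = []
    messages = record.get("inputs", {}).get("messages")
    if not messages:
        problems.append("inputs.messages is missing or empty")
    elif messages[-1].get("role") != "user":
        problems.append("last message should come from the user")
    # expectations is optional, but must be a dict when present
    if not isinstance(record.get("expectations", {}), dict):
        problems.append("expectations should be a dict")
    return problems


good = {
    "inputs": {"messages": [{"role": "user", "content": "How much does a microwave cost?"}]},
    "expectations": {"should_provide_pricing": True},
}
bad = {"inputs": {"messages": []}}

print(validate_record(good))
print(validate_record(bad))
```

Applied to the dataset above, this could be run as a list comprehension over eval_dataset, keeping only the records that return a non-empty problem list.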

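Step 4: Evaluate the agent using the judges

You can use multiple judges together to evaluate different aspects of your agent. Run the evaluation twice to compare how the agent behaves when it tries to resolve issues and when it does not.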

Python
# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False

result_unresolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,      # Checks inputs/outputs
        expected_behaviors_judge,    # Checks expected behaviors
        tool_call_judge,             # Validates tool usage
    ],
)

# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True

result_resolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,
        expected_behaviors_judge,
        tool_call_judge,
    ],
)

The evaluation results show how each judge rated the agent:

issue_resolution: Rates each conversation as 'fully_resolved', 'partially_resolved', or 'needs_follow_up'.
expected_behaviors: Checks whether responses exhibit the expected behaviors ('meets_expectations', 'partially_meets', 'does_not_meet').
tool_call_correctness: Validates whether appropriate tools were called (true/false).
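
One simple way to compare the two runs is to count how often each label appears per judge. The sketch below assumes the per-row labels have already been collected into plain Python lists; how to extract them from the evaluation result is not shown here. tally is a hypothetical helper, and the label lists are made up for illustration and are not output from a real run.

```python
from collections import Counter
from typing import Dict, List


def tally(labels: List[str]) -> Dict[str, int]:
    """Count how often each judge label occurs."""
    return dict(Counter(labels))


# Made-up labels for illustration only (not results from a real evaluation run)
unresolved_labels = ["needs_follow_up", "needs_follow_up", "partially_resolved", "needs_follow_up"]
resolved_labels = ["fully_resolved", "fully_resolved", "partially_resolved", "fully_resolved"]

print(tally(unresolved_labels))
print(tally(resolved_labels))
```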

Example notebook

Create a custom judge notebook


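Next steps

Apply custom judges:

Evaluate and improve a GenAI application - Use custom judges in an end-to-end evaluation workflow.
Production monitoring for GenAI - Deploy custom judges for continuous quality monitoring in production.

Improve judge accuracy:

Align judges with human feedback - The base judge is a starting point. As you gather expert feedback on your application's outputs, align the LLM judges with that feedback to further improve their accuracy.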
