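Prompt-based LLM scorers

judges.custom_prompt_judge() is designed to help you quickly and easily build LLM scorers when you need full control over the judge's prompt, or when you need to return multiple output values beyond "pass" / "fail".

You provide a prompt template that contains placeholders for specific fields in your app's trace, and you define the output choices the judge can select. A Databricks-hosted LLM judge model uses these inputs to select the best output choice and provides a rationale for its selection.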

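Note

Databricks recommends starting with guidelines-based judges and using prompt-based judges only when you need more control or cannot write your evaluation criteria as pass/fail guidelines. Guidelines-based judges have the distinct advantage of being easy to explain to business stakeholders, and domain experts can often write them directly.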

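How to create a prompt-based judge scorer

Follow the guide below to create a scorer that wraps judges.custom_prompt_judge().

In this guide, you create a custom scorer that wraps the judges.custom_prompt_judge() API and run an offline evaluation with the resulting scorer. You can schedule these same scorers to run in production to continuously monitor your application's quality.

Note

For details on the interface and parameters, see the judges.custom_prompt_judge() concept page.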

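Step 1: Create a sample app to evaluate

First, create a sample GenAI app that answers customer support questions. The app has a (fake) knob that controls the system prompt, so you can easily compare the judge's outputs between "good" and "bad" conversations.

Initialize an OpenAI client to connect to either Databricks-hosted LLMs or OpenAI-hosted LLMs.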

::::tabs :::tab-item[Databricks-hosted LLMs] Use the Databricks SDK to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

Python
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

:::

:::tab-item[OpenAI-hosted LLMs] Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

Python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to OpenAI-hosted LLMs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

::: ::::

Define the customer support app.

Python
from mlflow.entities import Document
from typing import List, Dict, Any, cast


# This is a global variable that is used to toggle the behavior of the customer support agent to see how the judge handles the issue resolution status
RESOLVE_ISSUES = False


@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):

    # 1. Prepare messages for the LLM
    # We use this toggle later to see how the judge handles the issue resolution status
    system_prompt_postfix = (
        "Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\n"
        if not RESOLVE_ISSUES
        else ""
    )

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *messages,
    ]

    # 2. Call LLM to generate a response
    output = client.chat.completions.create(
        model=model_name,  # Set during client setup above: a Databricks-hosted model, or a valid OpenAI model (e.g., gpt-4o-mini) if you use your own OpenAI credentials.
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }

Step 2: Define your evaluation criteria and wrap them as a custom scorer

Here, you define a sample judge prompt and use a custom scorer to wire it up to your app's input/output schema.

Python
from mlflow.genai.scorers import scorer


# Prompt template for a 3-category issue resolution status
issue_resolution_prompt = """
Evaluate the entire conversation between a customer and an LLM-based agent.  Determine if the issue was resolved in the conversation.

You must choose one of the following categories.

[[fully_resolved]]: The response directly and comprehensively addresses the user's question or problem, providing a clear solution or answer. No further immediate action seems required from the user on the same core issue.
[[partially_resolved]]: The response offers some help or relevant information but doesn't completely solve the problem or answer the question. It might provide initial steps, require more information from the user, or address only a part of a multi-faceted query.
[[needs_follow_up]]: The response does not adequately address the user's query, misunderstands the core issue, provides unhelpful or incorrect information, or inappropriately deflects the question. The user will likely need to re-engage or seek further assistance.

Conversation to evaluate: {{conversation}}
"""

from mlflow.genai.judges import custom_prompt_judge
import json
from mlflow.entities import Feedback


# Define a custom scorer that wraps the prompt-based LLM judge to check whether the issue was resolved
@scorer
def is_issue_resolved(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # We directly return the Feedback object from the prompt-based LLM judge, but we could post-process it before returning it.
    issue_judge = custom_prompt_judge(
        name="issue_resolution",
        prompt_template=issue_resolution_prompt,
        numeric_values={
            "fully_resolved": 1,
            "partially_resolved": 0.5,
            "needs_follow_up": 0,
        },
    )

    # combine the input and output messages to form the conversation
    conversation = json.dumps(inputs["messages"] + outputs["messages"])

    return issue_judge(conversation=conversation)
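
The prompt template above marks output choices with `[[name]]` and placeholders with `{{name}}`, and each choice is expected to have an entry in `numeric_values`. As a sanity check before running an evaluation, the sketch below lists both kinds of markers with regular expressions and verifies that the choices and the `numeric_values` keys agree. The helper `check_template` and the shortened template are hypothetical and written only for illustration; this is not how MLflow parses the template internally.

```python
import re

# Hypothetical helper; not part of MLflow.
CHOICE_PATTERN = re.compile(r"\[\[(\w+)\]\]")
PLACEHOLDER_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def check_template(prompt_template, numeric_values):
    """Return (choices, placeholders) after checking choices against numeric_values."""
    choices = CHOICE_PATTERN.findall(prompt_template)
    placeholders = PLACEHOLDER_PATTERN.findall(prompt_template)
    missing = set(choices) - set(numeric_values)
    extra = set(numeric_values) - set(choices)
    if missing or extra:
        raise ValueError(
            f"choices without a value: {sorted(missing)}; "
            f"values without a choice: {sorted(extra)}"
        )
    return choices, placeholders


# Shortened, illustrative template in the same style as issue_resolution_prompt
template = """
Determine if the issue was resolved.

[[fully_resolved]]: The issue was resolved.
[[needs_follow_up]]: The issue was not resolved.

Conversation to evaluate: {{conversation}}
"""

choices, placeholders = check_template(
    template, {"fully_resolved": 1, "needs_follow_up": 0}
)
print(choices)
print(placeholders)
```

The placeholder names returned here are the keyword arguments the judge is called with, which is why the scorer calls `issue_judge(conversation=conversation)`.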

Step 3: Create a sample evaluation dataset

Each inputs entry is passed to your app by mlflow.genai.evaluate().

Python
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
    },
]
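
As an illustration of how a record reaches the app: the keys of each `inputs` dictionary are passed to the app function as keyword arguments, so they need to match its parameter names (here, `messages`). The sketch below imitates that step without MLflow. `fake_agent` and `check_records` are hypothetical stand-ins written for this example, and no LLM is called.

```python
import inspect


def fake_agent(messages):
    """Hypothetical stand-in for customer_support_agent; no LLM call is made."""
    last_user_turn = messages[-1]["content"]
    return {"messages": [{"role": "assistant", "content": f"Echo: {last_user_turn}"}]}


def check_records(dataset, predict_fn):
    """Check that every record's inputs keys match predict_fn's parameter names."""
    expected = set(inspect.signature(predict_fn).parameters)
    for i, record in enumerate(dataset):
        keys = set(record["inputs"])
        if keys != expected:
            raise ValueError(
                f"record {i}: inputs keys {sorted(keys)} != parameters {sorted(expected)}"
            )


sample_dataset = [
    {"inputs": {"messages": [{"role": "user", "content": "How much does a microwave cost?"}]}},
]

check_records(sample_dataset, fake_agent)

# Imitate the keyword-argument unpacking for one record
result = fake_agent(**sample_dataset[0]["inputs"])
print(result["messages"][0]["content"])
```

A mismatch such as an `inputs` key named `message` instead of `messages` would surface here as a `ValueError` rather than as a failure in the middle of an evaluation run.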

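Step 4: Evaluate your app using the custom scorer

Finally, run the evaluation twice so you can compare the judge's verdicts between conversations where the agent attempts to resolve issues and conversations where it does not.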

Python
import mlflow

# Now, let's evaluate the app's responses against the judge when it does not resolve the issues
RESOLVE_ISSUES = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)


# Now, let's evaluate the app's responses against the judge when it DOES resolve the issues
RESOLVE_ISSUES = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)
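
To make the comparison between the two runs concrete: the `numeric_values` mapping from Step 2 turns each categorical choice into a number, so per-run averages can be compared. The sketch below does that arithmetic by hand. The two lists of judge choices are hypothetical placeholders, not results from an actual run, and the way MLflow itself aggregates and names metrics is not shown here.

```python
from statistics import mean

# Same mapping as the numeric_values argument in Step 2
NUMERIC_VALUES = {"fully_resolved": 1, "partially_resolved": 0.5, "needs_follow_up": 0}


def average_score(choices):
    """Map each categorical choice to its numeric value and average the results."""
    return mean(NUMERIC_VALUES[choice] for choice in choices)


# Hypothetical judge choices, for illustration only
run_without_resolution = [
    "needs_follow_up", "needs_follow_up", "partially_resolved", "needs_follow_up",
]
run_with_resolution = [
    "fully_resolved", "partially_resolved", "fully_resolved", "partially_resolved",
]

print(average_score(run_without_resolution))
print(average_score(run_with_resolution))
```

A higher average for the run with `RESOLVE_ISSUES = True` would indicate that the judge distinguishes the two behaviors as intended.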

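Next steps

Create guidelines-based scorers - Start with simpler pass/fail criteria (recommended)
Run evaluations with your scorers - Use your custom prompt-based scorers in comprehensive evaluations
Prompt-based judge concept reference - Understand how prompt-based judges work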
