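How to create guidelines-based LLM scorers

Overview

scorers.Guidelines() and scorers.ExpectationsGuidelines() are scorers that wrap judges.meets_guidelines(), the LLM judge SDK provided by Databricks. They are designed to let you quickly and easily customize evaluation by defining natural-language criteria framed as pass/fail conditions. They are ideal for checking compliance with rules or style guides, or for verifying that information is included or excluded.

Guidelines have the distinct advantage of being easy to explain to business stakeholders ("we are evaluating whether the app meets this set of rules"), so domain experts can often write them directly.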

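You can use the guidelines LLM judge in two ways:

If your guidelines consider only the app's inputs and outputs, and your app's trace has only simple inputs (for example, only the user query) and outputs (for example, only the app's response), use the prebuilt guidelines scorer.
If your guidelines consider additional data (such as retrieved documents or tool calls), or your trace has complex inputs/outputs that contain fields you want to exclude from evaluation (such as user_id), create a custom scorer that wraps the judges.meets_guidelines() API.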
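As a rough illustration of the second case, the sketch below shows how a custom scorer might select only the fields the judge should see before calling judges.meets_guidelines(). The helper name build_judge_context and the sample values are hypothetical; the field names (user_messages, user_id, message, policies_followed) mirror the sample app defined later in this guide.

```python
import json
from typing import Any, Dict


def build_judge_context(inputs: Dict[str, Any], outputs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only the fields the judge should evaluate."""
    return {
        # Serialize the conversation so the judge receives it as one string.
        "request": json.dumps(inputs["user_messages"]),
        "response": outputs["message"],
        # Extra data that a guideline can refer to by name.
        "provided_policies": outputs["policies_followed"],
        # inputs["user_id"] is deliberately left out of the context.
    }


context = build_judge_context(
    inputs={
        "user_messages": [{"role": "user", "content": "Can I return this?"}],
        "user_id": 1,
    },
    outputs={
        "message": "Yes, within 30 days.",
        "policies_followed": ["30-day returns"],
    },
)
print(sorted(context))
```

The resulting dictionary has the shape that the custom scorers in section 2 pass as the context argument.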

Note

For details on how the prebuilt guidelines scorer parses your trace, see the concept page for the prebuilt guidelines scorer.

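1. Use the prebuilt guidelines scorer

In this guide, you add custom evaluation criteria to the prebuilt scorer and run an offline evaluation with the resulting scorers. You can schedule these same scorers to run in production to continuously monitor your application's quality.

Step 1: Create a sample app to evaluate

First, create a sample generative AI app that answers customer support questions. The app has a few (fake) knobs that control the system prompt, so you can easily compare the guidelines judge's outputs between "good" and "bad" responses.

Initialize an OpenAI client to connect to either Databricks-hosted LLMs or OpenAI-hosted LLMs.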

Databricks-hosted LLMs
OpenAI-hosted LLMs

Use the Databricks SDK to get an OpenAI client that connects to Databricks-hosted LLMs; MLflow autologging traces its calls. Select a model from the available foundation models.

Python
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

Python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI SDKs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

Define the customer support app.

Python
from typing import List, Dict, Any


# This is a global variable that is used to toggle the behavior of the customer support agent to see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):

 # 1. Prepare messages for the LLM
 system_prompt_postfix = (
     "Be super rude and very verbose in your responses."
     if BE_RUDE_AND_VERBOSE
     else ""
 )
 messages_for_llm = [
     {
         "role": "system",
         "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
     },
     *messages,
 ]

 # 2. Call LLM to generate a response
 return client.chat.completions.create(
     model=model_name,  # Uses the model selected above. If you provide your own OpenAI credentials, use a valid OpenAI model such as gpt-4o.
     messages=messages_for_llm,
 )

result = customer_support_agent(
 messages=[
     {"role": "user", "content": "How much does a microwave cost?"},
 ]
)
print(result)

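Step 2: Define your evaluation criteria

Usually, you work with business stakeholders to define the guidelines. Here, we define a few sample guidelines. When writing guidelines, refer to the app's inputs as the request and the app's outputs as the response. To understand what data is passed to the LLM judge, see the section on how the prebuilt guidelines scorer parses inputs and outputs.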

Python
tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structure responses logically.  It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request.  Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."

Note

Guidelines can be as long or as short as you need. Conceptually, a guideline is a "mini prompt" that defines the pass criteria. Guidelines can optionally include markdown formatting (such as a bulleted list).
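As one possible illustration of a longer guideline with markdown formatting, the sketch below joins several rules into a single bulleted string. The rules and the variable name support_style are made up for this example; the resulting string could be passed as the guidelines argument in the same way as the one-line guidelines above.

```python
# Hypothetical rules, written the way a domain expert might list them.
rules = [
    "The response must be written in English.",
    "The response must not mention competitor products.",
    "The response must end with an offer of further help.",
]

# Join the rules into one guideline string using markdown bullets.
support_style = "The response must satisfy all of the following:\n" + "\n".join(
    f"- {rule}" for rule in rules
)
print(support_style)
```

Keeping the rules in a list makes it easy to review them with stakeholders one at a time while still sending the judge a single pass/fail criterion.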

Step 3: Create a sample evaluation dataset

Each row's inputs dictionary is passed to your app by mlflow.genai.evaluate().

Python
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ]
        },
    },
]
print(eval_dataset)
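To make the relationship between the dataset and the app concrete, here is a minimal sketch that assumes each row's inputs dictionary is unpacked into keyword arguments of the function under evaluation. The function fake_agent is a stand-in that avoids any LLM call; it is not part of MLflow or of the sample app.

```python
from typing import Dict, List


def fake_agent(messages: List[Dict[str, str]]) -> str:
    """Stand-in for customer_support_agent; echoes the last user message."""
    return f"You said: {messages[-1]['content']}"


rows = [
    {"inputs": {"messages": [{"role": "user", "content": "How much does a microwave cost?"}]}},
    {"inputs": {"messages": [{"role": "user", "content": "Website"}]}},
]

# Assumed behavior: the keys of `inputs` become keyword arguments of the app.
responses = [fake_agent(**row["inputs"]) for row in rows]
print(responses)
```

Under that assumption, the key messages in each row has to match the parameter name of the app function exactly.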

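Step 4: Evaluate your app using the custom scorers

Finally, run the evaluation twice so you can compare the guidelines scorers' judgments for the rude/verbose version of the app (first screenshot) and the polite/non-verbose version (second screenshot).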

Python
from mlflow.genai.scorers import Guidelines
import mlflow

# First, let's evaluate the app's responses against the guidelines when it is not rude and verbose
BE_RUDE_AND_VERBOSE = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)


# Next, let's evaluate the app's responses against the guidelines when it IS rude and verbose
BE_RUDE_AND_VERBOSE = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)

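Screenshot: evaluation results from the prebuilt scorers for the rude and verbose app

Screenshot: evaluation results from the prebuilt scorers for the polite and non-verbose app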

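2. Create a custom scorer that wraps the guidelines judge

In this guide, you create custom scorers that wrap the judges.meets_guidelines() API and run an offline evaluation with the resulting scorers. You can schedule these same scorers to run in production to continuously monitor your application's quality.

Step 1: Create a sample app to evaluate

Initialize an OpenAI client to connect to either Databricks-hosted LLMs or OpenAI-hosted LLMs.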

Databricks-hosted LLMs
OpenAI-hosted LLMs

Use the Databricks SDK to get an OpenAI client that connects to Databricks-hosted LLMs; MLflow autologging traces its calls. Select a model from the available foundation models.

Python
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

Python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI SDKs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

Define the customer support app.

Python
from typing import List, Dict


# This is a global variable that toggles whether the customer support agent is told to follow the policies, so we can see how the guidelines scorers handle responses that ignore them
FOLLOW_POLICIES = False

# This is a global variable that is used to toggle the behavior of the customer support agent to see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(user_messages: List[Dict[str, str]], user_id: int):

 # 1. Fake policies to follow.
 @mlflow.trace
 def get_policies_for_user(user_id: int):
     if user_id == 1:
         return [
             "All returns must be processed within 30 days of purchase, with a valid receipt.",
         ]
     else:
         return [
             "All returns must be processed within 90 days of purchase, with a valid receipt.",
         ]

 policies_to_follow = get_policies_for_user(user_id)

 # 2. Prepare messages for the LLM
 # This toggle controls whether the policies are added to the system prompt
 system_prompt_postfix = (
     f"Follow the following policies: {policies_to_follow}.  Do not refer to the specific policies in your response.\n"
     if FOLLOW_POLICIES
     else ""
 )

 system_prompt_postfix = (
     f"{system_prompt_postfix}Be super rude and very verbose in your responses.\n"
     if BE_RUDE_AND_VERBOSE
     else system_prompt_postfix
 )
 messages_for_llm = [
     {
         "role": "system",
         "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
     },
     *user_messages,
 ]

 # 3. Call LLM to generate a response
 output = client.chat.completions.create(
     model=model_name,  # Uses the model selected above. If you provide your own OpenAI credentials, use a valid OpenAI model such as gpt-4o.
     messages=messages_for_llm,
 )

 return {
     "message": output.choices[0].message.content,
     "policies_followed": policies_to_follow,
 }

result = customer_support_agent(
 user_messages=[
     {"role": "user", "content": "How much does a microwave cost?"},
 ],
 user_id=1
)
print(result)
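The policy lookup in the app compares user_id with the integer 1, so the type of the value in the evaluation data matters. The sketch below reproduces only that branching in a hypothetical function named lookup_policies (no tracing, no LLM call) to show what happens when a string is passed instead.

```python
def lookup_policies(user_id):
    """Same branching as the app's policy lookup, without tracing."""
    if user_id == 1:
        return ["All returns must be processed within 30 days of purchase, with a valid receipt."]
    return ["All returns must be processed within 90 days of purchase, with a valid receipt."]


# The integer 1 matches the 30-day branch.
print(lookup_policies(1)[0])
# The string "1" is not equal to the integer 1 in Python, so it falls through to 90 days.
print(lookup_policies("1")[0])
```

For this reason the evaluation dataset later in this section uses integer user_id values.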

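Step 2: Define your evaluation criteria and wrap them as custom scorers

Usually, you work with business stakeholders to define the guidelines. Here, we define a few sample guidelines and use custom scorers to wire them to the app's input/output schema.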

Python
from mlflow.genai.scorers import scorer
from mlflow.genai.judges import meets_guidelines
import json
from typing import Dict, Any


tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structure responses logically.  It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request.  Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."
# Note in this guideline how we refer to `provided_policies` - we will make the meets_guidelines LLM judge aware of this data.
follows_policies_guideline = "If the provided_policies is relevant to the request and response, the response must adhere to the provided_policies."

# Define a custom scorer that wraps the guidelines LLM judge to check if the response follows the policies
@scorer
def follows_policies(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # we directly return the Feedback object from the guidelines LLM judge, but we could have post-processed it before returning it.
    return meets_guidelines(
        name="follows_policies",
        guidelines=follows_policies_guideline,
        context={
            # Here we make meets_guidelines aware of the policies the app returned, under the key name the guideline refers to
            "provided_policies": outputs["policies_followed"],
            "response": outputs["message"],
            "request": json.dumps(inputs["user_messages"]),
        },
    )


# Define a custom scorer that wraps the guidelines LLM judge to pass the custom keys from the inputs/outputs to the guidelines LLM judge
@scorer
def check_guidelines(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    feedbacks = []

    request = json.dumps(inputs["user_messages"])
    response = outputs["message"]

    feedbacks.append(
        meets_guidelines(
            name="tone",
            guidelines=tone,
            # Note: While we used request and response as keys, we could have used any key as long as our guideline referred to that key by name (e.g., if we had used output instead of response, we would have changed our guideline to be "The output must be polite")
            context={"response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="structure",
            guidelines=structure,
            context={"response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="banned_topics",
            guidelines=banned_topics,
            context={"request": request, "response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="relevance",
            guidelines=relevance,
            context={"request": request, "response": response},
        )
    )

    # A scorer can return a list of Feedback objects OR a single Feedback object.
    return feedbacks
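The four nearly identical calls in check_guidelines could also be driven from a small table. The sketch below is one possible way to do that, not the approach this guide prescribes: GUIDELINE_SPECS, run_guidelines, and recording_judge are hypothetical names, the guideline texts are shortened, and the judge is injected as a parameter so the example runs without MLflow. In a real scorer you would pass meets_guidelines as the judge.

```python
import json
from typing import Any, Callable, Dict, List

# (name, guideline text, context keys the guideline refers to)
GUIDELINE_SPECS = [
    ("tone", "The response must maintain a courteous, respectful tone.", ["response"]),
    (
        "banned_topics",
        "If the request is about pricing, the response must politely decline.",
        ["request", "response"],
    ),
]


def run_guidelines(
    inputs: Dict[str, Any], outputs: Dict[str, Any], judge: Callable[..., Any]
) -> List[Any]:
    """Call the judge once per guideline, passing only the keys it needs."""
    available = {
        "request": json.dumps(inputs["user_messages"]),
        "response": outputs["message"],
    }
    return [
        judge(name=name, guidelines=text, context={k: available[k] for k in keys})
        for name, text, keys in GUIDELINE_SPECS
    ]


def recording_judge(name: str, guidelines: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in judge that only records what it was asked to evaluate."""
    return {"name": name, "context_keys": sorted(context)}


calls = run_guidelines(
    {"user_messages": [{"role": "user", "content": "How much does a microwave cost?"}]},
    {"message": "Please see our pricing page."},
    judge=recording_judge,
)
print(calls)
```

Injecting the judge keeps the context-building logic testable on its own, which helps when the app's input/output schema changes.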

Step 3: Create a sample evaluation dataset

Each row's inputs dictionary is passed to your app by mlflow.genai.evaluate().

Python
eval_dataset = [
    {
        "inputs": {
            # Note that these keys match the **kwargs of our application.
            "user_messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 1,  # the bot should say no if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 2,  # the bot should say yes if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
            "user_id": 1,
        },
    },
]

print(eval_dataset)
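Because the keys of each inputs dictionary need to match the app's parameters, a quick check before running the evaluation can catch naming mistakes early. The following is a minimal sketch using the standard library; agent_signature_stub and find_mismatched_rows are hypothetical names, and the stub only mirrors the signature of the app defined above.

```python
import inspect
from typing import Any, Callable, Dict, List


def agent_signature_stub(user_messages: List[Dict[str, str]], user_id: int):
    """Signature-only stand-in for the customer support app."""


def find_mismatched_rows(dataset: List[Dict[str, Any]], fn: Callable) -> List[int]:
    """Return the indexes of rows whose input keys differ from fn's parameters."""
    expected = set(inspect.signature(fn).parameters)
    return [i for i, row in enumerate(dataset) if set(row["inputs"]) != expected]


sample = [
    {"inputs": {"user_messages": [{"role": "user", "content": "Hi"}], "user_id": 3}},
    {"inputs": {"messages": [{"role": "user", "content": "Hi"}]}},  # wrong key names
]
print(find_mismatched_rows(sample, agent_signature_stub))
```

An empty list means every row can be unpacked into the app's parameters by name.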

Step 4: Evaluate your app using the guidelines

Python
import mlflow

# Now, let's evaluate the app's responses against the guidelines when it is NOT rude and verbose and DOES follow policies
BE_RUDE_AND_VERBOSE = False
FOLLOW_POLICIES = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)


# Now, let's evaluate the app's responses against the guidelines when it IS rude and verbose and does NOT follow policies
BE_RUDE_AND_VERBOSE = True
FOLLOW_POLICIES = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)

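Screenshot: evaluation results from the custom scorers for the rude and verbose app

Screenshot: evaluation results from the custom scorers for the polite and non-verbose app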

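Next steps

Create prompt-based scorers - Build more complex judges with custom prompts and multiple output choices
Run scorers and evaluations - Use your custom guidelines scorers in a comprehensive evaluation
Guidelines concept reference - Understand how the guidelines judge works under the hood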
