Production quality monitoring (running scorers automatically)

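Note

Beta

This feature is in Beta.

MLflow can automatically run scorers on a sample of your production traces so that you can monitor quality continuously.

Key benefits:

Automated quality assessment without manual intervention
Flexible sampling to balance coverage against compute cost
Consistent evaluation using the same scorers you used in development
Continuous monitoring through periodic background execution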

Prerequisites

Install MLflow and the required packages
Bash
```
pip install --upgrade "mlflow[databricks]>=3.1.0" openai
```
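Create an MLflow experiment by following the quickstart for setting up your environment.
Instrument your production application with MLflow Tracing.
Have access to a Unity Catalog schema with CREATE TABLE permissions, where the monitoring outputs will be stored.

Note

If you are using a Databricks trial account, you have CREATE TABLE permissions on the Unity Catalog schema workspace.default.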

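Step 1: Test your scorers on production traces

First, check that the scorers you plan to use in production can evaluate your traces.

Tip

If you used your production app as the predict_fn in mlflow.genai.evaluate() during development, your scorers are likely already compatible.

Warning

MLflow currently supports only predefined scorers for production monitoring. If you need to run custom code-based or LLM-based scorers in production, contact your Databricks account representative.

Use mlflow.genai.evaluate() to test the scorers on a sample of your traces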

Python
```
import mlflow

from mlflow.genai.scorers import (
    Guidelines,
    RelevanceToQuery,
    RetrievalGroundedness,
    RetrievalRelevance,
    Safety,
)

# Get a sample of up to 10 traces from your experiment
traces = mlflow.search_traces(max_results=10)

# Run evaluation to test the scorers
mlflow.genai.evaluate(
    data=traces,
    scorers=[
        RelevanceToQuery(),
        RetrievalGroundedness(),
        RetrievalRelevance(),
        Safety(),
        Guidelines(
            name="mlflow_only",
            # Guidelines can refer to the request and response.
            guidelines="If the request is unrelated to MLflow, the response must refuse to answer.",
        ),
        # You can have any number of guidelines.
        Guidelines(
            name="customer_service_tone",
            guidelines="""The response must maintain our brand voice which is:
    - Professional yet warm and conversational (avoid corporate jargon)
    - Empathetic, acknowledging emotional context before jumping to solutions
    - Proactive in offering help without being pushy

    Specifically:
    - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
    - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
    - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
    - The response must end with a specific next step or open-ended offer to help, not generic closings""",
        ),
    ],
)
```

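Use the MLflow Trace UI to check which scorers ran

In this case, notice that although we ran the RetrievalGroundedness() and RetrievalRelevance() scorers, their results do not appear in the MLflow UI. This indicates that these scorers do not work with our traces, so we should not enable them in the next step.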
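
The same check can be written down as a small helper, which is useful when there are many scorers to compare. This is a minimal sketch in plain Python, not an MLflow API: the function name `scorers_without_results`, the scorer labels, and the input shape (one list of assessment names per trace, filled in from what you see in the Trace UI) are all assumptions made for illustration.

```python
def scorers_without_results(requested, assessments_per_trace):
    """Return the requested scorer names that produced no assessment on any trace."""
    seen = set()
    for names in assessments_per_trace:
        seen.update(names)
    # Keep the order in which the scorers were requested.
    return [name for name in requested if name not in seen]


# Hypothetical data mirroring the outcome described above.
requested = [
    "relevance_to_query",
    "retrieval_groundedness",
    "retrieval_relevance",
    "safety",
]
observed = [
    ["relevance_to_query", "safety"],  # assessments seen on trace 1
    ["safety"],                        # assessments seen on trace 2
]

to_skip = scorers_without_results(requested, observed)
print(to_skip)
```

Any scorer returned by the helper is a candidate to leave out when you enable monitoring in the next step.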

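Step 2: Enable monitoring

Now, enable the monitoring service. Once enabled, the monitoring service syncs a copy of your evaluated traces from your MLflow experiment to a Delta table in the Unity Catalog schema that you specify.

Important

Once set, the Unity Catalog schema cannot be changed.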

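Using the UI

Use the UI to enable the scorers that ran successfully in Step 1. The sampling rate you select determines the fraction of traces on which the scorers run (for example, entering 1.0 runs the scorers on 100% of traces, and 0.2 runs them on 20%).

To set a different sampling rate for each scorer, you must use the SDK.

Using the SDK

Use the following code snippet to enable the scorers that ran successfully in Step 1. The sampling rate you select determines the fraction of traces on which the scorers run (for example, 1.0 runs the scorers on 100% of traces, and 0.2 runs them on 20%). Optionally, you can set a sampling rate for each scorer.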

Python
```
# These packages are automatically installed with mlflow[databricks]
from databricks.agents.monitoring import (
    AssessmentsSuiteConfig,
    BuiltinJudge,
    GuidelinesJudge,
    create_external_monitor,
)

external_monitor = create_external_monitor(
    # Change to a Unity Catalog schema where you have CREATE TABLE permissions.
    catalog_name="workspace",
    schema_name="default",
    assessments_config=AssessmentsSuiteConfig(
        sample=1.0,  # sampling rate
        assessments=[
            # Predefined scorers "safety", "groundedness", "relevance_to_query", "chunk_relevance"
            BuiltinJudge(name="safety"),  # or {'name': 'safety'}
            BuiltinJudge(
                name="groundedness", sample_rate=0.4
            ),  # or {'name': 'groundedness', 'sample_rate': 0.4}
            BuiltinJudge(
                name="relevance_to_query"
            ),  # or {'name': 'relevance_to_query'}
            BuiltinJudge(name="chunk_relevance"),  # or {'name': 'chunk_relevance'}
            # Guidelines can refer to the request and response.
            GuidelinesJudge(
                guidelines={
                    # You can have any number of guidelines, each defined as a key-value pair.
                    "mlflow_only": [
                        "If the request is unrelated to MLflow, the response must refuse to answer."
                    ],  # Must be an array of strings
                    "customer_service_tone": [
                        """The response must maintain our brand voice which is:
    - Professional yet warm and conversational (avoid corporate jargon)
    - Empathetic, acknowledging emotional context before jumping to solutions
    - Proactive in offering help without being pushy

    Specifically:
    - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
    - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
    - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
    - The response must end with a specific next step or open-ended offer to help, not generic closings"""
                    ],
                }
            ),
        ],
    ),
)

print(external_monitor)
```
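
When choosing sampling rates, it can help to estimate how many traces each scorer will evaluate per run. The sketch below is plain arithmetic, not part of the monitoring SDK. It assumes that a scorer's own sample_rate, when set, replaces the overall sample value; this guide does not state how the two combine, so treat that as an assumption. The function name `expected_scored_traces` and the trace volume are hypothetical.

```python
def expected_scored_traces(num_traces, default_rate, per_scorer_rates):
    """Estimate how many traces each scorer evaluates.

    Assumption: a per-scorer rate, when given, replaces the default rate.
    A value of None means "use the default rate".
    """
    estimates = {}
    for name, rate in per_scorer_rates.items():
        effective = default_rate if rate is None else rate
        if not 0.0 <= effective <= 1.0:
            raise ValueError(f"sampling rate for {name!r} must be between 0 and 1")
        estimates[name] = round(num_traces * effective)
    return estimates


# Hypothetical volume of 1,000 traces with the rates used in the snippet above.
estimates = expected_scored_traces(
    num_traces=1000,
    default_rate=1.0,
    per_scorer_rates={
        "safety": None,
        "groundedness": 0.4,
        "relevance_to_query": None,
    },
)
print(estimates)
```

Lowering a scorer's rate reduces its share of the compute cost roughly in proportion, at the price of covering fewer traces.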

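Step 3: Update your monitor

To change the scorer configuration, use update_external_monitor(). The configuration is stateless, meaning that each update completely overwrites the previous configuration. To retrieve the existing configuration so that you can modify it, use get_external_monitor().

Using the UI

Use the UI to update the scorers.

Using the SDK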

Python
```
# These packages are automatically installed with mlflow[databricks]
import os

from databricks.agents.monitoring import (
    AssessmentsSuiteConfig,
    BuiltinJudge,
    GuidelinesJudge,
    get_external_monitor,
    update_external_monitor,
)

# Retrieve and inspect the existing configuration before overwriting it.
config = get_external_monitor(experiment_id=os.environ["MLFLOW_EXPERIMENT_ID"])
print(config)

external_monitor = update_external_monitor(
    # You must pass the experiment_id of the experiment you want to update.
    experiment_id=os.environ["MLFLOW_EXPERIMENT_ID"],
    # The update replaces the whole configuration, so list every scorer you want to keep.
    assessments_config=AssessmentsSuiteConfig(
        sample=1.0,  # sampling rate
        assessments=[
            # Predefined scorers "safety", "groundedness", "relevance_to_query", "chunk_relevance"
            BuiltinJudge(name="safety"),  # or {'name': 'safety'}
            BuiltinJudge(
                name="groundedness", sample_rate=0.4
            ),  # or {'name': 'groundedness', 'sample_rate': 0.4}
            BuiltinJudge(
                name="relevance_to_query"
            ),  # or {'name': 'relevance_to_query'}
            BuiltinJudge(name="chunk_relevance"),  # or {'name': 'chunk_relevance'}
            # Guidelines can refer to the request and response.
            GuidelinesJudge(
                guidelines={
                    # You can have any number of guidelines, each defined as a key-value pair.
                    "mlflow_only": [
                        "If the request is unrelated to MLflow, the response must refuse to answer."
                    ],  # Must be an array of strings
                    "customer_service_tone": [
                        """The response must maintain our brand voice which is:
    - Professional yet warm and conversational (avoid corporate jargon)
    - Empathetic, acknowledging emotional context before jumping to solutions
    - Proactive in offering help without being pushy

    Specifically:
    - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
    - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
    - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
    - The response must end with a specific next step or open-ended offer to help, not generic closings"""
                    ],
                }
            ),
        ],
    ),
)

print(external_monitor)
```
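
Because an update overwrites the whole configuration, changing a single scorer means rebuilding the complete assessment list. The sketch below shows that pattern using the dictionary form of an assessment (such as {'name': 'safety'}) mentioned in the code comments above. The helper name `with_sample_rate` is hypothetical, and the starting list is written by hand rather than read from a real monitor.

```python
import copy


def with_sample_rate(assessments, name, sample_rate):
    """Return a full copy of the assessment list with one scorer's rate changed."""
    updated = copy.deepcopy(assessments)
    for entry in updated:
        if entry["name"] == name:
            entry["sample_rate"] = sample_rate
            return updated
    raise KeyError(f"no scorer named {name!r} in the current configuration")


# Hand-written stand-in for the currently configured assessments.
current = [
    {"name": "safety"},
    {"name": "groundedness", "sample_rate": 0.4},
    {"name": "relevance_to_query"},
]

new_assessments = with_sample_rate(current, "groundedness", 0.2)
print(new_assessments)
```

The returned list still contains every scorer, so passing it as the assessments of a new configuration would change only the one rate instead of silently dropping the others.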

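Step 4: Use the monitoring results

The first run of the monitoring job takes approximately 15 to 30 minutes. After the first run, the job runs every 15 minutes. Note that if you have a large volume of production traffic, the job can take longer to complete.

Each time the job runs, it does the following:

Runs each scorer on a sample of traces
- If scorers have different sampling rates, the monitoring job tries to score as many of the same traces as possible. For example, if scorer A has a sampling rate of 20% and scorer B has a sampling rate of 40%, the same 20% of traces are scored by both A and B.
Attaches the feedback from the scorers to each trace in the MLflow experiment
Writes a copy of all traces (not only the sampled ones) to the Delta table in the schema configured in Step 2.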
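
The overlap between scorers with different sampling rates, described in the list above, can be illustrated with a small sketch. It shows one common way to obtain that property: each trace gets a stable position in [0, 1) and a scorer evaluates the traces whose position falls below its rate. This is an assumption-based illustration, not a description of how the monitoring job is implemented. The function names and trace IDs are hypothetical.

```python
import hashlib


def sample_position(trace_id):
    """Map a trace ID to a stable value in [0, 1) using a hash of the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Use the top 53 bits so the result is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def is_sampled(trace_id, sample_rate):
    """A trace is scored when its position falls below the scorer's rate."""
    return sample_position(trace_id) < sample_rate


# Hypothetical trace IDs.
trace_ids = [f"trace-{i}" for i in range(1000)]

scored_by_a = {t for t in trace_ids if is_sampled(t, 0.2)}  # scorer A at 20%
scored_by_b = {t for t in trace_ids if is_sampled(t, 0.4)}  # scorer B at 40%

# Every trace scored by A is also scored by B.
print(scored_by_a <= scored_by_b)
```

With this scheme, the traces chosen for a lower rate are always a subset of those chosen for a higher rate, which is what lets A and B share the same 20% of traces.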

You can view the monitoring results in the Trace tab of the MLflow experiment. Alternatively, you can query the traces in the generated Delta table using SQL or Spark.

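Next steps

Continue with these recommended actions and tutorials.

Use production traces to improve your app's quality - Create semantic evaluations using LLMs
Build evaluation datasets - Use the monitoring results to curate low-performing traces into evaluation datasets and improve quality.

Reference guides

See the detailed documentation for the concepts and features described in this guide.

Production monitoring - Details of the production monitoring SDK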
