Predefined judges and scorers

Overview

MLflow provides research-backed judges for common quality checks, available as an SDK, and wraps them as predefined scorers.

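Important

Judges can be used as standalone APIs, but they must be wrapped in a scorer before the evaluation harness or the production monitoring service can use them. MLflow ships predefined scorer implementations, and for more advanced use cases you can also write custom scorers that call the judge APIs.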

Judge	Key inputs	Requires ground truth	What does it evaluate?	Available in predefined scorers
`is_context_relevant`	`request`, `context`	No	Is the `context` directly relevant to the user's `request`, without straying into unrelated topics?	`RelevanceToQuery`, `RetrievalRelevance`
`is_safe`	`content`	No	Does the `content` contain harmful, offensive, or toxic material?	`Safety`
`is_grounded`	`request`, `response`, `context`	No	Is the `response` to the `request` grounded in the information provided in the `context` (for example, the app is not hallucinating a response)?	`RetrievalGroundedness`
`is_correct`	`request`, `response`, `expected_facts`	Yes	Is the `response` to the `request` correct when compared with the provided ground truth `expected_facts`?	`Correctness`
`is_context_sufficient`	`request`, `context`, `expected_facts`	Yes	Does the `context` provide all the information needed to generate a response that includes the ground truth `expected_facts` for the given `request`?	`RetrievalSufficiency`
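
To make the table easier to apply, the following is a minimal sketch that transcribes its "Key inputs" and "Requires ground truth" columns into plain Python and checks which judges a given set of fields can feed. The names `JUDGE_INPUTS`, `NEEDS_GROUND_TRUTH`, `missing_inputs`, and `applicable_judges` are hypothetical helpers for illustration, not part of MLflow.

```python
# Required inputs per judge, transcribed from the table above.
# These names are illustrative helpers, not MLflow APIs.
JUDGE_INPUTS = {
    "is_context_relevant": ("request", "context"),
    "is_safe": ("content",),
    "is_grounded": ("request", "response", "context"),
    "is_correct": ("request", "response", "expected_facts"),
    "is_context_sufficient": ("request", "context", "expected_facts"),
}

# Judges that the table marks as requiring ground truth.
NEEDS_GROUND_TRUTH = {"is_correct", "is_context_sufficient"}


def missing_inputs(judge_name, available):
    """Return the required inputs for `judge_name` that are absent from `available`."""
    return [key for key in JUDGE_INPUTS[judge_name] if key not in available]


def applicable_judges(available):
    """List the judges whose required inputs are all present, in table order."""
    return [name for name in JUDGE_INPUTS if not missing_inputs(name, available)]
```

For instance, an application that records only `request`, `response`, and `context` can feed `is_context_relevant` and `is_grounded`, while `is_correct` would still need `expected_facts`.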

Prerequisites for running the examples

Install MLflow and the required packages
Bash
```
pip install --upgrade "mlflow[databricks]>=3.1.0"
```
To create an MLflow experiment, follow the quickstart on setting up your environment.

Three ways to use prebuilt judges

There are three ways to use the prebuilt judges.

1. Directly via the SDK

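Calling a judge directly through the SDK lets you integrate it into your application itself. For example, you might want to check the groundedness of a response before returning it to the user.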

Below is an example of using the `is_grounded` judge SDK. For more examples, see each judge's page.

Python
from mlflow.genai.judges import is_grounded

result = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context="Paris is the capital of France.",
)
# result is...
# mlflow.entities.Assessment.Feedback(
#     rationale="The response asks 'What is the capital of France?' and answers 'Paris'. The retrieved context states 'Paris is the capital of France.' This directly supports the answer given in the response.",
#     feedback=FeedbackValue(value=<CategoricalRating.YES: 'yes'>)
# )

result = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context="Paris is known for its Eiffel Tower.",
)

# result is...
# mlflow.entities.Assessment.Feedback(
#     rationale="The retrieved context states that 'Paris is known for its Eiffel Tower,' but it does not mention that Paris is the capital of France. Therefore, the response is not fully supported by the retrieved context.",
#     feedback=FeedbackValue(value=<CategoricalRating.NO: 'no'>)
# )
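
The idea of checking groundedness before returning a response to the user could be sketched as follows. This is a minimal illustration, not MLflow's own pattern: `guarded_response`, `StubFeedback`, and `stub_judge` are hypothetical names, and the stub judge exists only so the sketch runs offline. It also assumes that a passing rating compares equal to the string `"yes"`, which is inferred from the `CategoricalRating.YES: 'yes'` shown in the output above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StubFeedback:
    """Hypothetical stand-in for the feedback object a judge returns."""
    value: str
    rationale: str


def guarded_response(request, response, context, judge: Callable,
                     fallback="I'm not sure."):
    """Return `response` only if `judge` rates it as grounded, else `fallback`."""
    result = judge(request=request, response=response, context=context)
    # Assumption: a passing rating compares equal to "yes".
    if result.value == "yes":
        return response
    return fallback


def stub_judge(*, request, response, context):
    """Toy judge for offline runs: 'yes' iff the response text appears in the context."""
    grounded = response.lower() in context.lower()
    return StubFeedback(value="yes" if grounded else "no",
                        rationale="substring check")
```

In a real application you would presumably pass `is_grounded` from `mlflow.genai.judges` as the `judge` argument instead of `stub_judge`; taking the judge as a parameter keeps the gating logic testable without calling an LLM.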

2. Via the prebuilt scorers

For simpler applications, you can get started with evaluation using MLflow's predefined scorers.

Below is an example of using the predefined `Correctness` scorer. See each judge's page for more examples and for the trace data schema required to use its predefined scorer.

Python
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, a stunning metropolis known worldwide for its iconic Eiffel Tower, rich cultural heritage, beautiful architecture, world-class museums like the Louvre, and its status as one of Europe's most important political and economic centers. As the capital city, Paris serves as the seat of France's government and is home to numerous important national institutions."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."],
        },
    },
]


import mlflow
from mlflow.genai.scorers import Correctness

# Pass a scorer instance, not the class itself.
eval_results = mlflow.genai.evaluate(data=eval_dataset, scorers=[Correctness()])
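
Because the table above lists `expected_facts` as a required input for `is_correct`, it can help to build records in the shape shown in this example and confirm that each carries ground truth before running the evaluation. The sketch below assumes that record layout (`inputs`, `outputs`, `expectations`); `make_record` and `records_missing_ground_truth` are hypothetical helpers, not MLflow functions.

```python
def make_record(query, response, expected_facts=None):
    """Build one evaluation record in the layout used by the example above."""
    record = {
        "inputs": {"query": query},
        "outputs": {"response": response},
    }
    # Only attach expectations when ground truth is actually available.
    if expected_facts:
        record["expectations"] = {"expected_facts": list(expected_facts)}
    return record


def records_missing_ground_truth(dataset):
    """Return the indices of records that have no non-empty `expected_facts`."""
    return [
        index
        for index, record in enumerate(dataset)
        if not record.get("expectations", {}).get("expected_facts")
    ]
```

A non-empty result from `records_missing_ground_truth` points at the records to fix or drop before scoring them for correctness.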

3. In a custom scorer

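You can wrap the judge SDK in a custom scorer when your application logic and evaluation criteria become more complex, when you need finer control over the data passed to the judge, or when your application's traces do not meet the requirements of the predefined scorers.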

Below is an example of wrapping the `is_grounded` judge SDK in a custom scorer.

Python
from typing import Any, Dict

import mlflow
from mlflow.genai.judges import is_grounded
from mlflow.genai.scorers import scorer

eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris",
            "retrieved_context": [
                {
                    "content": "Paris is the capital of France.",
                    "source": "wikipedia",
                }
            ],
        },
    },
]

@scorer
def is_grounded_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"],
    )

eval_results = mlflow.genai.evaluate(data=eval_dataset, scorers=[is_grounded_scorer])
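
One form of "finer control over the data passed to the judge" is reshaping the outputs before the call. The sketch below flattens the `retrieved_context` chunks from the example above into a single string and assembles keyword arguments named after the `is_grounded` inputs. Both `context_to_text` and `grounded_kwargs` are hypothetical helpers, and flattening is an illustrative design choice rather than a requirement; the example above passes the list unchanged.

```python
def context_to_text(retrieved_context, separator="\n"):
    """Join the `content` field of each retrieved chunk into one string."""
    return separator.join(
        chunk["content"] for chunk in retrieved_context if chunk.get("content")
    )


def grounded_kwargs(inputs, outputs):
    """Map a record's inputs/outputs onto the judge's keyword arguments."""
    return {
        "request": inputs["query"],
        "response": outputs["response"],
        # Drops metadata such as `source`, keeping only the chunk text.
        "context": context_to_text(outputs["retrieved_context"]),
    }
```

Inside a custom scorer, the body could then become `return is_grounded(**grounded_kwargs(inputs, outputs))`, which keeps the field mapping in one place if the trace schema changes.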

Evaluation results

Next steps

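Use predefined scorers in evaluation - Get started with the built-in quality metrics
Create custom judges - Build judges tailored to your specific needs
Run evaluations - Apply judges to systematically evaluate your application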
