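Correctness judge & scorer

The predefined judges.is_correct() judge evaluates whether a GenAI application's response is factually correct by comparing it against the ground-truth information you provide (expected_facts or expected_response).

This judge is available through the predefined Correctness scorer for evaluating application responses against known correct answers.

API signature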

Python
from mlflow.genai.judges import is_correct

def is_correct(
    *,
    request: str,                               # User's question or query
    response: str,                              # Application's response to evaluate
    expected_facts: Optional[list[str]] = None, # List of expected facts (provide either expected_response or expected_facts)
    expected_response: Optional[str] = None,    # Ground truth response (provide either expected_response or expected_facts)
    name: Optional[str] = None                  # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

Install MLflow and the required packages
Bash
```
pip install --upgrade "mlflow[databricks]>=3.1.0"
```
To create an MLflow experiment, follow the quickstart on setting up your environment.

Direct SDK usage

Python
from mlflow.genai.judges import is_correct

# Example 1: Response contains expected facts
feedback = is_correct(
    request="What is MLflow?",
    response="MLflow is an open-source platform for managing the ML lifecycle.",
    expected_facts=[
        "MLflow is open-source",
        "MLflow is a platform for ML lifecycle"
    ]
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of correctness

# Example 2: Response missing or contradicting facts
feedback = is_correct(
    request="When was MLflow released?",
    response="MLflow was released in 2017.",
    expected_facts=["MLflow was released in June 2018"]
)
print(feedback.value)  # "no"
print(feedback.rationale)  # Explanation of what's incorrect

Using the prebuilt scorer

The is_correct judge is available through the prebuilt Correctness scorer.

Requirements:

Trace requirements: inputs and outputs must be on the trace's root span
Ground-truth labels: Required. The expectations dictionary must contain either expected_facts or expected_response
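
The ground-truth requirement above can be checked before an evaluation run. The following is a minimal sketch, not part of MLflow: `rows_missing_ground_truth` and `GROUND_TRUTH_KEYS` are hypothetical names, and the helper assumes the dataset is a list of dictionaries shaped like the examples in this section.

```python
from typing import Any

# Hypothetical helper (not an MLflow API): checks the ground-truth requirement
# described above before a dataset is passed to an evaluation run.
GROUND_TRUTH_KEYS = ("expected_facts", "expected_response")


def rows_missing_ground_truth(dataset: list[dict[str, Any]]) -> list[int]:
    """Return the indices of rows whose expectations lack both ground-truth keys."""
    missing = []
    for index, row in enumerate(dataset):
        expectations = row.get("expectations") or {}
        # A key counts only if it holds a non-empty value.
        if not any(expectations.get(key) for key in GROUND_TRUTH_KEYS):
            missing.append(index)
    return missing


sample_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {"response": "Paris is the capital of France."},
        "expectations": {"expected_facts": ["Paris is the capital of France."]},
    },
    {
        # No expectations at all, so this row has no ground truth to compare against.
        "inputs": {"query": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open-source platform."},
    },
]

print(rows_missing_ground_truth(sample_dataset))  # [1]
```

Running a check like this first surfaces incomplete rows as a plain list of indices rather than as a failure partway through an evaluation.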

Python
import mlflow
from mlflow.genai.scorers import Correctness

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        },
    },
    {
        "inputs": {"query": "What are the main components of MLflow?"},
        "outputs": {
            "response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        },
    },
    {
        "inputs": {"query": "When was MLflow released?"},
        "outputs": {
            "response": "MLflow was released in 2017 by Databricks."
        },
        "expectations": {
            "expected_facts": ["MLflow was released in June 2018"]
        },
    }
]

# Run evaluation with Correctness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[Correctness()]
)

Alternative: expected_response

You can also use expected_response instead of expected_facts.

Python
eval_dataset_with_response = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {
            "response": "MLflow is an open-source platform for managing the ML lifecycle."
        },
        "expectations": {
            "expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
        },
    }
]

# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
    data=eval_dataset_with_response,
    scorers=[Correctness()]
)

Tip

Using expected_facts is recommended over expected_response because it allows more flexible evaluation: the response does not need to match word for word, it only needs to contain the key facts.
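
To make the difference concrete, the sketch below expresses the same ground truth both ways. The fact list is a manual, illustrative split of the reference answer from the expected_response example; the variable names are hypothetical, and the wording of each fact is one possible choice rather than a prescribed format.

```python
# The same ground truth expressed two ways. The fact list is a manual split of
# the reference answer used in the expected_response example.
expected_response = (
    "MLflow is an open-source platform for managing the machine learning "
    "lifecycle, including experimentation, reproducibility, and deployment."
)

# Short, independent statements instead of one long reference sentence.
expected_facts = [
    "MLflow is an open-source platform",
    "MLflow manages the machine learning lifecycle",
    "The lifecycle includes experimentation, reproducibility, and deployment",
]

row_with_response = {
    "inputs": {"query": "What is MLflow?"},
    "outputs": {
        "response": "MLflow is an open-source platform for managing the ML lifecycle."
    },
    "expectations": {"expected_response": expected_response},
}

# Same inputs and outputs; only the expectations differ.
row_with_facts = {
    **row_with_response,
    "expectations": {"expected_facts": expected_facts},
}
```

Either row can be placed in an evaluation dataset; the choice only changes what the response is compared against.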

Using in a custom scorer

When evaluating an application whose data structure differs from what the predefined scorer requires, wrap the judge in a custom scorer.

Python
from typing import Any, Dict

import mlflow
from mlflow.genai.judges import is_correct
from mlflow.genai.scorers import scorer

eval_dataset = [
    {
        "inputs": {"question": "What are the main components of MLflow?"},
        "outputs": {
            "answer": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    },
    {
        "inputs": {"question": "What is MLflow used for?"},
        "outputs": {
            "answer": "MLflow is used for building websites."
        },
        "expectations": {
            "facts": [
                "MLflow is used for managing ML lifecycle",
                "MLflow helps with experiment tracking"
            ]
        }
    }
]

@scorer
def correctness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    return is_correct(
        request=inputs["question"],
        response=outputs["answer"],
        expected_facts=expectations["facts"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[correctness_scorer]
)

Interpreting results

The judge returns a Feedback object with:

value: "yes" if the response is correct, "no" if it is incorrect
rationale: A detailed explanation of which facts are supported or missing
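
When several responses are judged, the two fields can be aggregated into a quick summary. This is a minimal sketch under stated assumptions: `summarize_feedback` is a hypothetical helper, not an MLflow function, and the sample uses stand-in objects that carry only `value` and `rationale` so it runs without calling a judge.

```python
from types import SimpleNamespace


def summarize_feedback(feedbacks) -> dict:
    """Count 'yes'/'no' values and collect the rationales of the 'no' cases."""
    feedbacks = list(feedbacks)
    yes = sum(1 for fb in feedbacks if fb.value == "yes")
    no = sum(1 for fb in feedbacks if fb.value == "no")
    total = len(feedbacks)
    return {
        "yes": yes,
        "no": no,
        # Share of responses judged correct; 0.0 for an empty input.
        "pass_rate": yes / total if total else 0.0,
        "failed_rationales": [fb.rationale for fb in feedbacks if fb.value == "no"],
    }


# Stand-ins with the same two attributes described above (assumption: real
# Feedback objects expose `value` and `rationale` the same way).
sample = [
    SimpleNamespace(value="yes", rationale="All expected facts are supported."),
    SimpleNamespace(value="no", rationale="The release date contradicts the expected fact."),
]

summary = summarize_feedback(sample)
print(summary["pass_rate"])  # 0.5
print(summary["failed_rationales"])
```

Keeping the rationales of the failed cases next to the counts makes it easier to see which expected facts were missing or contradicted.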

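Next steps

Explore other predefined judges - Learn about other built-in quality evaluation judges
Create custom judges - Build domain-specific evaluation judges
Run evaluations - Use judges in comprehensive application evaluation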
