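Correctness judge

The Correctness judge evaluates whether a GenAI application's response is factually correct by comparing it against provided ground-truth information (expected_facts or expected_response).

This built-in LLM judge is designed for evaluating application responses against known correct answers.

Prerequisites for running the examples

Install MLflow and the required packages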

Python
%pip install --upgrade "mlflow[databricks]>=3.4.0"
dbutils.library.restartPython()

To create an MLflow experiment, follow the environment setup quickstart.

Usage examples

Invoke directly
Invoke with evaluate()

Python
from mlflow.genai.scorers import Correctness

correctness_judge = Correctness()

# Example 1: Response contains expected facts
feedback = correctness_judge(
    inputs={"request": "What is MLflow?"},
    outputs={"response": "MLflow is an open-source platform for managing the ML lifecycle."},
    expectations={
        "expected_facts": [
            "MLflow is open-source",
            "MLflow is a platform for ML lifecycle"
        ]
    }
)

print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of which facts are supported

# Example 2: Response missing or contradicting facts
feedback = correctness_judge(
    inputs={"request": "When was MLflow released?"},
    outputs={"response": "MLflow was released in 2017."},
    expectations={"expected_facts": ["MLflow was released in June 2018"]}
)

# Example 3: Using expected_response instead of expected_facts
feedback = correctness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "The capital of France is Paris."},
    expectations={"expected_response": "Paris is the capital of France."}
)

The Correctness judge can be used directly with MLflow's evaluation framework.

Requirements:

Trace requirements: inputs and outputs must be on the trace's root span
Ground-truth labels: Required. The expectations dictionary must provide expected_facts or expected_response

Python
import mlflow
from mlflow.genai.scorers import Correctness

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        },
    },
    {
        "inputs": {"query": "What are the main components of MLflow?"},
        "outputs": {
            "response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        },
    },
    {
        "inputs": {"query": "When was MLflow released?"},
        "outputs": {
            "response": "MLflow was released in 2017 by Databricks."
        },
        "expectations": {
            "expected_facts": ["MLflow was released in June 2018"]
        },
    }
]

# Run evaluation with Correctness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Correctness(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to custom Databricks model.
        )
    ]
)
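
Because the judge requires ground truth, it can help to check each record before running an evaluation. The following is a minimal sketch, not part of MLflow: find_records_missing_ground_truth is a hypothetical helper that returns the indices of records whose expectations dictionary has neither expected_facts nor expected_response. The sample records are shortened versions of the ones above.

```python
from typing import Any

# Hypothetical helper (not an MLflow API): flags records that the
# Correctness judge could not score because ground truth is missing.
GROUND_TRUTH_KEYS = ("expected_facts", "expected_response")


def find_records_missing_ground_truth(dataset: list[dict[str, Any]]) -> list[int]:
    """Return indices of records without a non-empty ground-truth entry."""
    missing = []
    for index, record in enumerate(dataset):
        expectations = record.get("expectations") or {}
        # A record qualifies if at least one ground-truth key is present and non-empty.
        if not any(expectations.get(key) for key in GROUND_TRUTH_KEYS):
            missing.append(index)
    return missing


sample_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {"response": "Paris."},
        "expectations": {"expected_facts": ["Paris is the capital of France."]},
    },
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {"response": "An ML platform."},
        "expectations": {},  # No ground truth: cannot be judged for correctness
    },
]

print(find_records_missing_ground_truth(sample_dataset))  # [1]
```

Running a check like this first surfaces incomplete records before any judge calls are made.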

Alternative: expected_response

You can also use expected_response instead of expected_facts.

Python
import mlflow
from mlflow.genai.scorers import Correctness

eval_dataset_with_response = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {
            "response": "MLflow is an open-source platform for managing the ML lifecycle."
        },
        "expectations": {
            "expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
        },
    }
]

# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
    data=eval_dataset_with_response,
    scorers=[Correctness()]
)

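Tip

For more flexible evaluation, use expected_facts rather than expected_response. The response does not need to match word for word; it only needs to contain the key facts.

Select the LLM that powers the judge

By default, these judges use a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model with the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.

You can customize the judge by providing a different judge model: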

Python
import mlflow
from mlflow.genai.scorers import Correctness

# Use a different judge model
correctness_judge = Correctness(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation (eval_dataset is the list defined in the earlier example)
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[correctness_judge]
)
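
To illustrate the <provider>:/<model-name> format described above, the sketch below splits a model string into its two parts. split_model_uri is a hypothetical helper written for this example; MLflow does its own parsing internally, and this is not that implementation.

```python
def split_model_uri(model: str) -> tuple[str, str]:
    """Split '<provider>:/<model-name>' into (provider, model_name)."""
    provider, separator, model_name = model.partition(":/")
    # Reject strings that lack the ':/' separator or either side of it.
    if not separator or not provider or not model_name:
        raise ValueError(f"Expected '<provider>:/<model-name>', got {model!r}")
    return provider, model_name


provider, model_name = split_model_uri("databricks:/databricks-gpt-5-mini")
print(provider)    # databricks
print(model_name)  # databricks-gpt-5-mini
```

With databricks as the provider, the second part is the serving endpoint name.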

For a list of supported models, see the MLflow documentation.

Interpret results

The judge returns a Feedback object with:

value: "yes" if the response is correct, "no" if it is incorrect
rationale: A detailed explanation of which facts are supported or missing
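
When many responses are judged, the value fields can be aggregated into a pass rate. The sketch below assumes each result exposes value ("yes" or "no") and rationale as described above. JudgeResult is a hypothetical stand-in for the Feedback object so that the example runs without calling a judge, and the sample rationales are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Hypothetical stand-in for a Feedback object (value and rationale only)."""
    value: str
    rationale: str


def pass_rate(results: list[JudgeResult]) -> float:
    """Fraction of results whose value is "yes"; 0.0 for an empty list."""
    if not results:
        return 0.0
    passed = sum(1 for result in results if result.value == "yes")
    return passed / len(results)


# Illustrative results only; real rationales come from the judge model.
results = [
    JudgeResult("yes", "All expected facts are supported."),
    JudgeResult("no", "The release date contradicts the expected fact."),
    JudgeResult("yes", "The response matches the expected response."),
    JudgeResult("yes", "All expected facts are supported."),
]

print(pass_rate(results))  # 0.75

# Print the rationale for each failure to see what went wrong.
for result in results:
    if result.value == "no":
        print(result.rationale)
```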

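Next steps

Explore other built-in judges - Learn about other built-in quality evaluation judges
Create custom judges - Build domain-specific evaluation judges
Run evaluations - Use judges in comprehensive application evaluation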
