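Use predefined LLM scorers

Overview

MLflow provides built-in LLM Scorers that wrap MLflow's research-backed LLM judges and can assess traces across typical quality dimensions.

Important

You can usually start evaluating with the predefined scorers. As your application logic and evaluation criteria become more complex (or when your application's traces do not meet a scorer's requirements), switch to wrapping the underlying judge in a custom scorer or to creating a custom LLM scorer.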

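Tip

When to use custom scorers instead:

- Your application has complex inputs/outputs that the predefined scorers cannot parse
- You need to evaluate specific business logic or domain-specific criteria
- You want to combine multiple evaluation aspects into a single scorer
- Your trace structure does not match the predefined scorer requirements

For detailed examples, see the custom scorers guide and the custom LLM judges guide.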
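
As a rough illustration of combining several evaluation aspects in one place, the sketch below bundles two checks into a single function. The function name, both checks, and the 50-word limit are hypothetical. In MLflow, a function like this would typically be registered with the `@scorer` decorator from `mlflow.genai.scorers`; the decorator is left out here so the snippet runs without MLflow installed, and the plain dict return value is used only for readability (see the custom scorers guide for the return types MLflow accepts).

```python
from typing import Dict

MAX_WORDS = 50  # hypothetical business rule, not an MLflow default


def concise_and_context_free(outputs: str) -> Dict[str, bool]:
    """Run two illustrative checks on a response string."""
    words = outputs.split()
    return {
        # Aspect 1: the response stays within the word budget.
        "is_concise": len(words) <= MAX_WORDS,
        # Aspect 2: the response does not reveal that context was provided.
        "hides_context": "provided context" not in outputs.lower(),
    }


report = concise_and_context_free("SELECT retrieves rows from a table.")
print(report)
```

Keeping related checks in one function makes sense when they share parsing work; separate scorers are easier to read in the results when the checks are unrelated.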

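How predefined scorers work

When passed a Trace by either `evaluate()` or the monitoring service, a predefined scorer does the following:

1. Parses the `trace` to extract the data required by the LLM judge it wraps.
2. Calls the LLM judge to generate a `Feedback`.
   - The feedback contains a yes/no score along with a written rationale that explains the reasoning behind the score.
3. Returns the feedback to its caller, which attaches it to the Trace.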
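
The three steps above can be sketched in plain Python. This is a conceptual illustration only, not MLflow's implementation: `SimpleFeedback`, `toy_relevance_judge`, `run_scorer`, and the dict-shaped trace are all hypothetical stand-ins, and the "judge" matches keywords where a real one would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SimpleFeedback:
    """Hypothetical stand-in for the feedback a scorer returns."""
    name: str
    value: str  # "yes" or "no"
    rationale: str


def toy_relevance_judge(request: str, response: str) -> SimpleFeedback:
    """Toy judge: a real judge would prompt an LLM instead of matching words."""
    request_terms = {w.strip("?.,!").lower() for w in request.split() if len(w) > 3}
    response_terms = {w.strip("?.,!").lower() for w in response.split()}
    shared = sorted(request_terms & response_terms)
    if shared:
        return SimpleFeedback("relevance", "yes", f"Response mentions: {', '.join(shared)}")
    return SimpleFeedback("relevance", "no", "Response shares no key terms with the request.")


def run_scorer(
    trace: Dict[str, str], judge: Callable[[str, str], SimpleFeedback]
) -> SimpleFeedback:
    # Step 1: parse the trace to extract the fields the judge needs.
    request, response = trace["request"], trace["response"]
    # Step 2: call the judge to produce feedback (yes/no plus rationale).
    feedback = judge(request, response)
    # Step 3: return the feedback so the caller can attach it to the trace.
    return feedback


trace = {
    "request": "What is SELECT in SQL?",
    "response": "SELECT retrieves data from a database.",
}
feedback = run_scorer(trace, toy_relevance_judge)
print(feedback.value, "-", feedback.rationale)
```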

Note

For details on how MLflow passes inputs to a Scorer and attaches the resulting Feedback from a Scorer to a Trace, see the Scorer concept guide.

Prerequisites

Run the following command to install MLflow 3 (version 3.1.0 or later) and the OpenAI package.
Bash
```
pip install --upgrade "mlflow[databricks]>=3.1.0" openai
```
Follow the tracing quickstart to connect your development environment to an MLflow Experiment.

Step 1: Create a sample application to evaluate

The code below defines a simple application with a fake retriever.

Python
```
import os
import mlflow
from openai import OpenAI
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)


# Retriever function called by the sample app
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    return [
        Document(
            id="sql_doc_1",
            page_content="SELECT is a fundamental SQL command used to retrieve data from a database. You can specify columns and use a WHERE clause to filter results.",
            metadata={"doc_uri": "http://example.com/sql/select_statement"},
        ),
        Document(
            id="sql_doc_2",
            page_content="JOIN clauses in SQL are used to combine rows from two or more tables, based on a related column between them. Common types include INNER JOIN, LEFT JOIN, and RIGHT JOIN.",
            metadata={"doc_uri": "http://example.com/sql/join_clauses"},
        ),
        Document(
            id="sql_doc_3",
            page_content="Aggregate functions in SQL, such as COUNT(), SUM(), AVG(), MIN(), and MAX(), perform calculations on a set of values and return a single summary value.  The most common aggregate function in SQL is COUNT().",
            metadata={"doc_uri": "http://example.com/sql/aggregate_functions"},
        ),
    ]


# Sample app that we will evaluate
@mlflow.trace
def sample_app(query: str):
    # 1. Retrieve documents based on the query
    retrieved_documents = retrieve_docs(query=query)
    retrieved_docs_text = "\n".join([doc.page_content for doc in retrieved_documents])

    # 2. Prepare messages for the LLM
    messages_for_llm = [
        {
            "role": "system",
            # Fake prompt to show how the various scorers identify quality issues.
            "content": f"Answer the user's question based on the following retrieved context: {retrieved_docs_text}.  Do not mention the fact that provided context exists in your answer.  If the context is not relevant to the question, generate the best response you can.",
        },
        {
            "role": "user",
            "content": query,
        },
    ]

    # 3. Call LLM to generate the response
    return client.chat.completions.create(
        # This example uses Databricks hosted Claude.  If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        model="databricks-claude-3-7-sonnet",
        messages=messages_for_llm,
    )


# Call the app once to confirm it works and produces a trace
result = sample_app("what is select in sql?")
print(result)
```

Step 2: Create a sample evaluation dataset

Note

`expected_facts` is required only if you use predefined scorers that need ground truth.

Python
```
eval_dataset = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."],
        },
    },
    {
        "inputs": {"query": "How do I use MLflow?"},
        "expectations": {
            "expected_facts": [
                "MLflow is a tool for managing and tracking machine learning experiments."
            ],
        },
    },
]
print(eval_dataset)
```
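
Because only the ground-truth scorers read `expected_facts`, it can help to check a dataset for missing entries before choosing scorers. The helper below is a hypothetical convenience, not part of MLflow; it assumes rows shaped like the dataset above, and the second row deliberately omits `expectations` to show the check.

```python
from typing import Any, Dict, List


def has_expected_facts(row: Dict[str, Any]) -> bool:
    """Return True when the row carries a non-empty expected_facts list."""
    facts = row.get("expectations", {}).get("expected_facts")
    return bool(facts)


# Illustrative rows: the second one has no ground truth on purpose.
rows: List[Dict[str, Any]] = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."]
        },
    },
    {"inputs": {"query": "How do I use MLflow?"}},
]

# Indexes of rows that ground-truth scorers could not evaluate.
missing = [i for i, row in enumerate(rows) if not has_expected_facts(row)]
print(missing)
```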

Step 3: Run evaluation with predefined scorers

Now run the evaluation with the predefined scorers, using the sample app and dataset defined above.

Python
```
from mlflow.genai.scorers import (
    Correctness,
    Guidelines,
    RelevanceToQuery,
    RetrievalGroundedness,
    RetrievalRelevance,
    RetrievalSufficiency,
    Safety,
)


# Run predefined scorers that require ground truth
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[
        Correctness(),
        # RelevanceToQuery(),
        # RetrievalGroundedness(),
        # RetrievalRelevance(),
        RetrievalSufficiency(),
        # Safety(),
    ],
)


# Run predefined scorers that do NOT require ground truth
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[
        # Correctness(),
        RelevanceToQuery(),
        RetrievalGroundedness(),
        RetrievalRelevance(),
        # RetrievalSufficiency(),
        Safety(),
        Guidelines(name="does_not_mention", guidelines="The response must not mention the fact that provided context exists.")
    ],
)
```
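
Since predefined scorers produce yes/no feedback, one simple way to summarize a run is a pass rate per scorer. The sketch below is a hypothetical post-processing step, not an MLflow API: `pass_rates` and the `(scorer_name, value)` tuple format are assumptions, and the sample values are made-up placeholders rather than output from the evaluation above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pass_rates(feedback: List[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the share of "yes" values for each scorer name."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for scorer_name, value in feedback:
        totals[scorer_name] += 1
        if value == "yes":
            passes[scorer_name] += 1
    return {name: passes[name] / totals[name] for name in totals}


# Placeholder feedback values for illustration only.
sample_feedback = [
    ("Safety", "yes"),
    ("Safety", "yes"),
    ("RelevanceToQuery", "yes"),
    ("RelevanceToQuery", "no"),
]
rates = pass_rates(sample_feedback)
print(rates)
```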

[Image: evaluation traces]

[Image: evaluation UI]

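Available scorers

| Scorer | What does it evaluate? | Requires ground truth? | Learn more |
| --- | --- | --- | --- |
| `RelevanceToQuery` | Does the app's response directly address the user's input? | No | Answer & Context Relevance guide |
| `Safety` | Does the app's response avoid harmful or toxic content? | No | Safety guide |
| `RetrievalGroundedness` | Is the app's response grounded in the retrieved information? | No | Groundedness guide |
| `RetrievalRelevance` | Are the retrieved documents relevant to the user's request? | No | Answer & Context Relevance guide |
| `Correctness` | Is the app's response correct compared to the ground truth? | Yes | Correctness guide |
| `RetrievalSufficiency` | Do the retrieved documents contain all the necessary information? | Yes | Context Sufficiency guide |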
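
The ground-truth column can also be encoded as data, which makes it easy to pick scorers that fit a dataset. In this minimal sketch, the mapping copies the table above, while `REQUIRES_GROUND_TRUTH` and `usable_scorers` are hypothetical names; the function returns scorer names as strings so the snippet runs without MLflow installed.

```python
from typing import Dict, List

# Ground-truth requirement per scorer, copied from the table above.
REQUIRES_GROUND_TRUTH: Dict[str, bool] = {
    "RelevanceToQuery": False,
    "Safety": False,
    "RetrievalGroundedness": False,
    "RetrievalRelevance": False,
    "Correctness": True,
    "RetrievalSufficiency": True,
}


def usable_scorers(has_ground_truth: bool) -> List[str]:
    """Return the scorer names that can run with the data available."""
    return [
        name
        for name, needs_ground_truth in REQUIRES_GROUND_TRUTH.items()
        if has_ground_truth or not needs_ground_truth
    ]


print(usable_scorers(has_ground_truth=False))
```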

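Next steps

Continue with these recommended actions and tutorials.

- Create custom scorers - Build code-based metrics for your specific needs
- Create custom LLM scorers - Design sophisticated evaluation criteria using LLMs
- Evaluate your app - See predefined scorers in action with a complete example

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.

- Prebuilt judges & scorers reference - Comprehensive overview of all available judges
- Scorers - Understand how scorers work and their role in evaluation
- LLM judges - Learn about the underlying judge architecture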
