RetrievalSufficiency judge

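The RetrievalSufficiency judge evaluates whether the retrieved context (from a RAG application, an agent, or any system that retrieves documents) contains enough information to adequately answer the user's request. It makes this assessment against ground-truth labels provided as expected_facts or expected_response.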

This built-in LLM judge is designed for evaluating RAG systems where you need to confirm that the retrieval step provides all the necessary information.

Prerequisites for running the examples

Install MLflow and the required packages:
Bash
```
pip install --upgrade "mlflow[databricks]>=3.4.0"
```
Create an MLflow experiment by following the quickstart for setting up your environment.

Usage examples

The RetrievalSufficiency judge can be invoked directly to evaluate a single trace, or used with MLflow's evaluation framework for batch evaluation.

Requirements:

Trace requirements:
- The MLflow trace must contain at least one span with span_type set to RETRIEVER.
- inputs and outputs must be on the trace's root span.

Ground-truth labels: Required. You must provide either expected_facts or expected_response in the expectations dictionary.
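
The ground-truth requirement above can be checked before an evaluation run. The following is a minimal sketch, not part of MLflow: `has_ground_truth` is a hypothetical helper, and it assumes each dataset row is a plain dictionary with an optional `expectations` key, as in the examples on this page.

```python
from typing import Any, Dict

# Hypothetical helper (not an MLflow API): checks that a dataset row carries
# at least one of the ground-truth labels the judge requires.
REQUIRED_LABEL_KEYS = ("expected_facts", "expected_response")


def has_ground_truth(row: Dict[str, Any]) -> bool:
    expectations = row.get("expectations") or {}
    # A label counts only if it is present and non-empty.
    return any(expectations.get(key) for key in REQUIRED_LABEL_KEYS)


rows = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {"expected_facts": ["Paris is the capital of France."]},
    },
    # No expectations: this row would not satisfy the requirement.
    {"inputs": {"query": "What is MLflow?"}},
]

for row in rows:
    print(row["inputs"]["query"], "->", has_ground_truth(row))
```

Filtering or flagging such rows up front can make a missing label easier to spot than a failure in the middle of a batch evaluation.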

The first example below invokes the judge directly; the second invokes it with evaluate().

Python
```
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Instantiate the judge, then call it on the trace to assess whether the
# retrieved context is sufficient for the expected facts
feedback = RetrievalSufficiency()(
    trace=trace,
    expectations={
        "expected_facts": [
            "MLflow has four main components",
            "Components include Tracking",
            "Components include Projects",
            "Components include Models",
            "Components include Registry"
        ]
    }
)
print(feedback)
```

Python
```
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

# Evaluate traces from previous runs with ground truth expectations
results = mlflow.genai.evaluate(
    data=eval_dataset,  # Dataset with trace data and expected_facts
    scorers=[RetrievalSufficiency()]
)
```

RAG example

The following complete example creates a RAG application and evaluates whether the retrieved context is sufficient.

Initialize an OpenAI client to connect to either Databricks-hosted LLMs or OpenAI-hosted LLMs. Run only one of the two setup blocks below.

Databricks-hosted LLMs: Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

Python
```
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"
```

OpenAI-hosted LLMs: Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

Python
```
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI-hosted models
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
```

Define the RAG application and evaluate it.

Python
```
from mlflow.genai.scorers import RetrievalSufficiency
from mlflow.entities import Document
from typing import List


# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - some queries return insufficient context
    if "capital of france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="France is a country in Western Europe.",
                metadata={"source": "countries.txt"}
            )
        ]
    elif "mlflow components" in query.lower():
        # Incomplete retrieval - missing some components
        return [
            Document(
                id="doc_3",
                page_content="MLflow has multiple components including Tracking and Projects.",
                metadata={"source": "mlflow_intro.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_4",
                page_content="General information about data science.",
                metadata={"source": "ds_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate response
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]

    response = client.chat.completions.create(
        # model_name comes from the client setup step above
        # (a Databricks-hosted model or an OpenAI model such as gpt-4o)
        model=model_name,
        messages=messages
    )

    return {"response": response.choices[0].message.content}

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        }
    },
    {
        "inputs": {"query": "What are all the MLflow components?"},
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    }
]

# Run evaluation with RetrievalSufficiency scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[
        RetrievalSufficiency(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to a Databricks-hosted judge model.
        )
    ]
)
```
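
To build intuition for why the second query is expected to come back as insufficient, the sketch below runs a crude keyword check over the simulated document returned for the MLflow query. This is only an illustration: the judge uses an LLM rather than substring matching, and `missing_terms` and the list of required terms are hypothetical names chosen for this example.

```python
from typing import List


def missing_terms(retrieved_texts: List[str], required_terms: List[str]) -> List[str]:
    """Return the required terms that appear in none of the retrieved texts."""
    combined = " ".join(retrieved_texts).lower()
    return [term for term in required_terms if term.lower() not in combined]


# Text of the single document returned for the "mlflow components" query above.
retrieved = ["MLflow has multiple components including Tracking and Projects."]

# Component names taken from the expected_facts in the evaluation dataset.
required = ["Tracking", "Projects", "Models", "Registry"]

print(missing_terms(retrieved, required))  # ['Models', 'Registry']
```

The retrieved text never mentions Models or Registry, so no answer grounded in it can cover all the expected facts. This is the kind of gap the judge is meant to surface.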

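Understand the results

The RetrievalSufficiency scorer evaluates each retriever span separately. It:

- Returns "yes" if the retrieved documents contain all the information needed to produce the expected facts
- Returns "no" if the retrieved documents are missing important information, along with a rationale that explains what is missing

This helps you identify cases where the retrieval system fails to fetch all the necessary information, a common cause of incomplete or incorrect responses in RAG applications.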

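Select the LLM that powers the judge

By default, these judges use a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model with the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.

You can customize the judge by providing a different judge model: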
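
As a small illustration of the `<provider>:/<model-name>` format, the sketch below splits a model string into its two parts and rejects strings that do not match. `split_judge_model` is a hypothetical helper written for this page, not an MLflow function, and it makes no claim about how MLflow parses the value internally.

```python
from typing import Tuple


def split_judge_model(model: str) -> Tuple[str, str]:
    """Split '<provider>:/<model-name>' into (provider, model_name)."""
    provider, separator, model_name = model.partition(":/")
    if not separator or not provider or not model_name:
        raise ValueError(f"Expected '<provider>:/<model-name>', got {model!r}")
    return provider, model_name


# With the databricks provider, the second part is the serving endpoint name.
print(split_judge_model("databricks:/databricks-gpt-oss-120b"))
```

A bare model name such as `gpt-4o-mini`, with no provider prefix, raises a `ValueError` in this sketch.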

Python
```
from mlflow.genai.scorers import RetrievalSufficiency

# Use a different judge model
sufficiency_judge = RetrievalSufficiency(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[sufficiency_judge]
)
```

For a list of supported models, see the MLflow documentation.

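Interpret the results

The judge returns a Feedback object with the following fields:

- value: "yes" if the context is sufficient, "no" if it is insufficient
- rationale: An explanation of which expected facts the context covers or lacks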
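
One way to act on these two fields is to collect the rationale of every "no" result for review. The sketch below is an assumption-laden illustration: `JudgeResult` is a stand-in dataclass that mirrors only the `value` and `rationale` fields described above (it is not the MLflow Feedback class), the sample rationales are placeholders rather than real judge output, and the code assumes `value` compares equal to the plain strings "yes" and "no".

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class JudgeResult:
    """Stand-in for the two Feedback fields described above (not the MLflow class)."""
    value: str
    rationale: str


def insufficient_rationales(results: Iterable[JudgeResult]) -> List[str]:
    """Collect the rationale of every result whose value is "no"."""
    return [result.rationale for result in results if result.value == "no"]


# Placeholder results for illustration only; real rationales come from the judge.
results = [
    JudgeResult("yes", "The context states that Paris is the capital of France."),
    JudgeResult("no", "The context does not mention Models or Registry."),
]

for reason in insufficient_rationales(results):
    print(reason)
```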

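Next steps

Evaluate context relevance - Confirm that the retrieved documents are relevant before checking sufficiency
Evaluate groundedness - Verify that responses use only the provided context
Build evaluation datasets - Create ground-truth datasets with expected facts for testing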
