Groundedness judge & scorer

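The predefined judges.is_grounded() judge assesses whether your application's response is factually supported by the provided context (either retrieved by a RAG system or generated by a tool call). It helps detect hallucinations and statements that are not backed by that context.

This judge is available through the predefined RetrievalGroundedness scorer for evaluating RAG applications whose responses must be grounded in retrieved information.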

API signature

Python
from mlflow.genai.judges import is_grounded

def is_grounded(
    *,
    request: str,               # User's original query
    response: str,              # Application's response
    context: Any,               # Context the response is checked against; can be any Python primitive or a JSON-serializable dict
    name: Optional[str] = None  # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

Install MLflow and the required packages
Bash
```
pip install --upgrade "mlflow[databricks]>=3.1.0"
```
Create an MLflow experiment by following the quickstart on setting up your environment.

Direct SDK usage

Python
from mlflow.genai.judges import is_grounded

# Example 1: Response is grounded in context
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."}
    ]
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of groundedness

# Example 2: Response contains hallucination
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris, which has a population of 10 million people",
    context=[
        {"content": "Paris is the capital of France."}
    ]
)
print(feedback.value)  # "no"
print(feedback.rationale)  # Identifies unsupported claim about population

Using the prebuilt scorer

The is_grounded judge is available through the prebuilt RetrievalGroundedness scorer.

Requirements:

Trace requirements:
- The MLflow trace must contain at least one span with span_type set to RETRIEVER
- inputs and outputs must be on the trace's root span
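The two trace requirements above can be checked before running an evaluation. The following is a minimal sketch, not part of MLflow: `check_trace_requirements` is a hypothetical helper, and it assumes each span object exposes `span_type`, `parent_id`, `inputs`, and `outputs` attributes. The stand-in spans below are plain `SimpleNamespace` objects so the sketch runs without MLflow.

```python
from types import SimpleNamespace


def check_trace_requirements(spans):
    """Return a list of problems; an empty list means both requirements are met.

    Hypothetical helper. Assumes every span has span_type, parent_id,
    inputs, and outputs attributes.
    """
    problems = []

    # Requirement 1: at least one span with span_type RETRIEVER.
    if not any(span.span_type == "RETRIEVER" for span in spans):
        problems.append("no span with span_type RETRIEVER")

    # Requirement 2: inputs and outputs must be present on the root span.
    roots = [span for span in spans if span.parent_id is None]
    if not roots:
        problems.append("no root span")
    else:
        root = roots[0]
        if root.inputs is None:
            problems.append("root span has no inputs")
        if root.outputs is None:
            problems.append("root span has no outputs")

    return problems


# Stand-in spans (assumption: shaped like the spans of a traced RAG call).
spans = [
    SimpleNamespace(span_type="CHAIN", parent_id=None,
                    inputs={"query": "What is MLflow?"},
                    outputs={"response": "A platform for the ML lifecycle."}),
    SimpleNamespace(span_type="RETRIEVER", parent_id="root",
                    inputs={"query": "What is MLflow?"},
                    outputs=[{"page_content": "MLflow is an open-source platform."}]),
]
print(check_trace_requirements(spans))
```

With a real trace, the same function could be given the trace's list of spans, provided those objects carry the attributes assumed here.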

Python
import os
import mlflow
from openai import OpenAI
from mlflow.genai.scorers import RetrievalGroundedness
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval based on query
    if "mlflow" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="MLflow is an open-source platform for managing the ML lifecycle.",
                metadata={"source": "mlflow_docs.txt"}
            ),
            Document(
                id="doc_2",
                page_content="MLflow provides tools for experiment tracking, model packaging, and deployment.",
                metadata={"source": "mlflow_features.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_3",
                page_content="Machine learning involves training models on data.",
                metadata={"source": "ml_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve relevant documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate response using LLM
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]

    response = client.chat.completions.create(
        # This example uses Databricks hosted Claude.  If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        model="databricks-claude-3-7-sonnet",
        messages=messages
    )

    return {"response": response.choices[0].message.content}

# Create evaluation dataset
eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"}
    },
    {
        "inputs": {"query": "What are the main features of MLflow?"}
    }
]

# Run evaluation with RetrievalGroundedness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[RetrievalGroundedness()]
)

Using in a custom scorer

When your application's data structure differs from what the predefined scorer requires, wrap the judge in a custom scorer.

Python
from mlflow.genai.judges import is_grounded
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"},
        "outputs": {
            "response": "MLflow is used for managing the ML lifecycle, including experiment tracking and model deployment.",
            "retrieved_context": [
                {"content": "MLflow is a platform for managing the ML lifecycle."},
                {"content": "MLflow includes capabilities for experiment tracking, model packaging, and deployment."}
            ]
        }
    },
    {
        "inputs": {"query": "Who created MLflow?"},
        "outputs": {
            "response": "MLflow was created by Databricks in 2018 and has over 10,000 contributors.",
            "retrieved_context": [
                {"content": "MLflow was created by Databricks."},
                {"content": "MLflow was open-sourced in 2018."}
            ]
        }
    }
]

@scorer
def groundedness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[groundedness_scorer]
)
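If an application returns its retrieved context in mixed shapes, it can help to normalize it into the list-of-`{"content": ...}` form used in the examples above before passing it to the judge. The sketch below is one possible approach, not an MLflow API: `to_context_chunks` is a hypothetical helper, and the fallback key names (`page_content`, `text`) are assumptions about what an application might emit.

```python
from typing import Any, Dict, List


def to_context_chunks(raw: Any) -> List[Dict[str, str]]:
    """Normalize retrieved context into a list of {"content": <text>} dicts.

    Hypothetical helper. Accepts a single string, a single dict, or a list
    mixing strings, dicts, and objects that have a page_content attribute.
    """
    if raw is None:
        return []
    if isinstance(raw, (str, dict)):
        raw = [raw]  # Treat a single item as a one-element list.

    chunks = []
    for item in raw:
        if isinstance(item, str):
            text = item
        elif isinstance(item, dict):
            # Assumed key names; adjust to match your application's output.
            text = item.get("content") or item.get("page_content") or item.get("text") or ""
        else:
            text = getattr(item, "page_content", "") or ""
        if text:  # Skip empty chunks.
            chunks.append({"content": text})
    return chunks


print(to_context_chunks(["MLflow was created by Databricks.",
                         {"page_content": "MLflow was open-sourced in 2018."}]))
```

Inside a custom scorer, the result could be passed as the `context` argument, for example `context=to_context_chunks(outputs["retrieved_context"])`.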

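Interpreting results

The judge returns a Feedback object with:

value: "yes" if the response is grounded, "no" if it contains hallucinations
rationale: A detailed explanation identifying:
- Which statements are supported by the context
- Which statements lack support (hallucinations)
- Specific quotes from the context that support or contradict the claims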
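When the judge is called over many responses, the `value` and `rationale` fields described above can be aggregated to surface the ungrounded cases. This is a minimal sketch under stated assumptions: `summarize_groundedness` is a hypothetical helper, and the sample objects are `SimpleNamespace` stand-ins that only mimic the `value` and `rationale` attributes of a Feedback object.

```python
from types import SimpleNamespace


def summarize_groundedness(feedbacks):
    """Count "yes"/"no" values and collect rationales of ungrounded responses.

    Hypothetical helper. Assumes each item has value and rationale attributes.
    """
    summary = {"yes": 0, "no": 0, "ungrounded_rationales": []}
    for fb in feedbacks:
        if fb.value == "yes":
            summary["yes"] += 1
        elif fb.value == "no":
            summary["no"] += 1
            summary["ungrounded_rationales"].append(fb.rationale)
    return summary


# Stand-ins for Feedback objects returned by is_grounded (illustrative values).
sample = [
    SimpleNamespace(value="yes", rationale="The context states Paris is the capital."),
    SimpleNamespace(value="no", rationale="The population claim is not supported by the context."),
]
summary = summarize_groundedness(sample)
print(summary["yes"], summary["no"])
for rationale in summary["ungrounded_rationales"]:
    print("Ungrounded:", rationale)
```

Reviewing the collected rationales is a quick way to see which unsupported claims recur across responses.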

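Next steps

Evaluate context sufficiency - Check whether your retriever provides adequate information
Evaluate context relevance - Confirm that retrieved documents are relevant to the query
Run a comprehensive RAG evaluation - Combine multiple judges for a complete RAG evaluation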
