Evaluation sets (MLflow 2)

Note

MLflow 2

This page describes how to use Agent Evaluation version 0.22 with MLflow 2. Databricks recommends using MLflow 3, which is integrated with Agent Evaluation >1.0. In MLflow 3, the Agent Evaluation APIs are part of the `mlflow` package.

For information on this topic, see "Building MLflow evaluation datasets".

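To measure the quality of an AI agent, you need to be able to define a representative set of requests along with the criteria that characterize a high-quality response. You do that by providing an evaluation set. This article covers the options for an evaluation set and some best practices for creating one.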

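Databricks recommends creating a human-labeled evaluation set that consists of representative questions and ground-truth answers. If your application includes a retrieval step, you can optionally provide the supporting documents on which you expect the response to be based. To help you get started, Databricks provides an SDK that generates high-quality synthetic questions and ground-truth answers, which you can use directly in Agent Evaluation or send to subject-matter experts for review. See "Synthesize evaluation sets".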

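A good evaluation set has the following characteristics:

Representative: It should accurately reflect the range of requests the application will encounter in production.
Challenging: It should include difficult and diverse cases so that it effectively tests the full range of the application's capabilities.
Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.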

For the required schema of an evaluation set, see "Agent Evaluation input schema (MLflow 2)".
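Because an evaluation set is a plain list of dictionaries, a misspelled field name is easy to miss. The following is a minimal sketch of a pre-flight check, not part of Agent Evaluation. `KNOWN_FIELDS` and `find_schema_problems` are hypothetical names, the field list covers only the fields used in the samples on this page, and the check assumes that every record needs a `request`, as every sample here has one. The full schema may define additional valid fields, so treat an "unknown field" message as a prompt to check the schema article rather than as an error.

```python
# Field names that appear in the samples on this page.
# Assumption: this is NOT the full schema; see the input schema article.
KNOWN_FIELDS = {
    "request_id",
    "request",
    "response",
    "expected_response",
    "expected_facts",
    "expected_retrieved_context",
    "retrieved_context",
    "guidelines",
}


def find_schema_problems(eval_set):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for index, record in enumerate(eval_set):
        if "request" not in record:
            problems.append(f"record {index}: missing 'request'")
        # Report any field that is not used in the samples on this page.
        for field in sorted(set(record) - KNOWN_FIELDS):
            problems.append(f"record {index}: unknown field '{field}'")
    return problems


eval_set = [
    {"request": "What is the difference between reduceByKey and groupByKey in Spark?"},
    # Misspelled field name, and no request.
    {"expected_retrieved_content": [{"doc_uri": "doc_uri_1"}]},
]
for problem in find_schema_problems(eval_set):
    print(problem)
```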

Sample evaluation sets

This section includes simple examples of evaluation sets.

Sample evaluation set with only `request`

Python
eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

Sample evaluation set with `request` and `expected_response`

Python
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with `request`, `expected_response`, and `expected_retrieved_context`

Python
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with `request` and `response`

Python
eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]
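When every record already carries a `response`, as in the sample above, the set can be evaluated without calling the agent. The sketch below shows one way to wire that up, assuming the `mlflow`, `pandas`, and `databricks-agents` packages are installed in the environment. `has_precomputed_responses` and `evaluate_eval_set` are hypothetical helper names; `mlflow.evaluate` with `model_type="databricks-agent"` is the MLflow 2 entry point for Agent Evaluation.

```python
def has_precomputed_responses(eval_set):
    """Return True when every record already carries a non-empty `response`."""
    return all(record.get("response") for record in eval_set)


def evaluate_eval_set(eval_set, model=None):
    """Run Agent Evaluation on an evaluation set.

    Assumption: mlflow, pandas, and databricks-agents are installed.
    """
    # Fail early, before importing anything heavy, if responses are missing
    # and there is no model to generate them.
    if model is None and not has_precomputed_responses(eval_set):
        raise ValueError("Records without `response` need a model to generate one.")

    # Imported lazily so the check above works without these packages.
    import mlflow
    import pandas as pd

    return mlflow.evaluate(
        data=pd.DataFrame(eval_set),
        model=model,
        model_type="databricks-agent",
    )


eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]
print(has_precomputed_responses(eval_set))
```

For a set that contains only `request`, pass the agent as `model` so that responses can be generated during evaluation.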

Sample evaluation set with arbitrarily formatted `request` and `response`

Python
eval_set = [
    {
        "request": {"query": "Difference between", "item_a": "reduceByKey", "item_b": "groupByKey"},
        "response": {
            "differences": [
                "reduceByKey aggregates data before shuffling",
                "groupByKey shuffles all data",
                "reduceByKey is more efficient",
            ]
        }
    }
]

Sample evaluation set with `request`, `response`, and `guidelines`

Python
eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        # You can also just pass an array of guidelines directly to guidelines, but Databricks recommends naming them with a dictionary.
        "guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        }
    }
]
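The comment in the sample above notes that `guidelines` can be a bare array but that a named dictionary is recommended. If existing data uses the array form, a small conversion step can bring it into the named form. This is a sketch only: `name_guidelines` and the default name `"general"` are hypothetical and not part of any Databricks API.

```python
def name_guidelines(guidelines, default_name="general"):
    """Return guidelines as a dict that maps a name to a list of guideline strings."""
    if isinstance(guidelines, dict):
        # Already named; copy so the caller's data is not mutated later.
        return {name: list(items) for name, items in guidelines.items()}
    # A bare list is grouped under a single, hypothetical default name.
    return {default_name: list(guidelines)}


print(name_guidelines(["The response must be in English"]))
```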

Sample evaluation set with `request`, `response`, `guidelines`, and `expected_facts`

Python
eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "expected_facts": [
            "There's no significant difference.",
        ],
        # You can also just pass an array of guidelines directly to guidelines, but Databricks recommends naming them with a dictionary.
        "guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        }
    }
]

Sample evaluation set with `request`, `response`, and `retrieved_context`

Python
eval_set = [
    {
        "request_id": "request-id", # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with `request`, `response`, `retrieved_context`, and `expected_facts`

Python
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_facts": [
            "There's no significant difference.",
        ],
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with `request`, `response`, `retrieved_context`, `expected_facts`, and `expected_retrieved_context`

Python
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_facts": [
            "There's no significant difference.",
        ],
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
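To see what pairing `retrieved_context` with `expected_retrieved_context` makes possible, the sketch below computes the fraction of expected documents that were actually retrieved, matching on `doc_uri`. It is an illustration of the idea only, not the implementation used by Agent Evaluation, and `doc_recall` is a hypothetical helper name.

```python
def doc_recall(record):
    """Fraction of expected doc_uris that also appear in retrieved_context."""
    expected = {doc["doc_uri"] for doc in record.get("expected_retrieved_context", [])}
    retrieved = {doc["doc_uri"] for doc in record.get("retrieved_context", [])}
    if not expected:
        return None  # Nothing to compare against.
    return len(expected & retrieved) / len(expected)


# Same doc_uri values as the sample above; other fields omitted for brevity.
record = {
    "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    "expected_retrieved_context": [
        {"doc_uri": "doc_uri_2_1"},
        {"doc_uri": "doc_uri_2_2"},
    ],
    "retrieved_context": [
        {"doc_uri": "doc_uri_2_1"},
        {"doc_uri": "doc_uri_6_extra"},
    ],
}
print(doc_recall(record))  # 0.5: one of the two expected documents was retrieved
```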

Best practices for developing an evaluation set

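Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and the ability to infer answers from indirect evidence.
Consider testing adversarial scenarios from malicious users.
There is no specific guideline on the number of questions to include in an evaluation set, but a clear signal from high-quality data is typically more useful than a noisy signal from weak data.
Consider including examples that are very challenging, even for humans to answer.
Whether you are building a general-purpose application or targeting a specific domain, your app will likely encounter a wide variety of questions. The evaluation set should reflect that. For example, even if you are creating an application that answers specific HR questions, you should still consider testing other domains (for example, operations) to make sure that the application does not get confused or give harmful responses.
High-quality, consistent human-generated labels are the best way to ensure that the ground-truth values you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels are the following:
- Aggregate the responses (labels) from multiple human labelers for the same question.
- Make sure that the labeling instructions are clear and that the human labelers are consistent.
- Make sure that the conditions of the human-labeling process are identical to the format of the requests submitted to the RAG application.
Human labelers are by nature noisy and inconsistent, for example because they interpret questions differently. This is an important part of the process. Human labeling can reveal interpretations of a question that you had not considered, and that can give you insight into the behavior you observe in your application.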
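The advice to aggregate labels from several labelers can be sketched as a majority vote that also reports how strongly the labelers agree. The example assumes discrete labels (for instance, a "correct" or "incorrect" judgment on a candidate answer) and at least one label per question. `aggregate_labels` and the `min_agreement` threshold of 0.6 are hypothetical choices for illustration. Questions with low agreement are flagged instead of silently resolved, because disagreement often points to an ambiguous question.

```python
from collections import Counter


def aggregate_labels(labels_by_question, min_agreement=0.6):
    """Majority-vote the labels for each question and flag low-agreement questions."""
    results = {}
    for question, labels in labels_by_question.items():
        # most_common(1) returns the label with the highest count.
        top_label, votes = Counter(labels).most_common(1)[0]
        agreement = votes / len(labels)
        results[question] = {
            "label": top_label,
            "agreement": agreement,
            "needs_review": agreement < min_agreement,
        }
    return results


labels = {
    "q1": ["correct", "correct", "incorrect"],
    "q2": ["correct", "incorrect", "unsure"],
}
for question, result in aggregate_labels(labels).items():
    print(question, result["label"], result["needs_review"])
```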
