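Create custom scorers

Overview

Custom scorers give you maximum flexibility to define exactly how the quality of your GenAI application is measured. Whether they are based on simple heuristics, advanced logic, or programmatic evaluation, they let you define evaluation metrics tailored to your specific business use case.

Use custom scorers in the following scenarios:

Defining a custom heuristic or code-based evaluation metric
Customizing how data from your app's trace is mapped to Databricks' research-backed LLM judges in the predefined LLM scorers
Creating an LLM judge with custom prompt text, as described in the prompt-based LLM scorers article
Using your own LLM model (rather than a Databricks-hosted LLM judge model) for evaluation
Any other use case that needs more flexibility and control than the predefined abstractions provide

Note

For a detailed reference of the custom scorer interfaces, see the scorer concept page or the API docs.

Usage

Custom scorers are written in Python and give you full control to evaluate any data from your app's traces. A single custom scorer works in both the evaluate(...) harness for offline evaluation and, when passed to create_monitor(...), for production monitoring.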

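The following output types are supported:

Pass/fail string: "yes" or "no". String values are rendered as "Pass" or "Fail" in the UI.
Numeric value: Ordinal values, as an integer or a float.
Boolean value: True or False.
Feedback object: Return a Feedback object with a score, rationale, and additional metadata.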
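
As a rough illustration of the primitive return types listed above, the sketch below defines three plain Python functions, one per type. The function names and the sample response are hypothetical. In practice each function would be decorated with @scorer from mlflow.genai.scorers; the decorator is left out here so the sketch runs without MLflow installed.

```python
def has_content(outputs: str) -> str:
    """Pass/fail string: "yes" when the response contains non-whitespace text."""
    return "yes" if outputs.strip() else "no"


def word_count(outputs: str) -> int:
    """Numeric value: number of whitespace-separated words in the response."""
    return len(outputs.split())


def mentions_refund(outputs: str) -> bool:
    """Boolean value: True when the response mentions a refund."""
    return "refund" in outputs.lower()


# Hypothetical app response used only to exercise the three functions.
sample_response = "You can request a refund within 30 days."
print(has_content(sample_response), word_count(sample_response), mentions_refund(sample_response))
```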

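As input, custom scorers have access to:

The complete MLflow trace, including spans, attributes, and outputs. The trace is passed to the custom scorer as an instantiated mlflow.entities.Trace object.
The inputs dictionary, derived from the input dataset or from MLflow post-processing of your trace.
The outputs value, derived from the input dataset or the trace. If predict_fn is provided, the outputs value is the return value of predict_fn.
The expectations dictionary, derived from the expectations field in the input dataset or from assessments associated with the trace.

The @scorer decorator lets you define custom evaluation metrics that can be passed to mlflow.genai.evaluate() through the scorers argument, or to create_monitor(...).

The scorer function is called with named arguments based on the signature below. All named arguments are optional, so you can use any combination. For example, you can define a scorer that takes only inputs and trace as arguments and omits outputs and expectations.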

Python
import mlflow
from mlflow.genai.scorers import scorer
from typing import Optional, Any
from mlflow.entities import Feedback

@scorer
def my_custom_scorer(
  *,  # evaluate(...) harness will always call your scorer with named arguments
  inputs: Optional[dict[str, Any]],  # The agent's raw input, parsed from the Trace or dataset, as a Python dict
  outputs: Optional[Any],  # The agent's raw output, parsed from the Trace or
  expectations: Optional[dict[str, Any]],  # The expectations passed to evaluate(data=...), as a Python dict
  trace: Optional[mlflow.entities.Trace]  # The app's resulting Trace containing spans and other metadata
) -> int | float | bool | str | Feedback | list[Feedback]:
    ...  # Signature only; implement your metric logic here
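
Because every argument is optional, the harness has to work out which named arguments a given scorer declares. The sketch below shows one way this could be done with inspect.signature. It illustrates the calling convention only and is not MLflow's actual implementation; call_with_declared_args, the two example functions, and the sample row are hypothetical.

```python
import inspect
from typing import Any, Callable


def call_with_declared_args(fn: Callable[..., Any], available: dict[str, Any]) -> Any:
    """Call fn with only the keyword arguments named in its signature."""
    declared = inspect.signature(fn).parameters
    kwargs = {name: value for name, value in available.items() if name in declared}
    return fn(**kwargs)


def uses_inputs_and_outputs(*, inputs: dict[str, Any], outputs: Any) -> bool:
    # Declares two of the four arguments; expectations and trace are omitted.
    return inputs["question"] in outputs


def uses_nothing() -> int:
    # Declares no arguments at all, like a placeholder metric.
    return 1


# Hypothetical evaluation row offering all four arguments.
row = {
    "inputs": {"question": "microwave"},
    "outputs": "A microwave costs about $100.",
    "expectations": None,
    "trace": None,
}
result_a = call_with_declared_args(uses_inputs_and_outputs, row)
result_b = call_with_declared_args(uses_nothing, row)
print(result_a, result_b)
```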

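Custom scorer development approach

When developing a metric, you need to iterate on it quickly without running your app every time you change the scorer. To do this, we recommend the following steps:

Step 1: Define your initial metric, app, and evaluation data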

Python
import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any

@mlflow.trace
def my_app(input_field_name: str):
    return {'output': input_field_name+'_output'}

@scorer
def my_metric() -> int:
    # placeholder return value
    return 1

eval_set = [{'inputs': {'input_field_name': 'test'}}]

Step 2: Generate traces from your app using `evaluate()`

Python
eval_results = mlflow.genai.evaluate(
    data=eval_set,
    predict_fn=my_app,
    scorers=[my_metric]
)

Step 3: Query and store the resulting traces

Python
generated_traces = mlflow.search_traces(run_id=eval_results.run_id)

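Step 4: Pass the resulting traces as input to `evaluate()` as you iterate on your metric

The search_traces function returns a Pandas DataFrame of traces that you can pass directly to evaluate() as the input dataset. This lets you iterate on your metric quickly without re-running your app.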

Python
@scorer
def my_metric(outputs: Any):
    # Implement the actual metric logic here.
    return outputs == "test_output"

# Note the lack of a predict_fn parameter
mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[my_metric],
)
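
The benefit of this workflow is that the expensive step (running the app) happens once, while the cheap step (scoring) can be repeated freely. The following plain-Python sketch mimics that separation without MLflow; fake_app, records, score_all, and both metric versions are hypothetical stand-ins.

```python
from typing import Any, Callable

app_calls = 0


def fake_app(input_field_name: str) -> dict[str, str]:
    """Stand-in for an expensive app call; counts how often it runs."""
    global app_calls
    app_calls += 1
    return {"output": input_field_name + "_output"}


# Run the app once and keep inputs and outputs together, like stored traces.
eval_inputs = [{"input_field_name": "test"}, {"input_field_name": "demo"}]
records = [{"inputs": row, "outputs": fake_app(**row)} for row in eval_inputs]


def score_all(metric: Callable[[Any], Any]) -> list[Any]:
    """Apply a metric to the stored outputs without calling the app again."""
    return [metric(record["outputs"]) for record in records]


def metric_v1(outputs: Any) -> int:
    return 1  # placeholder, first iteration


def metric_v2(outputs: Any) -> bool:
    return outputs["output"].endswith("_output")  # second iteration


v1_scores = score_all(metric_v1)
v2_scores = score_all(metric_v2)
print(v1_scores, v2_scores, app_calls)
```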

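Example custom scorers

This guide shows a variety of approaches to building custom scorers.

Custom scorer development

Prerequisites: Create a sample application and get a local copy of the traces

All approaches use the following sample application and a copy of the traces (extracted using the approach above).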

Python
import mlflow
from openai import OpenAI
from typing import Any
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer

# Enable auto logging for OpenAI
mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    # 1. Prepare messages for the LLM
    messages_for_llm = [
        {"role": "system", "content": "You are a helpful assistant."},
        *messages,
    ]

    # 2. Call LLM to generate a response
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )
    return response.choices[0].message.content


# Create a list of messages for the LLM to generate a response
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
]


@scorer
def dummy_metric():
    # This scorer is just to help generate initial traces.
    return 1


# Generate initial traces by running the sample_app.
# The results, including traces, are logged to the MLflow experiment defined above.
initial_eval_results = mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=sample_app, scorers=[dummy_metric]
)

generated_traces = mlflow.search_traces(run_id=initial_eval_results.run_id)

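Running the code above creates three traces in your experiment.

Generated sample traces

Example 1: Access data from the trace

You can access the full MLflow Trace object and use its details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculation.

Note

The generated_traces from the prerequisite section are used as input data for these examples.

This scorer checks whether the response time of the LLM call recorded in the trace is within an acceptable range.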

Python
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Search particular span type from the trace
    llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]

    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9 # second
    max_duration = 5.0
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)
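
Two details of this scorer are worth isolating: the span timestamps are in nanoseconds, and indexing with [0] raises IndexError when no span of the requested type exists. The sketch below handles both in plain Python. FakeSpan is a hypothetical stand-in that carries only the two timestamp attributes used above, and treating a missing span as a failure is an assumption about the desired behavior, not something this guide prescribes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeSpan:
    """Hypothetical stand-in exposing only the timestamps used by the scorer."""
    start_time_ns: int
    end_time_ns: int


def first_span_seconds(spans: Sequence[FakeSpan]) -> Optional[float]:
    """Return the duration of the first span in seconds, or None if there is none."""
    if not spans:
        return None
    span = spans[0]
    return (span.end_time_ns - span.start_time_ns) / 1e9


def within_limit(duration_s: Optional[float], max_duration_s: float = 5.0) -> str:
    """Map a duration to the "yes"/"no" values used for pass/fail results."""
    if duration_s is None:
        return "no"  # assumption: a missing LLM span counts as a failure
    return "yes" if duration_s <= max_duration_s else "no"


fast = first_span_seconds([FakeSpan(start_time_ns=0, end_time_ns=1_500_000_000)])
slow = first_span_seconds([FakeSpan(start_time_ns=0, end_time_ns=7_000_000_000)])
missing = first_span_seconds([])
print(fast, within_limit(fast), within_limit(slow), within_limit(missing))
```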

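Example 2: Wrap a predefined LLM judge

Create a custom scorer that wraps one of MLflow's predefined LLM judges. Use this to preprocess trace data for the judge or to post-process its feedback.

This example shows how to wrap the is_context_relevant judge, which evaluates whether a given context is relevant to a query, to evaluate whether the assistant's response is relevant to the user's query.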

Python
import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
    # The `inputs` field for `sample_app` is a dictionary like: {"messages": [{"role": ..., "content": ...}, ...]}
    # We need to extract the content of the last user message to pass to the relevance judge.

    last_user_message_content = None
    if "messages" in inputs and isinstance(inputs["messages"], list):
        for message in reversed(inputs["messages"]):
            if message.get("role") == "user" and "content" in message:
                last_user_message_content = message["content"]
                break

    if not last_user_message_content:
        raise Exception("Could not extract the last user message from inputs to evaluate relevance.")

    # Call the `is_context_relevant` judge. It will return a Feedback object.
    return is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )

# Evaluate the custom relevance scorer
custom_relevance_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[is_message_relevant]
)

Example 3: Use `expectations`

When you call mlflow.genai.evaluate() with a data argument that is a list of dictionaries or a Pandas DataFrame, each row can contain an expectations key. The value associated with this key is passed directly to your custom scorer.

Python
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
from typing import Any, List, Optional, Union

expectations_eval_dataset_list = [
    {
        "inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
        "expectations": {
            "expected_response": "2+2 equals 4.",
            "expected_keywords": ["4", "four", "equals"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
        "expectations": {
            "expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
            "expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
        "expectations": {
            "expected_response": "Hello there!",
            # No keywords needed for this one, but the field can be omitted or empty
        }
    }
]

Example 3.1: Exact match with the expected response

This scorer checks whether the assistant's response exactly matches the expected_response provided in expectations.

Python
@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Scorer can return primitive value like bool, int, float, str, etc.
    return outputs == expectations["expected_response"]

exact_match_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[exact_match]
)
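
Exact string equality is strict: differences in case, spacing, or trailing punctuation make an otherwise correct LLM response fail. If that is too brittle for your use case, one possible variant normalizes both strings before comparing them. This is a sketch of a swapped-in technique rather than the scorer shown above; normalize and normalized_match are hypothetical helpers that could be called from inside a scorer. Note that stripping punctuation also removes symbols such as "+", which may not suit every dataset.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def normalized_match(outputs: str, expected_response: str) -> bool:
    """Compare two strings after normalization."""
    return normalize(outputs) == normalize(expected_response)


same_greeting = normalized_match("hello   there", "Hello there!")
different_answer = normalized_match("2+2 equals 5.", "2+2 equals 4.")
print(same_greeting, different_answer)
```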

Example 3.2: Keyword presence check from expectations

This scorer checks whether all expected_keywords from expectations are present in the assistant's response.

Python
@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    if expected_keywords is None:
        return Feedback(
            value=None, # Undetermined, as no keywords were expected
            rationale="No 'expected_keywords' provided in expectations."
        )

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)
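
The substring test used above also matches keywords inside longer tokens, so "4" is found in "14". When that matters, a whole-word check is one option. The sketch below uses a regular expression with lookarounds; missing_keywords is a hypothetical helper, and treating letters, digits, and underscores as word characters is an assumption.

```python
import re


def missing_keywords(outputs: str, expected_keywords: list[str]) -> list[str]:
    """Return the keywords that do not appear as whole words in outputs."""
    missing = []
    for keyword in expected_keywords:
        # (?<!\w) and (?!\w) require that the match is not glued to word characters.
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, outputs, flags=re.IGNORECASE) is None:
            missing.append(keyword)
    return missing


partial = missing_keywords("The total is 14.", ["4", "total"])
complete = missing_keywords("2+2 Equals 4.", ["4", "equals"])
print(partial, complete)
```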

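Example 4: Return multiple feedback objects

A single scorer can return a list of Feedback objects, which lets one scorer assess multiple quality facets (such as PII, sentiment, and conciseness) at the same time. Each Feedback object should ideally have a unique name, which becomes the metric name in the results. Otherwise, the objects may overwrite each other if their names are auto-generated and collide. If no name is provided, MLflow attempts to generate one based on the scorer function name and an index.

This example shows a scorer that returns two distinct feedbacks for each trace:

is_not_empty_check: A pass/fail value ("yes" or "no") indicating whether the response content is non-empty.
response_char_length: A numeric value for the character length of the response.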

Python
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
    feedbacks = []
    # 1. Check if the response is not empty
    feedbacks.append(
        Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
    )
    # 2. Calculate response character length
    char_length = len(outputs)
    feedbacks.append(Feedback(name="response_char_length", value=char_length))
    return feedbacks

multi_feedback_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[comprehensive_response_checker]
)

The results contain two columns as assessments: is_not_empty_check and response_char_length.

Multi-feedback results

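Example 5: Use your own LLM for a judge

Integrate a custom or externally hosted LLM within a scorer. The scorer handles the API calls and input/output formatting and generates a Feedback from the LLM's response, which gives you full control over the judging process.

You can also set the source field on the Feedback object to indicate that the source of the assessment is an LLM judge.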

Python
import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional


# Assume `generated_traces` is available from the prerequisite code block.
# Assume `client` (OpenAI SDK client configured for Databricks) is available from the prerequisite block.
# client = OpenAI(...)

# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.

Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.

Original User Query:
```{user_query}```

AI's Response:
```{llm_response_from_app}```

Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""

@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
    user_query = inputs["messages"][-1]["content"]

    # Call the Judge LLM using the OpenAI SDK client.
    judge_llm_response_obj = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
        messages=[
            {"role": "system", "content": judge_system_prompt},
            {"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
        ],
        max_tokens=200,  # Max tokens for the judge's rationale
        temperature=0.0, # For more deterministic judging
    )
    judge_llm_output_text = judge_llm_response_obj.choices[0].message.content

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(judge_llm_output_text)
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="claude-3-7-sonnet",
        )
    )


# Evaluate the scorer using the pre-generated traces.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[answer_quality]
)
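
The scorer above assumes the judge returns clean JSON with an in-range score. Models sometimes wrap JSON in a code fence or surrounding prose, in which case json.loads raises an error. The sketch below shows one defensive way to parse the judge output; parse_judge_output is a hypothetical helper, and returning None for unusable output (instead of raising) is an assumption about how you want failures handled.

```python
import json
from typing import Optional


def parse_judge_output(text: str) -> Optional[tuple[int, str]]:
    """Extract (score, rationale) from judge output, or None if it is unusable."""
    # Keep only the span from the first "{" to the last "}" to drop fences or prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(text[start : end + 1])
        score = int(data["score"])
        rationale = str(data["rationale"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    # The prompt asks for an integer from 1 to 5; reject anything else.
    if not 1 <= score <= 5:
        return None
    return score, rationale


# Build a fenced reply without writing a literal fence inside this example.
fence = "`" * 3
wrapped = f'{fence}json\n{{"score": 4, "rationale": "Mostly accurate."}}\n{fence}'
parsed_wrapped = parse_judge_output(wrapped)
parsed_prose = parse_judge_output("I cannot evaluate this.")
parsed_out_of_range = parse_judge_output('{"score": 9, "rationale": "Out of range."}')
print(parsed_wrapped, parsed_prose, parsed_out_of_range)
```

Inside a scorer, a None result could be turned into whatever failure signal fits your evaluation, for example by raising an exception so the row is reported as an error.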

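By opening a trace in the UI and clicking the "answer_quality" assessment, you can see the judge's metadata, such as the rationale, timestamp, and judge model name. If the judge's assessment is not correct, you can override the score by clicking the Edit button.

The new assessment supersedes the original judge assessment, but the edit history is preserved for future reference.

Edit LLM judge assessment

Next steps

Continue your journey with these recommended actions and tutorials.

Evaluate with custom LLM scorers - Create semantic evaluations using LLMs
Run scorers in production - Deploy your scorers for continuous monitoring
Build evaluation datasets - Create test data for your scorers

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.

Scorers - Deep dive into how scorers work and their architecture
Evaluation Harness - Understand how mlflow.genai.evaluate() uses your scorers
LLM judges - Learn the foundation for AI-powered evaluation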
