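Custom metrics (MLflow 2)

Important

Databricks recommends using MLflow 3 for evaluating and monitoring GenAI apps. This page describes MLflow 2 Agent Evaluation.

For an introduction to evaluation and monitoring on MLflow 3, see Evaluate and monitor AI agents.
For information about migrating to MLflow 3, see Migrate to MLflow 3 from Agent Evaluation.
For MLflow 3 information on this topic, see Code-based scorers.

This guide explains how to use custom metrics to evaluate AI applications that use custom agents. Custom metrics let you define evaluation metrics tailored to your specific business use case. They can be based on simple heuristics, advanced logic, or programmatic evaluations.

Overview

Custom metrics are written in Python and give developers full control over how traces through an AI application are evaluated. The following metric types are supported:

Pass/fail metrics: "yes" or "no" string values, rendered as "Pass" or "Fail" in the UI.
Numeric metrics: ordinal values, either integers or floats.
Boolean metrics: True or False.

Custom metrics can use:

Any field in the evaluation row.
The custom_expected field for additional expected values.
Complete access to the MLflow trace, including spans, attributes, and outputs.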
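
As a rough illustration of the three metric types, the sketch below writes each one as a plain Python function. The function names are hypothetical, and the @metric decorator is left off so that the snippet runs without Databricks. In real use, each function would be decorated and passed through extra_metrics.

```python
def not_empty(response: str) -> str:
    # Pass/fail metric: the strings "yes" and "no" are shown as Pass and Fail.
    return "yes" if response.strip() else "no"


def response_word_count(response: str) -> int:
    # Numeric metric: an integer (a float works the same way).
    return len(response.split())


def mentions_llm(response: str) -> bool:
    # Boolean metric: True or False.
    return "LLM" in response


# Hypothetical evaluation row, used only to exercise the functions.
row = {"response": "I am an LLM."}
print(not_empty(row["response"]))
print(response_word_count(row["response"]))
print(mentions_llm(row["response"]))
```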

Usage

Custom metrics are passed to the evaluation framework through the extra_metrics field of mlflow.evaluate(). For example:

Python
import mlflow
from databricks.agents.evals import metric

@metric
def not_empty(response):
    # "yes" for Pass and "no" for Fail.
    return "yes" if response.choices[0]['message']['content'].strip() != "" else "no"

@mlflow.trace(span_type="CHAT_MODEL")
def my_model(request):
    deploy_client = mlflow.deployments.get_deploy_client("databricks")
    return deploy_client.predict(
        endpoint="databricks-meta-llama-3-3-70b-instruct", inputs=request
    )

with mlflow.start_run(run_name="example_run"):
    eval_results = mlflow.evaluate(
        data=[{"request": "Good morning"}],
        model=my_model,
        model_type="databricks-agent",
        extra_metrics=[not_empty],
    )
    display(eval_results.tables["eval_results"])

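`@metric` decorator

The @metric decorator lets users define custom evaluation metrics that can be passed to mlflow.evaluate() using the extra_metrics argument. The evaluation harness invokes the metric function with named arguments, based on the signature below: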

Python
def my_metric(
  *,  # eval harness will always call it with named arguments
  request: Dict[str, Any],  # The agent's raw input as a serializable object
  response: Optional[Dict[str, Any]],  # The agent's raw output; directly passed from the eval harness
  retrieved_context: Optional[List[Dict[str, str]]],  # Retrieved context, either from input eval data or extracted from the trace
  expected_response: Optional[str],  # The expected output as defined in the evaluation dataset
  expected_facts: Optional[List[str]],  # A list of expected facts that can be compared against the output
  guidelines: Optional[Union[List[str], Dict[str, List[str]]]],  # A list of guidelines or mapping a name of guideline to an array of guidelines for that name
  expected_retrieved_context: Optional[List[Dict[str, str]]],  # Expected context for retrieval tasks
  trace: Optional[mlflow.entities.Trace],  # The trace object containing spans and other metadata
  custom_expected: Optional[Dict[str, Any]],  # A user-defined dictionary of extra expected values
  tool_calls: Optional[List[ToolCallInvocation]],  # The tools that were called and what they returned
) -> float | bool | str | Assessment

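Explanation of arguments

request: The input provided to the agent, formatted as an arbitrary serializable object. This represents the user query or prompt.
response: The raw output from the agent, formatted as an optional arbitrary serializable object. This contains the agent's generated response for evaluation.
retrieved_context: A list of dictionaries containing the context retrieved during the task. This context can come from the input evaluation dataset or from the trace, and users can override or customize its extraction through the trace field.
expected_response: A string representing the correct or desired response for the task. It serves as the ground truth for comparison against the agent's response.
expected_facts: A list of facts expected to appear in the agent's response, useful for fact-checking tasks.
guidelines: A list of guidelines, or a mapping from a guideline name to an array of guidelines for that name. Guidelines let you place constraints on any field, which the guideline adherence judge then evaluates.
expected_retrieved_context: A list of dictionaries representing the expected retrieval context. This is essential for retrieval-augmented tasks where the correctness of the retrieved data matters.
trace: An optional MLflow Trace object containing spans, attributes, and other metadata about the agent's execution. This allows deep inspection of the internal steps taken by the agent.
custom_expected: A dictionary for passing user-defined expected values. This field gives you the flexibility to include additional custom expectations that the standard fields do not cover.
tool_calls: A list of ToolCallInvocation objects that describe which tools were called and what they returned.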
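
The examples on this page declare only the arguments they need (for example, a metric that takes just response). One way to picture signature-based calling is the sketch below. It is not the harness's actual implementation: call_with_declared_args is a hypothetical helper that uses the standard library inspect module to pass, by keyword, only the fields that a metric declares.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_declared_args(metric_fn: Callable[..., Any], row: Dict[str, Any]) -> Any:
    # Hypothetical helper: look at the metric's signature and pass only
    # the declared fields, always as keyword arguments.
    declared = inspect.signature(metric_fn).parameters
    kwargs = {name: row.get(name) for name in declared}
    return metric_fn(**kwargs)


def response_is_short(response, custom_expected):
    # This metric declares two of the available arguments.
    return len(response) <= custom_expected["max_length"]


# Hypothetical evaluation row with more fields than the metric uses.
row = {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "custom_expected": {"max_length": 100},
    "trace": None,
}
print(call_with_declared_args(response_is_short, row))  # True
```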

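Return value

The return value of the custom metric is a per-row assessment. If you return a primitive, it is wrapped in an Assessment with an empty rationale.

float: For numeric metrics (for example, similarity scores or accuracy percentages).
bool: For binary metrics.
str: For pass/fail metrics ("yes" or "no").
Assessment or list[Assessment]: A richer output type that supports adding a rationale. If you return a list of assessments, the same metric function can be reused to return multiple assessments.
- name: The name of the assessment.
- value: The value (a float, int, bool, or string).
- rationale: (Optional) A rationale explaining how this value was computed. This is useful for showing extra reasoning in the UI, for example the reasoning from an LLM that generated the assessment.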
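
The wrapping behavior described above can be pictured with the following sketch. SketchAssessment and normalize_result are hypothetical stand-ins written for illustration. They are not the mlflow.evaluation.Assessment class or the harness's own code, and they only mirror the three fields listed above.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Primitive = Union[float, int, bool, str]


@dataclass
class SketchAssessment:
    # Hypothetical stand-in with the name, value, and rationale fields.
    name: str
    value: Primitive
    rationale: Optional[str] = None


def normalize_result(metric_name: str, result) -> List[SketchAssessment]:
    # A single assessment is kept as is.
    if isinstance(result, SketchAssessment):
        return [result]
    # A list of assessments lets one metric function report several values.
    if isinstance(result, list):
        return list(result)
    # A primitive is wrapped, with no rationale attached.
    return [SketchAssessment(name=metric_name, value=result)]


print(normalize_result("response_similarity", 0.8))
print(normalize_result("llm_check", SketchAssessment("llm_check", "no", "Mentions an LLM.")))
```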

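Pass/fail metrics

Any string metric that returns "yes" or "no" is treated as a pass/fail metric and receives special treatment in the UI.

You can also create a pass/fail metric with the callable judge Python SDK. This gives you more control over which parts of the trace to evaluate and which expected fields to use. You can use any of the built-in Agent Evaluation judges. See Built-in AI judges (MLflow 2).

Ensure the retrieved context has no PII

This example calls the guideline_adherence judge to ensure that the retrieved context does not contain PII.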

Python
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "retrieved_context": [{
      "content": "The email address is noreply@databricks.com",
    }],
  }, {
    "request": "Good afternoon",
    "response": "This is actually the morning!",
    "retrieved_context": [{
      "content": "fake retrieved context",
    }],
  }
]

@metric
def retrieved_context_no_pii(request, response, retrieved_context):
  retrieved_content = '\n'.join([c['content'] for c in retrieved_context])
  return judges.guideline_adherence(
    request=request,
    # You can also pass in per-row guidelines by adding `guidelines` to the signature of your metric
    guidelines=[
      "The retrieved context must not contain personally identifiable information.",
    ],
    # `guidelines_context` requires `databricks-agents>=0.20.0`
    guidelines_context={"retrieved_context": retrieved_content},
  )

with mlflow.start_run(run_name="safety"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[retrieved_context_no_pii],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

Numeric metrics

Numeric metrics evaluate ordinal values, such as floats or integers. The UI shows numeric metrics per row, along with the average value for the evaluation run.
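
To make the per-row and average views concrete, here is a minimal sketch with a hypothetical numeric metric, length_ratio, applied to two sample rows and then averaged. The metric and the aggregation code are illustrative assumptions. In Agent Evaluation, the harness and UI compute and display these values for you.

```python
from statistics import mean


def length_ratio(response: str, expected_response: str) -> float:
    # Hypothetical numeric metric: shorter length divided by longer length.
    longer = max(len(response), len(expected_response))
    if longer == 0:
        return 1.0  # Two empty strings have identical length.
    return min(len(response), len(expected_response)) / longer


rows = [
    {"response": "Good morning to you too!",
     "expected_response": "Hello and good morning to you!"},
    {"response": "I am an LLM and I cannot answer that question.",
     "expected_response": "Good afternoon to you too!"},
]

# One value per row, plus the average across the run.
per_row = [length_ratio(r["response"], r["expected_response"]) for r in rows]
print(per_row)
print(mean(per_row))
```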

Example: response similarity

This metric measures the similarity between response and expected_response using SequenceMatcher from Python's built-in difflib library.

Python
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from difflib import SequenceMatcher

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "expected_response": "Hello and good morning to you!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question.",
    "expected_response": "Good afternoon to you too!"
  }
]

@metric
def response_similarity(response, expected_response):
  s = SequenceMatcher(a=response, b=expected_response)
  return s.ratio()

with mlflow.start_run(run_name="response_similarity"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_similarity],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

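Boolean metrics

Boolean metrics evaluate to True or False. They are useful for binary decisions, such as checking whether a response satisfies a simple heuristic. If you want the metric to receive special pass/fail treatment in the UI, see Pass/fail metrics.

Example: Check that input requests are properly formatted

This metric checks whether the arbitrary input is formatted as expected and returns True if it is.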

Python
import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": {"messages": [{"role": "user", "content": "Good morning"}]},
  }, {
    "request": {"inputs": ["Good afternoon"]},
  }, {
    "request": {"inputs": [1, 2, 3, 4]},
  }
]

@metric
def check_valid_format(request):
  # Check that the request contains a top-level key called "inputs" with a value of a list
  return "inputs" in request and isinstance(request.get("inputs"), list)

with mlflow.start_run(run_name="check_format"):
  eval_results = mlflow.evaluate(
      data=pd.DataFrame.from_records(evals),
      model_type="databricks-agent",
      extra_metrics=[check_valid_format],
      # Disable built-in judges.
      evaluator_config={
          'databricks-agent': {
              "metrics": [],
          }
      }
  )
eval_results.tables['eval_results']
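
Because the format check is ordinary Python, its logic can be exercised without the harness. The sketch below applies the same condition to the three sample requests. The function name has_inputs_list is hypothetical, and the @metric decorator is omitted so that the snippet runs on its own.

```python
requests = [
    {"messages": [{"role": "user", "content": "Good morning"}]},
    {"inputs": ["Good afternoon"]},
    {"inputs": [1, 2, 3, 4]},
]


def has_inputs_list(request: dict) -> bool:
    # True only when a top-level "inputs" key exists and holds a list.
    # dict.get returns None for a missing key, which is not a list.
    return isinstance(request.get("inputs"), list)


results = [has_inputs_list(r) for r in requests]
print(results)  # [False, True, True]
```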

Example: Language model self-reference

This metric checks whether the response mentions "LLM" and returns True if it does.

Python
import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question."
  }
]

@metric
def response_mentions_llm(response):
  return "LLM" in response

with mlflow.start_run(run_name="response_mentions_llm"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_mentions_llm],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

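Use `custom_expected`

You can use the custom_expected field to pass any other expected information to a custom metric.

Example: Response length bounded

This example shows how to require that the length of the response falls within the (min_length, max_length) bounds set for each example. Use custom_expected to store any row-level information to pass to custom metrics when creating an assessment.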

Python
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good night.",
    "custom_expected": {
      "max_length": 100,
      "min_length": 3
    }
  }, {
    "request": "What is the date?",
    "response": "12/19/2024",
    "custom_expected": {
      "min_length": 10,
      "max_length": 20,
    }
  }
]

# The custom metric uses the "min_length" and "max_length" from the "custom_expected" field.
@metric
def response_len_bounds(
  request,
  response,
  # This is the custom_expected field from your eval dataframe.
  custom_expected
):
  # Pass only when the response length falls inside the per-row bounds.
  return custom_expected["min_length"] <= len(response) <= custom_expected["max_length"]

with mlflow.start_run(run_name="response_len_bounds"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_len_bounds],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])
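
If some rows set only one of the two bounds, a metric that indexes custom_expected directly raises a KeyError. The variant below is a sketch of one way to tolerate that, under the assumption that a missing bound means "no limit". The function name and the third sample row are hypothetical.

```python
def response_len_within_bounds(response: str, custom_expected: dict) -> bool:
    # A missing bound is treated as unbounded (an assumption of this sketch).
    min_length = custom_expected.get("min_length", 0)
    max_length = custom_expected.get("max_length", float("inf"))
    return min_length <= len(response) <= max_length


rows = [
    {"response": "Good night.", "custom_expected": {"min_length": 3, "max_length": 100}},
    {"response": "12/19/2024", "custom_expected": {"min_length": 10, "max_length": 20}},
    # Hypothetical row that sets only an upper bound.
    {"response": "Good night.", "custom_expected": {"max_length": 5}},
]

for row in rows:
    print(response_len_within_bounds(row["response"], row["custom_expected"]))
```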

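Assertions over traces

Custom metrics can assess any part of an MLflow trace produced by the agent, including spans, attributes, and outputs.

Example: Request classification and routing

This example builds an agent that determines whether the user query is a question or a statement and returns that classification to the user in plain English. In a more realistic scenario, you might use this technique to route different queries to different functionality.

The evaluation set ensures that the query-type classifier produces the right results for a set of inputs by using custom metrics that inspect the MLflow trace.

This example uses MLflow Trace.search_spans to find the span that the classifier produced. The code looks the span up by name (classify_question_answer), which is the name set in the @mlflow.trace decorator on the classifier function.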

Python

import mlflow
import pandas as pd
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace
from mlflow.deployments import get_deploy_client

# This agent is a toy example that classifies the user's request as a question or a statement.
# The agent calls an LLM-backed classifier and then reports the result in natural language.

deploy_client = get_deploy_client("databricks")
ENDPOINT_NAME="databricks-meta-llama-3-3-70b-instruct"

@mlflow.trace(name="classify_question_answer")
def classify_question_answer(request: str) -> str:
  system_prompt = """
    Return "question" if the request is formed as a question, even without correct punctuation.
    Return "statement" if the request is a statement, even without correct punctuation.
    Return "unknown" otherwise.

    Do not return a preamble, only return a single word.
  """
  request = {
    "messages": [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": request},
    ],
    "temperature": .01,
    "max_tokens": 1000
  }

  result = deploy_client.predict(endpoint=ENDPOINT_NAME, inputs=request)
  return result.choices[0]['message']['content']

@mlflow.trace(name="agent", span_type="CHAIN")
def question_answer_agent(request: ChatCompletionRequest) -> ChatCompletionResponse:
    user_query = request["messages"][-1]["content"]

    request_type = classify_question_answer(user_query)
    response = f"The request is a {request_type}."

    return {
        "messages": [
            *request["messages"][:-1], # Keep the chat history.
            {"role": "user", "content": response}
        ]
    }

# Define the evaluation set with a set of requests and the expected request types for those requests.
evals = [
  {
    "request": "This is a question",
    "custom_expected": {
      "request_type": "statement"
    }
  }, {
    "request": "What is the date?",
    "custom_expected": {
      "request_type": "question"
    }
  },
]

# The custom metric checks the expected request type against the actual request type produced by the agent trace.
@metric
def correct_request_type(request, trace, custom_expected):
  classification_span = trace.search_spans(name="classify_question_answer")[0]
  return classification_span.outputs == custom_expected['request_type']

with mlflow.start_run(run_name="multiple_assessments_single_metric"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model=question_answer_agent,
        model_type="databricks-agent",
        extra_metrics=[correct_request_type],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])
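
A trace-based metric that indexes the first search result raises an IndexError when the span is missing, for instance if the classifier was never called. The sketch below shows a guarded version and tests it without running an agent. StubSpan and StubTrace are hypothetical stand-ins that imitate only the name-based lookup used above. They are not MLflow classes.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class StubSpan:
    # Hypothetical stand-in for a span: just a name and its outputs.
    name: str
    outputs: Any = None


@dataclass
class StubTrace:
    # Hypothetical stand-in for a trace with a name-based span search.
    spans: List[StubSpan] = field(default_factory=list)

    def search_spans(self, name: Optional[str] = None) -> List[StubSpan]:
        return [s for s in self.spans if name is None or s.name == name]


def correct_request_type(trace, custom_expected) -> bool:
    matches = trace.search_spans(name="classify_question_answer")
    if not matches:
        # No classifier span was recorded, so the row cannot pass.
        return False
    return matches[0].outputs == custom_expected["request_type"]


trace = StubTrace(spans=[StubSpan("agent"), StubSpan("classify_question_answer", "question")])
print(correct_request_type(trace, {"request_type": "question"}))  # True
print(correct_request_type(StubTrace(), {"request_type": "question"}))  # False
```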

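By leveraging these examples, you can design custom metrics to meet your unique evaluation needs.

Evaluating tool calls

Custom metrics are provided with tool_calls, a list of ToolCallInvocation objects that tell you which tools were called and what they returned.

Example: Asserting the right tool is called

Note

This example is not copy-pastable because it does not define the LangGraph agent. See the attached notebook for the fully runnable example.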

Python
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

eval_data = pd.DataFrame(
  [
    {
      "request": "what is 3 * 12?",
      "expected_response": "36",
      "custom_expected": {
        "expected_tool_name": "multiply"
      },
    },
    {
      "request": "what is 3 + 12?",
      "expected_response": "15",
      "custom_expected": {
        "expected_tool_name": "add"
      },
    },
  ]
)

@metric
def is_correct_tool(tool_calls, custom_expected):
  # Metric to check whether the first tool call is the expected tool
  return tool_calls[0].tool_name == custom_expected["expected_tool_name"]

@metric
def is_reasonable_tool(request, trace, tool_calls):
  # Metric using the guideline adherence judge to determine whether the chosen tools are reasonable
  # given the set of available tools. Note that `guidelines_context` requires `databricks-agents >= 0.20.0`

  return judges.guideline_adherence(
    request=request["messages"][0]["content"],
    guidelines=[
      "The selected tool must be a reasonable tool call with respect to the request and available tools.",
    ],
    # `guidelines_context` requires `databricks-agents>=0.20.0`
    guidelines_context={
      "available_tools": str(tool_calls[0].available_tools),
      "chosen_tools": str([tool_call.tool_name for tool_call in tool_calls]),
    },
  )

results = mlflow.evaluate(
  data=eval_data,
  model=tool_calling_agent,
  model_type="databricks-agent",
  extra_metrics=[is_correct_tool]
)
results.tables["eval_results"].display()
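
The metric above inspects only the first tool call and fails with an IndexError when no tool was called. As an alternative sketch, the check below passes when the expected tool appears anywhere in the call list and simply returns False for an empty list. StubToolCall is a hypothetical stand-in that carries only the tool_name attribute used on this page, and expected_tool_was_called is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StubToolCall:
    # Hypothetical stand-in exposing only the tool_name attribute.
    tool_name: str


def expected_tool_was_called(tool_calls: List[StubToolCall], custom_expected: dict) -> bool:
    # any() over an empty list is False, so no explicit guard is needed.
    expected = custom_expected["expected_tool_name"]
    return any(call.tool_name == expected for call in tool_calls or [])


calls = [StubToolCall("add"), StubToolCall("multiply")]
print(expected_tool_was_called(calls, {"expected_tool_name": "multiply"}))  # True
print(expected_tool_was_called([], {"expected_tool_name": "multiply"}))  # False
```

Whether "first call" or "any call" is the right rule depends on the agent. A router that must pick the correct tool immediately is better served by the stricter first-call check.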

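Develop custom metrics

As you develop metrics, you need to iterate on them quickly without running the agent every time you make a change. To make this simpler, use the following strategy:

1. Generate an answer sheet from the evaluation dataset and agent. This runs the agent for each entry in the evaluation set and produces responses and traces that you can use to call the metric directly.
2. Define the metric.
3. Call the metric directly for each value in the answer sheet and iterate on the metric definition.
4. When the metric behaves as you expect, run mlflow.evaluate() on the same answer sheet to verify that the Agent Evaluation results are what you expect. The code in this example does not use the model= field, so the evaluation uses the precomputed responses.
5. When you are satisfied with the metric's performance, enable the model= field in mlflow.evaluate() to call the agent interactively.

Python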
import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace

evals = [
  {
    "request": "What is Databricks?",
    "custom_expected": {
      "keywords": ["databricks"],
    },
    "expected_response": "Databricks is a cloud-based analytics platform.",
    "expected_facts": ["Databricks is a cloud-based analytics platform."],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "When was Databricks founded?",
    "custom_expected": {
      "keywords": ["when", "databricks", "founded"]
    },
    "expected_response": "Databricks was founded in 2012",
    "expected_facts": ["Databricks was founded in 2012"],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "How do I convert a timestamp_ms to a timestamp in dbsql?",
    "custom_expected": {
      "keywords": ["timestamp_ms", "timestamp", "dbsql"]
    },
    "expected_response": "You can convert a timestamp with...",
    "expected_facts": ["You can convert a timestamp with..."],
    "expected_retrieved_context": [{"content": "You can convert a timestamp with...", "doc_uri": "https://databricks.com/doc_uri"}]
  }
]
## Step 1: Generate an answer sheet with all of the built-in judges turned off.
## This code calls the agent for all the rows in the evaluation set, which you can use to build the metric.
answer_sheet_df = mlflow.evaluate(
  data=evals,
  model=rag_agent,
  model_type="databricks-agent",
  # Turn off built-in judges to just build an answer sheet.
  evaluator_config={"databricks-agent": {"metrics": []}}
).tables['eval_results']
display(answer_sheet_df)

answer_sheet = answer_sheet_df.to_dict(orient='records')

## Step 2: Define the metric.
@metric
def custom_metric_consistency(
  request,
  response,
  retrieved_context,
  expected_response,
  expected_facts,
  expected_retrieved_context,
  trace,
  # This is the custom_expected field from your eval dataframe.
  custom_expected
):
  print(f"[custom_metric] request: {request}")
  print(f"[custom_metric] response: {response}")
  print(f"[custom_metric] retrieved_context: {retrieved_context}")
  print(f"[custom_metric] expected_response: {expected_response}")
  print(f"[custom_metric] expected_facts: {expected_facts}")
  print(f"[custom_metric] expected_retrieved_context: {expected_retrieved_context}")
  print(f"[custom_metric] trace: {trace}")

  return True

## Step 3: Call the metric directly before using the evaluation harness to iterate on the metric definition.
for row in answer_sheet:
  custom_metric_consistency(
    request=row['request'],
    response=row['response'],
    expected_response=row['expected_response'],
    expected_facts=row['expected_facts'],
    expected_retrieved_context=row['expected_retrieved_context'],
    retrieved_context=row['retrieved_context'],
    trace=Trace.from_json(row['trace']),
    custom_expected=row['custom_expected']
  )

## Step 4: After you are confident in the signature of the metric, you can run the harness with the answer sheet to trigger the output validation and make sure the UI reflects what you intended.
with mlflow.start_run(run_name="exact_expected_response"):
    eval_results = mlflow.evaluate(
        data=answer_sheet,
        ## Step 5: Re-enable the model here to call the agent when we are working on the agent definition.
        # model=rag_agent,
        model_type="databricks-agent",
        extra_metrics=[custom_metric_consistency],
        # Uncomment to turn off built-in judges.
        # evaluator_config={
        #     'databricks-agent': {
        #         "metrics": [],
        #     }
        # }
    )
    display(eval_results.tables['eval_results'])
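
The placeholder metric above always returns True. As a sketch of what an iteration on it might look like, the function below scores how many of the row's custom_expected keywords appear in the response and is called directly on precomputed rows, in the spirit of Step 3. The keyword_coverage name, the scoring rule, and the sample responses are assumptions made for illustration. Real rows would come from the answer sheet.

```python
def keyword_coverage(response: str, custom_expected: dict) -> float:
    # Fraction of expected keywords found in the response, ignoring case.
    keywords = custom_expected.get("keywords", [])
    if not keywords:
        return 1.0  # Nothing was expected, so nothing is missing.
    text = response.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


# Hypothetical precomputed rows standing in for the answer sheet.
answer_sheet = [
    {"response": "Databricks is a cloud-based analytics platform.",
     "custom_expected": {"keywords": ["databricks"]}},
    {"response": "It was founded in 2012.",
     "custom_expected": {"keywords": ["when", "databricks", "founded"]}},
]

for row in answer_sheet:
    print(keyword_coverage(row["response"], row["custom_expected"]))
```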

Example notebook

The following example notebook illustrates several different ways to use custom metrics in Agent Evaluation.

Agent Evaluation custom metrics example notebook
