Migrating from Agent Evaluation to MLflow 3: Quick reference

This quick reference summarizes the key changes when migrating from Agent Evaluation and MLflow 2 to the improved APIs in MLflow 3. For step-by-step instructions, see the full guide to migrating from Agent Evaluation to MLflow 3.

Import updates

Python
### Old imports ###
from mlflow import evaluate
from databricks.agents.evals import metric
from databricks.agents.evals import judges

from databricks.agents import review_app

### New imports ###
from mlflow.genai import evaluate
from mlflow.genai.scorers import scorer
from mlflow.genai import judges
# For predefined scorers:
from mlflow.genai.scorers import (
    Correctness, Guidelines, ExpectationsGuidelines,
    RelevanceToQuery, Safety, RetrievalGroundedness,
    RetrievalRelevance, RetrievalSufficiency
)

import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

Evaluation function

MLflow 2.x	MLflow 3.x
`mlflow.evaluate()`	`mlflow.genai.evaluate()`
`model=my_agent`	`predict_fn=my_agent`
`model_type="databricks-agent"`	(Not needed)
`extra_metrics=[...]`	`scorers=[...]`
`evaluator_config={...}`	(Configured in the scorers)
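
The table above can be illustrated with a minimal sketch of a migrated call. The agent `my_agent`, the dataset `eval_data`, and the wrapper `run_evaluation` are hypothetical names introduced for this example. The choice of `Correctness()` and `Safety()` is arbitrary, and actually running the evaluation assumes MLflow 3 is installed and LLM judge access is configured.

```python
def my_agent(request: str) -> str:
    """Hypothetical stand-in for a real agent; it simply echoes the request."""
    return f"Echo: {request}"


# Hypothetical evaluation data in the MLflow 3.x shape:
# `inputs` holds the keyword arguments passed to predict_fn,
# `expectations` holds the ground truth.
eval_data = [
    {
        "inputs": {"request": "What is MLflow?"},
        "expectations": {
            "expected_response": "MLflow is an open source platform for the ML lifecycle."
        },
    },
]


def run_evaluation():
    # Imported lazily so this sketch can be loaded without MLflow 3 installed.
    import mlflow
    from mlflow.genai.scorers import Correctness, Safety

    # MLflow 2.x equivalent, for comparison:
    #   mlflow.evaluate(data=..., model=my_agent, model_type="databricks-agent")
    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=my_agent,               # replaces model=my_agent
        scorers=[Correctness(), Safety()],  # replaces extra_metrics / automatic judges
    )
```

Calling `run_evaluation()` would start the evaluation. The `model_type` argument has no counterpart in the new call.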

Judge selection

MLflow 2.x	MLflow 3.x
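Automatically runs all applicable judges based on the data	You must explicitly specify which scorers to use
Use `evaluator_config` to limit judges	Pass the desired scorers in the `scorers` parameter
`global_guidelines` in the config	Use the `Guidelines()` scorer
Judges are chosen based on the available data fields	You control exactly which scorers run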

Data fields

MLflow 2.x field	MLflow 3.x field	Description
`request`	`inputs`	Agent input
`response`	`outputs`	Agent output
`expected_response`	`expectations`	Ground truth
`retrieved_context`	Accessed through traces	Context taken from the trace
`guidelines`	Part of the scorer configuration	Moved to the scorer level
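
As a rough sketch of the field mapping above, the helper below converts one MLflow 2.x style row into the MLflow 3.x shape. The function name `convert_row` and its `input_key` parameter are hypothetical. The sketch assumes that `inputs` is a dict keyed by the argument names of your `predict_fn`, and that the old `expected_response` value is kept under the same key inside the `expectations` dict.

```python
def convert_row(old_row: dict, input_key: str = "request") -> dict:
    """Convert one MLflow 2.x evaluation row to the MLflow 3.x field names.

    `retrieved_context` and `guidelines` are intentionally dropped: per the
    table above, context is read from traces and guidelines move into the
    scorer configuration.
    """
    # `request` becomes `inputs`, wrapped in a dict (assumed shape).
    new_row = {"inputs": {input_key: old_row["request"]}}

    # `response` becomes `outputs` when a precomputed answer is present.
    if "response" in old_row:
        new_row["outputs"] = old_row["response"]

    # `expected_response` moves into the single `expectations` dict.
    if "expected_response" in old_row:
        new_row["expectations"] = {
            "expected_response": old_row["expected_response"]
        }
    return new_row
```

Applying `convert_row` to every row of an existing evaluation set yields a list that uses the new field names. Review the dropped fields by hand.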

Custom metrics and scorers

MLflow 2.x	MLflow 3.x	Notes
`@metric` decorator	`@scorer` decorator	New name
`def my_metric(request, response, ...)`	`def my_scorer(inputs, outputs, expectations, trace)`	Simplified
Multiple expected_* parameters	A single `expectations` parameter that is a dict	Consolidated
`custom_expected`	Part of the `expectations` dict	Simplified
`request` parameter	`inputs` parameter	Consistent naming
`response` parameter	`outputs` parameter	Consistent naming
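
To make the signature change concrete, here is a minimal sketch of a custom scorer. The names `exact_match` and `as_mlflow_scorer` are hypothetical, and the sketch assumes the ground truth is stored under an `expected_response` key in `expectations`. The scoring logic is a plain function so that it can be tried without MLflow. Wrapping it with `scorer(...)` is intended to have the same effect as decorating it with `@scorer`.

```python
def exact_match(outputs, expectations) -> bool:
    """Return True when the agent output equals the expected response.

    A scorer only needs to declare the arguments it uses; here that is
    `outputs` and `expectations` (`inputs` and `trace` are omitted).
    """
    expected = expectations.get("expected_response")
    return expected is not None and str(outputs).strip() == str(expected).strip()


def as_mlflow_scorer():
    # Imported lazily so this sketch can be loaded without MLflow 3 installed.
    from mlflow.genai.scorers import scorer

    # Equivalent to placing @scorer above the function definition.
    return scorer(exact_match)
```

The object returned by `as_mlflow_scorer()` can then be passed in the `scorers=[...]` list of `mlflow.genai.evaluate()`.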

Accessing results

MLflow 2.x	MLflow 3.x
`results.tables['eval_results']`	`mlflow.search_traces(run_id=results.run_id)`
Direct DataFrame access	Iterate over traces and their assessments

LLM judges

Use case	MLflow 2.x	MLflow 3.x recommended
Basic correctness check	`judges.correctness()` in `@metric`	`Correctness()` scorer or `judges.is_correct()` judge
Safety evaluation	`judges.safety()` in `@metric`	`Safety()` scorer or `judges.is_safe()` judge
Global guidelines	`judges.guideline_adherence()`	`Guidelines()` scorer or `judges.meets_guidelines()` judge
Per-row guidelines in the evaluation set	`judges.guideline_adherence()` with expected_*	`ExpectationsGuidelines()` scorer or `judges.meets_guidelines()` judge
Check for factual support	`judges.groundedness()`	`judges.is_grounded()` or `RetrievalGroundedness()` scorer
Check context relevance	`judges.relevance_to_query()`	`judges.is_context_relevant()` or `RelevanceToQuery()` scorer
Check context chunk relevance	`judges.chunk_relevance()`	`judges.is_context_relevant()` or `RetrievalRelevance()` scorer
Check context completeness	`judges.context_sufficiency()`	`judges.is_context_sufficient()` or `RetrievalSufficiency()` scorer
Complex custom logic	Direct judge calls in `@metric`	Predefined scorers, or `@scorer` with judge calls

Human feedback

MLflow 2.x	MLflow 3.x
`databricks.agents.review_app`	`mlflow.genai.labeling`
`databricks.agents.datasets`	`mlflow.genai.datasets`
`review_app.label_schemas.*`	`mlflow.genai.label_schemas.*`
`app.create_labeling_session()`	`labeling.create_labeling_session()`

Common migration commands

Bash
# Find old evaluate calls
grep -r "mlflow.evaluate" . --include="*.py"

# Find old metric decorators
grep -r "@metric" . --include="*.py"

# Find old data fields
grep -r '"request":\|"response":\|"expected_response":' . --include="*.py"

# Find old imports
grep -r "databricks.agents" . --include="*.py"
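
Where `grep` is not available, the same searches can be approximated in Python. The sketch below uses only the standard library. The names `LEGACY_PATTERNS` and `find_legacy_usages` are hypothetical, and the regular expressions are slightly stricter than the grep patterns above because the dots are escaped.

```python
import re
from pathlib import Path

# One pattern per grep command above.
LEGACY_PATTERNS = {
    "old evaluate call": re.compile(r"mlflow\.evaluate\b"),
    "old metric decorator": re.compile(r"@metric\b"),
    "old data field": re.compile(r'"(request|response|expected_response)":'),
    "old import": re.compile(r"databricks\.agents"),
}


def find_legacy_usages(root):
    """Return (path, line number, label) for each legacy pattern found under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for label, pattern in LEGACY_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), lineno, label))
    return hits
```

For example, `for path, lineno, label in find_legacy_usages("."): print(path, lineno, label)` lists every match in the current project.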

Additional resources

For further help during migration, see the MLflow documentation or contact the Databricks support team.
