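Evaluate and compare prompt versions

Note

Beta

This feature is in Beta.

This guide shows how to systematically evaluate different prompt versions to identify the most effective one for your agents and GenAI applications. You'll learn how to create prompt versions, build an evaluation dataset with expected facts, and compare performance using MLflow's evaluation framework.

All of the code on this page is included in the example notebook.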

Prerequisites

This guide requires:

MLflow 3.1.0 or later.
OpenAI API access or Databricks Model Serving.
A Unity Catalog schema with CREATE FUNCTION, EXECUTE, and MANAGE privileges.

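Step 1: Configure your environment

Note

To create prompts and evaluation datasets, you need CREATE FUNCTION, EXECUTE, and MANAGE privileges on both the catalog and the schema.

First, set up your Unity Catalog schema and install the required packages: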

Python
# Install required packages
%pip install --upgrade "mlflow[databricks]>=3.1.0" openai
dbutils.library.restartPython()

# Configure your Unity Catalog schema
import mlflow
import pandas as pd
from openai import OpenAI
import uuid

CATALOG = "main"        # Replace with your catalog name
SCHEMA = "default"      # Replace with your schema name

# Create unique names for the prompt and dataset
SUFFIX = uuid.uuid4().hex[:8]  # Short unique suffix
PROMPT_NAME = f"{CATALOG}.{SCHEMA}.summary_prompt_{SUFFIX}"
EVAL_DATASET_NAME = f"{CATALOG}.{SCHEMA}.summary_eval_{SUFFIX}"

print(f"Prompt name: {PROMPT_NAME}")
print(f"Evaluation dataset: {EVAL_DATASET_NAME}")

# Set up OpenAI client
client = OpenAI()
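
The prompt and dataset names above use Unity Catalog's three-level `catalog.schema.object` form, with a random suffix so that repeated runs of the notebook do not collide. As a minimal sketch of that naming scheme (the helper `build_uc_name` is hypothetical and not part of MLflow):

```python
import uuid
from typing import Optional


def build_uc_name(catalog: str, schema: str, base: str, suffix: Optional[str] = None) -> str:
    """Hypothetical helper: build a three-level name with a short unique suffix."""
    # Default to the first 8 hex characters of a random UUID, as in the setup code
    suffix = suffix or uuid.uuid4().hex[:8]
    return f"{catalog}.{schema}.{base}_{suffix}"


name = build_uc_name("main", "default", "summary_prompt")
catalog, schema, obj = name.split(".")
print(catalog, schema, obj)
```

Passing an explicit `suffix` makes the name reproducible, which can help when you want to reuse a prompt or dataset created in an earlier session.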

Step 2: Create prompt versions

Register different prompt versions that represent different approaches to your task:

Python
# Version 1: Basic prompt
prompt_v1 = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template="Summarize this text: {{content}}",
    commit_message="v1: Basic summarization prompt"
)

print(f"Created prompt version {prompt_v1.version}")

# Version 2: Improved with comprehensive guidelines
prompt_v2 = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template="""You are an expert summarizer. Create a summary of the following content in *exactly* 2 sentences (no more, no less - be very careful about the number of sentences).

Guidelines:
- Include ALL core facts and key findings
- Use clear, concise language
- Maintain factual accuracy
- Cover all main points mentioned
- Write for a general audience
- Use exactly 2 sentences

Content: {{content}}

Summary:""",
    commit_message="v2: Added comprehensive fact coverage with 2-sentence requirement"
)

print(f"Created prompt version {prompt_v2.version}")
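
Both templates use a double-brace `{{content}}` placeholder that is filled in when the prompt is formatted. The following is a rough illustration of that substitution in plain Python. The `render_template` function is a hypothetical stand-in for explanation only, not MLflow's implementation:

```python
import re


def render_template(template: str, **variables: str) -> str:
    """Hypothetical stand-in that fills {{name}} placeholders in a template."""

    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"Missing template variable: {key}")
        return variables[key]

    # Match {{ name }} with optional whitespace inside the braces
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)


rendered = render_template(
    "Summarize this text: {{content}}",
    content="MLflow tracks prompts.",
)
print(rendered)
```

Raising on a missing variable, rather than leaving the placeholder in place, makes a misspelled variable name visible before the text reaches the model.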

Step 3: Create an evaluation dataset

Build a dataset of inputs paired with the expected facts that a good summary should contain:

Python
# Create evaluation dataset
eval_dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name=EVAL_DATASET_NAME
)

# Add summarization examples with expected facts
evaluation_examples = [
    {
        "inputs": {
            "content": """Remote work has fundamentally changed how teams collaborate and communicate. Companies have adopted new digital tools for video conferencing, project management, and file sharing. While productivity has remained stable or increased in many cases, challenges include maintaining company culture, ensuring work-life balance, and managing distributed teams across time zones. The shift has also accelerated digital transformation initiatives and changed hiring practices, with many companies now recruiting talent globally rather than locally."""
        },
        "expectations": {
            "expected_facts": [
                "remote work changed collaboration",
                "digital tools adoption",
                "productivity remained stable",
                "challenges with company culture",
                "work-life balance issues",
                "global talent recruitment"
            ]
        }
    },
    {
        "inputs": {
            "content": """Electric vehicles are gaining mainstream adoption as battery technology improves and charging infrastructure expands. Major automakers have committed to electrification with new models launching regularly. Government incentives and environmental regulations are driving consumer interest. However, challenges remain including higher upfront costs, limited charging stations in rural areas, and concerns about battery life and replacement costs. The market is expected to grow significantly over the next decade."""
        },
        "expectations": {
            "expected_facts": [
                "electric vehicles gaining adoption",
                "battery technology improving",
                "charging infrastructure expanding",
                "government incentives",
                "higher upfront costs",
                "limited rural charging",
                "market growth expected"
            ]
        }
    },
    {
        "inputs": {
            "content": """Artificial intelligence is transforming healthcare through diagnostic imaging, drug discovery, and personalized treatment plans. Machine learning algorithms can now detect diseases earlier and more accurately than traditional methods. AI-powered robots assist in surgery and patient care. However, concerns exist about data privacy, algorithm bias, and the need for regulatory oversight. Healthcare providers must balance innovation with patient safety and ethical considerations."""
        },
        "expectations": {
            "expected_facts": [
                "AI transforming healthcare",
                "diagnostic imaging improvements",
                "drug discovery acceleration",
                "personalized treatment",
                "earlier disease detection",
                "data privacy concerns",
                "algorithm bias issues",
                "regulatory oversight needed"
            ]
        }
    }
]

eval_dataset = eval_dataset.merge_records(evaluation_examples)
print(f"Added {len(evaluation_examples)} summarization examples to evaluation dataset")
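
Each record above pairs an `inputs` dictionary with an `expectations` dictionary that holds `expected_facts`. Before merging a larger set of records, it may help to check that shape locally. This is a minimal sketch, assuming the same record layout as the examples above. The `validate_examples` helper is hypothetical:

```python
def validate_examples(examples: list[dict]) -> list[str]:
    """Hypothetical check: return a list of problems found in evaluation records."""
    problems = []
    for i, example in enumerate(examples):
        content = example.get("inputs", {}).get("content")
        facts = example.get("expectations", {}).get("expected_facts")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"record {i}: 'inputs.content' must be a non-empty string")
        if not isinstance(facts, list) or not facts:
            problems.append(f"record {i}: 'expectations.expected_facts' must be a non-empty list")
    return problems


sample = [
    {
        "inputs": {"content": "EVs are gaining adoption."},
        "expectations": {"expected_facts": ["electric vehicles gaining adoption"]},
    },
    {"inputs": {"content": ""}, "expectations": {}},  # deliberately malformed
]
problems = validate_examples(sample)
for problem in problems:
    print(problem)
```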

Step 4: Create an evaluation function and custom metrics

Define a function that uses a given prompt version, then create custom evaluation metrics:

Python
def create_summary_function(prompt_name: str, version: int):
    """Create a summarization function for a specific prompt version."""

    @mlflow.trace
    def summarize_content(content: str) -> dict:
        # Load the prompt version
        prompt = mlflow.genai.load_prompt(
            name_or_uri=f"prompts:/{prompt_name}/{version}"
        )

        # Format and call the LLM
        formatted_prompt = prompt.format(content=content)

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": formatted_prompt}],
            temperature=0.1
        )

        return {"summary": response.choices[0].message.content}

    return summarize_content
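
The factory above returns a separate function for each prompt version, so the same evaluation loop can be pointed at version 1 or version 2. The inner function's parameter is named `content` to match the key used in each record's `inputs`. The sketch below mirrors that wiring without any API call. `TEMPLATES`, `make_summarizer`, and `fake_llm` are hypothetical stand-ins for the prompt registry and the OpenAI client:

```python
from typing import Callable

# Hypothetical in-memory stand-in for the prompt registry
TEMPLATES = {
    1: "Summarize this text: {content}",
    2: "Summarize in exactly 2 sentences: {content}",
}


def make_summarizer(version: int, llm: Callable[[str], str]) -> Callable[[str], dict]:
    """Return a function bound to one prompt version, mirroring the factory above."""

    def summarize_content(content: str) -> dict:
        formatted = TEMPLATES[version].format(content=content)
        return {"summary": llm(formatted)}

    return summarize_content


def fake_llm(prompt: str) -> str:
    # Stub model: echoes the prompt so the wiring can be checked offline
    return f"[stub] {prompt}"


summarize_v1 = make_summarizer(1, fake_llm)
summarize_v2 = make_summarizer(2, fake_llm)
print(summarize_v1(content="Remote work changed collaboration."))
print(summarize_v2(content="Remote work changed collaboration."))
```

A stub like this can serve as a quick smoke test that each version produces a dictionary with a `summary` key before spending tokens on a real evaluation run.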

Create a judge with a custom prompt

Create a judge with a custom prompt to evaluate specific criteria:

Python
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer

# Create a custom prompt judge
sentence_count_judge = custom_prompt_judge(
    name="sentence_count_compliance",
    prompt_template="""Evaluate if this summary follows the 2-sentence requirement:

Summary: {{summary}}

Count the sentences carefully and choose the appropriate rating:

[[correct]]: Exactly 2 sentences - follows instructions correctly
[[incorrect]]: Not exactly 2 sentences - does not follow instructions""",
    numeric_values={
        "correct": 1.0,
        "incorrect": 0.0
    }
)

# Wrap the judge in a scorer
@scorer
def sentence_compliance_scorer(inputs, outputs, trace) -> bool:
    """Custom scorer that evaluates sentence count compliance."""
    result = sentence_count_judge(summary=outputs.get("summary", ""))
    return result.value == 1.0  # Convert to boolean
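
An LLM judge can occasionally miscount sentences, so a deterministic check may be useful as a cross-check on its ratings. The following is a rough heuristic and not part of MLflow. The `count_sentences` and `is_two_sentences` helpers are hypothetical, and the split rule will miscount text containing abbreviations such as "e.g." or decimals followed by a space:

```python
import re


def count_sentences(text: str) -> int:
    """Rough heuristic: count chunks separated by '.', '!' or '?' plus whitespace."""
    # Split on whitespace that directly follows terminal punctuation
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def is_two_sentences(summary: str) -> bool:
    """Return True when the heuristic finds exactly two sentences."""
    return count_sentences(summary) == 2


print(is_two_sentences("Remote work changed collaboration. Teams adopted new tools."))
print(is_two_sentences("Only one sentence here."))
```

A function like this could be wrapped with `@scorer` in the same way as the judge above, giving a second compliance metric to compare against the judge's verdicts.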

Step 5: Run the comparative evaluation

Evaluate each prompt version using both the built-in and the custom scorers:

Python
from mlflow.genai.scorers import Correctness

# Define scorers
scorers = [
    Correctness(),  # Checks expected facts
    sentence_compliance_scorer,  # Custom sentence count metric
]

# Evaluate each version
results = {}

for version in [1, 2]:
    print(f"\nEvaluating version {version}...")

    with mlflow.start_run(run_name=f"summary_v{version}_eval"):
        mlflow.log_param("prompt_version", version)

        # Run evaluation
        eval_results = mlflow.genai.evaluate(
            predict_fn=create_summary_function(PROMPT_NAME, version),
            data=eval_dataset,
            scorers=scorers,
        )

        results[f"v{version}"] = eval_results
        print(f"  Correctness score: {eval_results.metrics.get('correctness/mean', 0):.2f}")
        print(f"  Sentence compliance: {eval_results.metrics.get('sentence_compliance_scorer/mean', 0):.2f}")
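
The loop above reads metrics with `.get(key, 0)`, which reports 0 when a key is missing, for example because a scorer name was misspelled. If you would rather fail loudly, a small wrapper can do so. This is a sketch with a hypothetical `require_metric` helper, and the metric values shown are made up for illustration:

```python
def require_metric(metrics: dict, key: str) -> float:
    """Hypothetical helper: raise instead of defaulting a missing metric to 0."""
    if key not in metrics:
        available = ", ".join(sorted(metrics)) or "(none)"
        raise KeyError(f"Metric '{key}' not found. Available: {available}")
    return float(metrics[key])


# Illustrative values only, not real evaluation output
example_metrics = {
    "correctness/mean": 0.67,
    "sentence_compliance_scorer/mean": 1.0,
}
print(require_metric(example_metrics, "correctness/mean"))
```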

Step 6: Compare results and select the best version

Analyze the results to identify the best-performing prompt:

Python
# Compare versions across all metrics
print("=== Version Comparison ===")
for version, result in results.items():
    correctness_score = result.metrics.get('correctness/mean', 0)
    compliance_score = result.metrics.get('sentence_compliance_scorer/mean', 0)
    print(f"{version}:")
    print(f"  Correctness: {correctness_score:.2f}")
    print(f"  Sentence compliance: {compliance_score:.2f}")
    print()

# Calculate composite scores
print("=== Composite Scores ===")
composite_scores = {}
for version, result in results.items():
    correctness = result.metrics.get('correctness/mean', 0)
    compliance = result.metrics.get('sentence_compliance_scorer/mean', 0)
    # Weight correctness more heavily (70%) than compliance (30%)
    composite = 0.7 * correctness + 0.3 * compliance
    composite_scores[version] = composite
    print(f"{version}: {composite:.2f}")

# Find best version
best_version = max(composite_scores.items(), key=lambda x: x[1])
print(f"\nBest performing version: {best_version[0]} (score: {best_version[1]:.2f})")

# Show why this version is best
best_results = results[best_version[0]]
print(f"\nWhy {best_version[0]} is best:")
print(f"- Captures {best_results.metrics.get('correctness/mean', 0):.0%} of expected facts")
print(f"- Follows sentence requirements {best_results.metrics.get('sentence_compliance_scorer/mean', 0):.0%} of the time")
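
The composite score above is a weighted average, 0.7 times correctness plus 0.3 times sentence compliance. A worked example with made-up scores shows how the weighting can change the ranking. The numbers below are hypothetical and do not come from a real evaluation run:

```python
def composite_score(
    correctness: float,
    compliance: float,
    w_correctness: float = 0.7,
    w_compliance: float = 0.3,
) -> float:
    """Weighted average of the two metrics, using the weights from the code above."""
    return w_correctness * correctness + w_compliance * compliance


# Made-up (correctness, compliance) pairs for illustration
hypothetical = {"v1": (0.90, 0.00), "v2": (0.80, 1.00)}
scores = {v: composite_score(c, s) for v, (c, s) in hypothetical.items()}
best = max(scores, key=scores.get)
print({v: round(s, 2) for v, s in scores.items()}, "best:", best)
```

With these inputs, v1 has the higher correctness but never meets the sentence requirement, so its composite is 0.63 against 0.86 for v2. The weights are a judgment call; adjust them to reflect how much each criterion matters for your application.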

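Once you've identified the best-performing prompt version through evaluation, you're ready to deploy it. To learn how to use aliases for production deployments, see Use prompts in deployed apps.

Example notebook

For a complete working example, see the following notebook.

Evaluating GenAI apps quickstart notebook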

Next steps

Track prompts with app versions - Link evaluated prompt versions to your application versions
Use prompts in deployed apps - Deploy your best-performing prompt using aliases
Create custom scorers - Build domain-specific evaluation metrics
