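Create guidelines LLM judges

Guidelines LLM judges evaluate GenAI outputs against pass/fail natural-language criteria. They excel at evaluating:

Compliance: "Must not include pricing information"
Style/tone: "Maintain a professional, empathetic tone"
Requirements: "Must include specific disclaimers"
Accuracy: "Use only facts from the provided context"

For detailed documentation and additional examples, see the MLflow Guidelines judge documentation.

Guidelines LLM judges offer the following benefits:

Business-friendly: Domain experts write criteria without coding
Flexible: Update criteria without code changes
Interpretable: Clear pass/fail conditions
Fast iteration: Rapidly test new criteria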

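How to use guidelines judges

MLflow provides the following guidelines judges:

Built-in Guidelines() judge: Applies global guidelines uniformly to all rows. Evaluates app inputs and outputs only. Works in both offline evaluation and production monitoring.
Built-in ExpectationsGuidelines() judge: Applies per-row guidelines that domain experts have labeled in an evaluation dataset. Evaluates app inputs and outputs only. Works in offline evaluation only.

For API details, see the MLflow documentation.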

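How guidelines work

Guidelines judges use a specially tuned LLM to evaluate whether text meets the specified criteria. The judge:

Receives context: Any JSON dictionary containing the data to evaluate (for example, request, response, retrieved_documents, user_preferences). You can reference these keys by name directly in your guidelines. See Reference context variables.
Applies guidelines: Natural-language rules that define pass/fail conditions.
Makes a judgment: Returns a binary pass/fail score with a detailed rationale.

For details about the models that power LLM judges, see Information about the models powering LLM judges.

You can change the judge model with the model argument in the judge definition. Specify the model in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.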
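
As a minimal sketch of the <provider>:/<model-name> format described above, the following helper splits a judge model string into its two parts and rejects strings that do not match. The function name split_judge_model is hypothetical and is not part of MLflow; it only illustrates the expected shape of the string.

```python
def split_judge_model(model: str) -> tuple[str, str]:
    """Split '<provider>:/<model-name>' into (provider, model_name).

    Hypothetical helper for illustration only; not an MLflow API.
    """
    provider, sep, model_name = model.partition(":/")
    if not sep or not provider or not model_name:
        raise ValueError(f"Expected '<provider>:/<model-name>', got {model!r}")
    return provider, model_name


# With the databricks provider, the second part is the serving endpoint name.
provider, endpoint = split_judge_model("databricks:/databricks-gpt-oss-120b")
print(provider, endpoint)
```

A check like this can catch a missing ":/" separator before the string is passed to a judge.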

Prerequisites for running the examples

Install MLflow and the required packages.

Python
%pip install --upgrade "mlflow[databricks]>=3.4.0"
dbutils.library.restartPython()

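To create an MLflow experiment, follow the quickstart for setting up your environment.

`Guidelines()` judge: Global guidelines

The Guidelines judge applies uniform guidelines to every row in an evaluation or every trace in production monitoring. It automatically extracts the request and response data from the trace and evaluates it against your guidelines.

In guidelines, refer to the app's inputs as request and the app's outputs as response. The following code creates a few simple guidelines.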

Python
from mlflow.genai.scorers import Guidelines
import mlflow

# Example data
data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": {"response": "The capital of France is Paris."}
    },
    {
        "inputs": {"question": "What is the capital of Germany?"},
        "outputs": {"response": "The capital of Germany is Berlin."}
    }
]

# Create scorers with global guidelines
english = Guidelines(
    name="english",
    guidelines=["The response must be in English"]
)

clarity = Guidelines(
    name="clarity",
    guidelines=["The response must be clear, coherent, and concise"],
    model="databricks:/databricks-gpt-oss-120b",  # Optional custom judge model
)

# Evaluate with global guidelines
results = mlflow.genai.evaluate(
    data=data,
    scorers=[english, clarity]
)

The following example creates an evaluation dataset with sample inputs and outputs. It then defines and runs a guideline that judges the tone of the response.

Python
from mlflow.genai.scorers import Guidelines
import mlflow


# Create evaluation dataset with pre-computed outputs
eval_dataset = [
    {
        "inputs": {
            "messages": [{"role": "user", "content": "My order hasn't arrived yet"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I understand your concern about the delayed order. Let me help you track it right away."
                }
            }]
        },
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "How do I reset my password?"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "To reset your password, click 'Forgot Password' on the login page. You'll receive an email within 5 minutes."
                }
            }]
        },
    }
]

# Run evaluation on existing outputs
results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Guidelines(
            name="tone",
            guidelines="The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns in the request",
        )
    ]
)

Parameters

Parameter	Type	Required	Description
`name`	`str`	Yes	Name of the judge, displayed in evaluation results
`guidelines`	`str | list[str]`	Yes	Guidelines to apply uniformly to all rows
`model`	`str`	No	Custom judge model

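How the guidelines judge parses app inputs and outputs

The guidelines judge automatically extracts data from the trace and builds the guideline context using the keys request and response.

Request

The request field is extracted from the provided inputs.

If inputs contains a messages key with an array of OpenAI-format chat messages:
- If there is only one message, request is the content of that message.
- If there are multiple messages, request is the entire messages array serialized to a JSON string.
Otherwise, request is the entire inputs dict serialized to a JSON string.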
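
The request rules above can be approximated in plain Python. This is a sketch of the documented behavior, not MLflow's actual implementation: parse_request is a hypothetical name, and the real judge may validate the message format or serialize JSON differently.

```python
import json
from typing import Any


def parse_request(inputs: dict[str, Any]) -> str:
    """Approximate the documented rules for deriving `request` from `inputs`.

    Hypothetical helper for illustration only; not an MLflow API.
    """
    messages = inputs.get("messages")
    if isinstance(messages, list) and messages:
        if len(messages) == 1:
            # Single message: use its content directly.
            return messages[0]["content"]
        # Multiple messages: serialize the whole array to a JSON string.
        return json.dumps(messages)
    # Anything else: serialize the entire inputs dict to a JSON string.
    return json.dumps(inputs)


single = {"messages": [{"role": "user", "content": "How can I reset my password?"}]}
print(parse_request(single))
print(parse_request({"key1": "Explain MLflow evaluation", "key2": "something else"}))
```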

Request examples

Single message input:

Python
# Input
inputs = {
    "messages": [
        {"role": "user", "content": "How can I reset my password?"}
    ]
}

# Parsed request
"How can I reset my password?"

Multi-turn conversation:

Python
# Input
inputs = {
    "messages": [
        {"role": "user", "content": "What is MLflow?"},
        {"role": "assistant", "content": "MLflow is an open source AI engineering platform..."},
        {"role": "user", "content": "Tell me more about tracing"}
    ]
}

# Parsed request (JSON string)
'[{"role": "user", "content": "What is MLflow?"}, {"role": "assistant", "content": "MLflow is an open source AI engineering platform..."}, {"role": "user", "content": "Tell me more about tracing"}]'

Arbitrary dictionary:

Python
# Input
inputs = {"key1": "Explain MLflow evaluation", "key2": "something else"}

# Parsed request
'{"key1": "Explain MLflow evaluation", "key2": "something else"}'

Response

The response field is extracted from the provided outputs.

If outputs contains an OpenAI-format ChatCompletions object:
- response is the content of the first choice.
If outputs contains a messages key with an array of OpenAI-format chat messages:
- response is the content of the last message.
Otherwise, response is outputs serialized to a JSON string.
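
The response rules above can be sketched the same way. parse_response is a hypothetical name and handles only dict outputs; it mirrors the documented precedence (choices, then messages, then JSON serialization) and is not MLflow's actual implementation.

```python
import json
from typing import Any


def parse_response(outputs: dict[str, Any]) -> str:
    """Approximate the documented rules for deriving `response` from `outputs`.

    Hypothetical helper for illustration only; not an MLflow API.
    """
    choices = outputs.get("choices")
    if isinstance(choices, list) and choices:
        # ChatCompletions-style output: content of the first choice.
        return choices[0]["message"]["content"]
    messages = outputs.get("messages")
    if isinstance(messages, list) and messages:
        # Messages-style output: content of the last message.
        return messages[-1]["content"]
    # Anything else: serialize the outputs dict to a JSON string.
    return json.dumps(outputs)


chat_completion = {"choices": [{"message": {"content": "MLflow evaluation helps measure GenAI quality..."}}]}
print(parse_response(chat_completion))
```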

Response examples

ChatCompletion output:

Python
# Output (simplified)
outputs = {
    "choices": [{
        "message": {
            "content": "MLflow evaluation helps measure GenAI quality..."
        }
    }]
}

# Parsed response
"MLflow evaluation helps measure GenAI quality..."

Messages-format output:

Python
# Output
outputs = {
    "messages": [
        {"role": "user", "content": "What are the ..."},
        {"role": "assistant", "content": "Here are the key features..."}
    ]
}

# Parsed response
"Here are the key features..."

Arbitrary dictionary:

Python
# Output
outputs = {"key1": "Explain MLflow evaluation", "key2": "something else"}

# Parsed response
'{"key1": "Explain MLflow evaluation", "key2": "something else"}'

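`ExpectationsGuidelines()` judge: Per-row guidelines

The ExpectationsGuidelines judge evaluates each row against row-specific guidelines written by domain experts. This lets every example in the dataset have its own evaluation criteria.

When to use

Use this judge when:

Domain experts have labeled specific examples with custom guidelines
Different rows require different evaluation criteria

Example

In guidelines, refer to the app's inputs as request and the app's outputs as response.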

Python
from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow

# Dataset with per-row guidelines
data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {
            "guidelines": ["The response must be factual and concise"]
        }
    },
    {
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
        "expectations": {
            "guidelines": ["The response must be helpful and encouraging"]
        }
    }
]

# Evaluate with per-row guidelines
results = mlflow.genai.evaluate(
    data=data,
    scorers=[ExpectationsGuidelines()]
)
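
Because this judge reads guidelines from each row, it can help to confirm that every row carries them before running an evaluation. The following is a minimal sketch under that assumption; rows_missing_guidelines is a hypothetical helper, not an MLflow function, and it only inspects the expectations.guidelines key used in the example above.

```python
from typing import Any


def rows_missing_guidelines(data: list[dict[str, Any]]) -> list[int]:
    """Return indexes of rows without a non-empty expectations.guidelines entry.

    Hypothetical pre-flight check for illustration only; not an MLflow API.
    """
    missing = []
    for index, row in enumerate(data):
        guidelines = (row.get("expectations") or {}).get("guidelines")
        if not guidelines:
            missing.append(index)
    return missing


rows = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {"guidelines": ["The response must be factual and concise"]},
    },
    {
        # This row has no expectations, so it is reported as missing.
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
    },
]
print(rows_missing_guidelines(rows))  # [1]
```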

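Return values

Guidelines judges return an mlflow.entities.Feedback object containing:

value: "yes" (meets the guidelines) or "no" (fails the guidelines)
rationale: Detailed explanation of why the content passed or failed
name: The assessment name (provided or auto-generated)
error: Error details if the evaluation failed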
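
To illustrate how these fields might be consumed, the sketch below tallies verdicts and computes a pass rate. It uses plain dicts with the field names listed above as stand-ins for Feedback objects, so it runs without MLflow; summarize_verdicts is a hypothetical helper.

```python
from collections import Counter
from typing import Any


def summarize_verdicts(feedback_rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Count "yes", "no", and errored verdicts and compute the pass rate.

    Plain dicts stand in for Feedback objects; this is an illustration only.
    """
    counts: Counter = Counter()
    for row in feedback_rows:
        if row.get("error"):
            # Errored evaluations are excluded from the pass rate.
            counts["error"] += 1
        else:
            counts[row["value"]] += 1
    scored = counts["yes"] + counts["no"]
    return {
        "yes": counts["yes"],
        "no": counts["no"],
        "error": counts["error"],
        "pass_rate": counts["yes"] / scored if scored else None,
    }


feedback = [
    {"name": "tone", "value": "yes", "rationale": "Courteous and empathetic."},
    {"name": "tone", "value": "no", "rationale": "Dismissive wording."},
    {"name": "tone", "value": None, "error": "Judge call failed"},
]
print(summarize_verdicts(feedback))
```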

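Best practices for writing effective guidelines

Well-written guidelines are essential for accurate evaluation. This section describes best practices for writing them.

Reference context variables

Include any key from the context dictionary directly in your guidelines.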

Python
# Example 1: Validate against retrieved documents
context = {
    "request": "What is the refund policy?",
    "response": "You can return items within 30 days for a full refund.",
    "retrieved_documents": ["Policy: Returns accepted within 30 days", "Policy: No refunds after 30 days"]
}
guideline = "The response must only include information from retrieved_documents"

# Example 2: Check user preferences
context = {
    "request": "Recommend a restaurant",
    "response": "I suggest trying the new steakhouse downtown",
    "user_preferences": {"dietary_restrictions": "vegetarian", "cuisine": "Italian"}
}
guideline = "The response must respect user_preferences when making recommendations"

# Example 3: Enforce business rules
context = {
    "request": "Can you apply a discount?",
    "response": "I've applied a 15% discount to your order",
    "max_allowed_discount": 10,
    "user_tier": "silver"
}
guideline = "The response must not exceed max_allowed_discount for the user_tier"

# Example 4: Multiple constraints
context = {
    "request": "Tell me about product features",
    "response": "This product includes features A, B, and C",
    "approved_features": ["A", "B", "C", "D"],
    "deprecated_features": ["X", "Y", "Z"]
}
guideline = """The response must:
- Only mention approved_features
- Not include deprecated_features"""
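
Since guidelines refer to context keys by name, a misspelled key is easy to miss. As a rough sketch, the hypothetical helper below lists which context keys a guideline actually mentions as whole words; it is a local sanity check, not part of MLflow, and it says nothing about how the judge itself resolves keys.

```python
import re
from typing import Any


def referenced_context_keys(guideline: str, context: dict[str, Any]) -> list[str]:
    """Return the context keys that appear as whole words in the guideline.

    Hypothetical helper for illustration only; not an MLflow API.
    """
    return [
        key
        for key in context
        if re.search(rf"\b{re.escape(key)}\b", guideline)
    ]


context = {
    "request": "Can you apply a discount?",
    "response": "I've applied a 15% discount to your order",
    "max_allowed_discount": 10,
    "user_tier": "silver",
}
guideline = "The response must not exceed max_allowed_discount for the user_tier"
print(referenced_context_keys(guideline, context))
# ['response', 'max_allowed_discount', 'user_tier']
```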

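Additional guidelines

Be specific and measurable ✅ "The response must not include specific prices or percentages" ❌ "Don't talk about money"

Use clear pass/fail conditions ✅ "If asked about pricing, the response must direct the user to the pricing page" ❌ "Handle pricing questions appropriately"

Reference context explicitly ✅ "The response must only use facts present in retrieved_context" ❌ "Be factual"

Structure complex requirements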

Python
guideline = """The response must:
- Include a greeting if first message
- Address the user's specific question
- End with an offer to help further
- Not exceed 150 words"""
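
When requirements are maintained as a list (for example, in a spreadsheet owned by domain experts), a structured guideline like the one above can be assembled programmatically. This is a minimal sketch; build_guideline is a hypothetical helper, and the bulleted layout simply mirrors the example.

```python
def build_guideline(subject: str, requirements: list[str]) -> str:
    """Assemble a multi-line guideline with one bullet per requirement.

    Hypothetical helper for illustration only; blank requirements are skipped.
    """
    lines = [f"{subject} must:"]
    lines += [f"- {item.strip()}" for item in requirements if item.strip()]
    return "\n".join(lines)


guideline = build_guideline(
    "The response",
    [
        "Include a greeting if first message",
        "Address the user's specific question",
        "End with an offer to help further",
        "Not exceed 150 words",
    ],
)
print(guideline)
```

The resulting string can be passed as the guidelines value of a judge in the same way as the hand-written version.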

Real-world examples

Customer service chatbot

The following practical examples show guidelines for evaluating a customer service chatbot across different scenarios.

Global guidelines for all interactions

Python
from mlflow.genai.scorers import Guidelines
import mlflow

# Define global standards for all customer interactions
tone_guidelines = Guidelines(
    name="customer_service_tone",
    guidelines="""The response must maintain our brand voice which is:
    - Professional yet warm and conversational (avoid corporate jargon)
    - Empathetic, acknowledging emotional context before jumping to solutions
    - Proactive in offering help without being pushy

    Specifically:
    - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
    - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
    - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
    - The response must end with a specific next step or open-ended offer to help, not generic closings"""
)

compliance_guidelines = Guidelines(
    name="policy_compliance",
    guidelines=[
        """Refund and return policies:
        - The response must not promise refunds beyond the 30-day return window
        - The response must mention restocking fees for electronics (15%) if discussing returns
        - For items over $500, the response must note that manager approval is required
        - The response must not waive shipping fees without checking customer loyalty tier""",

        """Data privacy and security:
        - The response must never ask for full credit card numbers, SSN, or passwords
        - The response must not reference other customers' orders or information
        - When discussing order details, the response must only reference the last 4 digits of payment methods
        - The response must direct customers to secure portal for updating payment information""",

        """Commitment limitations:
        - The response must not guarantee specific delivery dates without system verification
        - The response must include disclaimer about "estimated" times when discussing shipping
        - The response must not promise price matches without mentioning the verification process
        - For out-of-stock items, the response must not commit to restock dates"""
    ]
)

# Example customer service data
customer_interactions = [
    {
        "inputs": {
            "messages": [{"role": "user", "content": "I've been waiting 2 hours on hold! This is ridiculous!"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I completely understand your frustration about the long wait time. That's absolutely not the experience we want you to have. I'm here to help you right now - what can I assist you with today?"
                }
            }]
        }
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "Can I return my laptop? I bought it 45 days ago."}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I'd be happy to look into return options for you. Since our standard return window is 30 days and your purchase was 45 days ago, I'll need to check if there are any exceptions we can make. For electronics like laptops, we do have a 15% restocking fee. May I have your order number to review your specific situation?"
                }
            }]
        }
    }
]

# Evaluate customer service interactions
results = mlflow.genai.evaluate(
    data=customer_interactions,
    scorers=[tone_guidelines, compliance_guidelines]
)

Per-row guidelines for specific scenarios

Python
from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow

# Dataset with scenario-specific guidelines
customer_service_data = [
    {
        "inputs": {
            "messages": [{"role": "user", "content": "My package is 3 weeks late and I need it for an event tomorrow!"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I can absolutely understand how stressful this must be, especially with your event tomorrow. I'm so sorry your package has been delayed for 3 weeks - that's completely unacceptable. Let me help you immediately. I'm checking our local store inventory right now for same-day pickup options. I'm also processing a full refund of your shipping costs and applying a 25% discount to your order for this inconvenience. Would you prefer overnight delivery to arrive by 10 AM tomorrow, or would same-day pickup from our downtown location work better for you? I'm also connecting you with our escalation team - you'll receive a call within the next hour at the number on your account."
                }
            }]
        },
        "expectations": {
            "guidelines": [
                """The response must handle this urgent situation with exceptional care:
                - First acknowledge both the delay AND the urgent tomorrow deadline
                - Express genuine empathy using phrases like "I can absolutely understand how stressful this must be"
                - Must NOT make excuses or blame shipping partners""",

                """The response must provide immediate actionable solutions:
                - Offer to check local store availability for same-day pickup
                - If available, offer overnight shipping at no charge given the delay
                - Provide direct escalation path: manager callback within 2 hours
                - Include the executive customer service email for formal complaints""",

                """The response must include compensation options without being asked:
                - Mention automatic refund of original shipping costs
                - Offer 20-30% discount on current order due to inconvenience
                - Provide credit for future purchase if customer prefers"""
            ]
        }
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "How do I cancel my subscription?"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I can help you cancel your subscription right away. Here's how: 1) Click 'Account Settings' in the top right corner, 2) Select 'Subscription' from the menu, 3) Click 'Cancel Subscription' and confirm. Your access will continue until March 15, 2024 (the end of your current billing period), and you won't be charged again. All your saved data and preferences will be kept for 30 days if you decide to return. You're entitled to a prorated refund of $12.50 for the unused portion of this month. Alternatively, if you'd prefer, I can offer you a 50% discount for the next 3 months or help you switch to our Basic plan which is $4.99/month."
                }
            }]
        },
        "expectations": {
            "guidelines": [
                """The response must respect the customer's decision to cancel:
                - No guilt-inducing language or excessive retention attempts
                - No hiding the cancellation process behind multiple steps
                - Must not require phone calls if account was created online
                - Acknowledge their right to cancel immediately without questions""",

                """The response must provide complete cancellation information:
                - State the exact steps (maximum 3) to cancel online
                - Clarify the last billing date and when access ends
                - Explain what happens to their data, saved preferences, and history
                - Mention the 30-day reactivation window where data is preserved
                - Note any partial month refunds they're entitled to""",

                """The response may include ONE gentle retention offer:
                - Can mention a pause option instead of cancellation
                - Can offer a one-time discount or downgrade option
                - Must present this as optional, after the cancellation instructions
                - Must use soft language like 'alternatively' or 'if you'd prefer'"""
            ]
        }
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "I was charged twice for my order #12345"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I see the duplicate charge on your order #12345, and I'll resolve this for you right away. I've initiated the refund for the duplicate charge of $157.99. For credit cards, you'll see this reflected in 3-5 business days, and you should see a pending reversal on your account within 24 hours. I'm sending you a confirmation email right now with the refund transaction ID (REF-789012) and all the details. Your case reference number is CS-456789 if you need to follow up. Since this occurred near month-end, if this causes any overdraft fees, please let us know - we'll reimburse up to $35 in bank fees. Our billing team's direct line is 1-800-555-0123 ext 2 if you need immediate assistance. This won't affect your credit or any future orders with us, and we're investigating our payment system to prevent this from happening again."
                }
            }]
        },
        "expectations": {
            "guidelines": [
                """The response must immediately validate the customer's concern:
                - Acknowledge the duplicate charge without skepticism
                - Must not ask for proof or screenshots initially
                - Express understanding of the inconvenience and potential financial impact
                - Take ownership with phrases like 'I'll resolve this for you right away'""",

                """The response must provide specific resolution details:
                - State exact refund timeline (e.g., '3-5 business days for credit cards, 5-7 for debit')
                - Mention that they'll see a pending reversal within 24 hours
                - Offer to send detailed confirmation email with transaction IDs
                - Provide a reference number for this billing dispute
                - Include the direct billing department contact for follow-up""",

                """The response must address potential concerns proactively:
                - If near month-end, acknowledge potential impact on their budget
                - Offer to provide a letter for their bank if overdraft fees occurred
                - Mention our overdraft reimbursement policy (up to $35)
                - Assure that this won't affect their credit or future orders
                - Note that we're investigating to prevent future occurrences"""
            ]
        }
    }
]

results = mlflow.genai.evaluate(
    data=customer_service_data,
    scorers=[ExpectationsGuidelines()]
)

Document extraction app

The following practical examples show guidelines for evaluating a document extraction application.

Global guidelines for extraction quality

Python
from mlflow.genai.scorers import Guidelines
import mlflow

# Define extraction accuracy standards
extraction_accuracy = Guidelines(
    name="extraction_accuracy",
    guidelines=[
        """Field extraction completeness and accuracy:
        - The response must extract ALL requested fields, using exact values from source
        - For ambiguous data, the response must extract the most likely value and include a confidence score
        - When multiple values exist for one field (e.g., multiple addresses), extract all and label them
        - Preserve original formatting for IDs, reference numbers, and codes (including leading zeros)
        - For missing fields, use null with reason: {"field": null, "reason": "not_found"} """,

        """Numerical and financial data handling:
        - Currency values must preserve exact decimal places as shown in source
        - Must differentiate between currencies if multiple are present (USD, EUR, etc.)
        - Percentage values must clarify if they're decimals (0.15) or percentages (15%)
        - For calculated fields (totals, tax), must match source exactly - no recalculation
        - Negative values must be preserved with proper notation (-$100 or ($100))""",

        """Entity recognition and validation:
        - Company names must be extracted exactly as written (including suffixes like Inc., LLC)
        - Person names must preserve original order and formatting
        - Must not merge similar entities (e.g., "John Smith" and "J. Smith" are kept separate)
        - Email addresses and phone numbers must be validated for basic format
        - Physical addresses must include all components present in source"""
    ]
)

format_compliance = Guidelines(
    name="output_format",
    guidelines="""Output structure must meet these enterprise data standards:

    JSON Structure Requirements:
    - Must be valid JSON that passes strict parsing
    - All field names must use snake_case consistently
    - Nested objects must maintain hierarchy from source document
    - Arrays must be used for multiple values, never concatenated strings

    Data Type Standards:
    - Dates: ISO 8601 format (YYYY-MM-DD) with timezone if available
    - Timestamps: ISO 8601 with time (YYYY-MM-DDTHH:MM:SSZ)
    - Currency: {"amount": 1234.56, "currency": "USD", "formatted": "$1,234.56"}
    - Phone: {"number": "+14155551234", "formatted": "(415) 555-1234", "type": "mobile"}
    - Boolean: true/false (not "yes"/"no" or 1/0)

    Metadata Requirements:
    - Include extraction_timestamp in UTC
    - Include source_page for multi-page documents
    - Include confidence_score (0-1) for each ML-extracted field
    - Include validation_flags array for any data quality issues detected"""
)

# Example document extraction data
extraction_tasks = [
    {
        "inputs": {
            "document_text": "Invoice #INV-2024-001\nDate: 2024-01-15\nBill To: Acme Corp\n123 Main St, Suite 100\nAnytown, CA 94000\n\nItems:\n- Widget Pro (SKU: WP-100) - Qty: 10 x $50.00 = $500.00\n- Service Fee - $100.00\n\nSubtotal: $600.00\nTax (8.75%): $52.50\nTotal: $652.50\n\nDue Date: 2024-02-15\nPayment Terms: Net 30",
            "fields_to_extract": ["invoice_number", "customer", "total_amount", "due_date", "line_items"]
        },
        "outputs": {
            "invoice_number": "INV-2024-001",
            "customer": {
                "name": "Acme Corp",
                "address": {
                    "street": "123 Main St, Suite 100",
                    "city": "Anytown",
                    "state": "CA",
                    "zip": "94000"
                }
            },
            "total_amount": {
                "amount": 652.50,
                "currency": "USD",
                "formatted": "$652.50"
            },
            "due_date": "2024-02-15",
            "line_items": [
                {
                    "description": "Widget Pro",
                    "sku": "WP-100",
                    "quantity": 10,
                    "unit_price": 50.00,
                    "total": 500.00
                },
                {
                    "description": "Service Fee",
                    "quantity": 1,
                    "unit_price": 100.00,
                    "total": 100.00
                }
            ],
            "extraction_timestamp": "2024-01-20T10:30:00Z",
            "source_page": 1,
            "confidence_score": 0.95
        }
    },
    {
        "inputs": {
            "document_text": "Contract between TechStart Inc. and CloudProvider LLC\nEffective Date: January 1, 2024\nContract ID: C-2024-789\n\nThis agreement outlines cloud hosting services...\nMonthly Fee: €5,000\nContract Term: 24 months\nCancellation: 90 days written notice required",
            "fields_to_extract": ["contract_id", "parties", "monthly_fee", "term_length"]
        },
        "outputs": {
            "contract_id": "C-2024-789",
            "parties": [
                {"name": "TechStart Inc.", "role": "customer"},
                {"name": "CloudProvider LLC", "role": "provider"}
            ],
            "monthly_fee": {
                "amount": 5000.00,
                "currency": "EUR",
                "formatted": "€5,000"
            },
            "term_length": {
                "duration": 24,
                "unit": "months"
            },
            "cancellation_notice": {
                "days": 90,
                "type": "written"
            },
            "extraction_timestamp": "2024-01-20T10:35:00Z",
            "confidence_score": 0.92
        }
    }
]

# Evaluate document extractions
results = mlflow.genai.evaluate(
    data=extraction_tasks,
    scorers=[extraction_accuracy, format_compliance]
)

Per-row guidelines for document types

Python
from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow

# Dataset with document-type specific guidelines
document_extraction_data = [
    {
        "inputs": {
            "document_type": "invoice",
            "document_text": "Invoice #INV-2024-001\nBill To: Acme Corp\nAmount: $1,234.56\nDue Date: 2024-03-15"
        },
        "outputs": {
            "invoice_number": "INV-2024-001",
            "customer": "Acme Corp",
            "total_amount": 1234.56,
            "due_date": "2024-03-15"
        },
        "expectations": {
            "guidelines": [
                """Invoice identification and classification:
                - Must extract invoice_number preserving exact format including prefixes/suffixes
                - Must identify invoice type (standard, credit memo, proforma) if specified
                - Must extract both invoice date and due date, calculating days until due
                - Must identify if this is a partial, final, or supplementary invoice
                - For recurring invoices, must extract frequency and period covered""",

                """Financial data extraction and validation:
                - Line items must be extracted as array with: description, quantity, unit_price, total
                - Must identify and separate: subtotal, tax amounts (with rates), shipping, discounts
                - Currency must be identified explicitly, not assumed to be USD
                - For discounts, must specify if percentage or fixed amount and what it applies to
                - Payment terms must be extracted (e.g., "Net 30", "2/10 Net 30")
                - Must flag any mathematical inconsistencies between line items and totals""",

                """Vendor and customer information:
                - Must extract complete billing and shipping addresses as separate objects
                - Company names must include any DBA ("doing business as") variations
                - Must extract tax IDs, business registration numbers if present
                - Contact information must be categorized (billing contact vs. delivery contact)
                - Must preserve any customer account numbers or reference codes"""
            ]
        }
    },
    {
        "inputs": {
            "document_type": "contract",
            "document_text": "This agreement between Party A and Party B commences on January 1, 2024..."
        },
        "outputs": {
            "parties": ["Party A", "Party B"],
            "effective_date": "2024-01-01",
            "term_length": "Not specified"
        },
        "expectations": {
            "guidelines": [
                """Party identification and roles:
                - Must extract all parties with their full legal names and entity types (Inc., LLC, etc.)
                - Must identify party roles (buyer/seller, licensee/licensor, employer/employee)
                - Must extract any parent company relationships or guarantors mentioned
                - Must capture all representatives, their titles, and authority to sign
                - Must identify jurisdiction for each party if specified""",

                """Critical dates and terms extraction:
                - Must differentiate between: execution date, effective date, and expiration date
                - Must extract notice periods for termination (e.g., "30 days written notice")
                - Must identify any automatic renewal clauses and their conditions
                - Must extract all milestone dates and deliverable deadlines
                - For amendments, must note which version/date of original contract is modified""",

                """Obligations and risk analysis:
                - Must extract all payment terms, amounts, and schedules
                - Must identify liability caps, indemnification clauses, and insurance requirements
                - Must flag any non-standard clauses that deviate from typical contracts
                - Must extract all conditions precedent and subsequent
                - Must identify dispute resolution mechanism (arbitration, litigation, jurisdiction)
                - Must extract any non-compete, non-solicitation, or confidentiality periods"""
            ]
        }
    },
    {
        "inputs": {
            "document_type": "medical_record",
            "document_text": "Patient: John Doe\nDOB: 1985-06-15\nDiagnosis: Type 2 Diabetes\nMedications: Metformin 500mg"
        },
        "outputs": {
            "patient_name": "John Doe",
            "date_of_birth": "1985-06-15",
            "diagnoses": ["Type 2 Diabetes"],
            "medications": [{"name": "Metformin", "dosage": "500mg"}]
        },
        "expectations": {
            "guidelines": [
                """HIPAA compliance and privacy protection:
                - Must never extract full SSN (only last 4 digits if needed for matching)
                - Must never include full insurance policy numbers or member IDs
                - Must redact or generalize sensitive mental health or substance abuse information
                - For minors, must flag records requiring additional consent for sharing
                - Must not extract genetic testing results without explicit permission flag""",

                """Clinical data extraction standards:
                - Diagnoses must use ICD-10 codes when available, with lay descriptions
                - Medications must include: generic name, brand name, dosage, frequency, route, start date
                - Must differentiate between active medications and discontinued/past medications
                - Allergies must specify type (drug, food, environmental) and reaction severity
                - Lab results must include: value, unit, reference range, abnormal flags
                - Vital signs must include measurement date/time and measurement conditions""",

                """Data quality and medical accuracy:
                - Must flag any potentially dangerous drug interactions if multiple meds listed
                - Must identify if vaccination records are up-to-date based on CDC guidelines
                - Must extract both chief complaint and final diagnosis separately
                - For chronic conditions, must note date of first diagnosis vs. most recent visit
                - Must preserve clinical abbreviations but also provide expansions
                - Must extract provider name, credentials, and NPI number if available"""
            ]
        }
    }
]

results = mlflow.genai.evaluate(
    data=document_extraction_data,
    scorers=[ExpectationsGuidelines()]
)

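Information about the models powering LLM judges

LLM judges might use third-party services to evaluate your AI applications, including Azure OpenAI operated by Microsoft.
For Azure OpenAI, Databricks has opted out of Abuse Monitoring, so no prompts or responses are stored with Azure OpenAI.
If cross-Geo processing is disabled, LLM judges process content in the workspace's Databricks Geo. If no suitable in-Geo model is available, the judge returns a Geo restriction error. If cross-Geo processing is enabled, LLM judges can process content in other Geos.
Disabling partner-powered AI features prevents LLM judges from calling partner-powered models. You can still use LLM judges by providing your own model.
LLM judges are intended to help customers evaluate their AI agents and applications. LLM judge outputs should not be used to train, improve, or fine-tune an LLM.

Additional resources

Use built-in LLM judges - Evaluate quality with MLflow's other research-backed built-in LLM judges
Create custom LLM judges - Build custom judges tailored to your specific needs
Align judges with human feedback - Improve judge accuracy to match your quality standards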
