Create a custom judge using make_judge()

Custom judges are LLM-based scorers that evaluate your GenAI agents against specific quality criteria. This tutorial shows you how to create custom judges with make_judge() and use them to evaluate a customer support agent.

You will:

  1. Create a sample agent to evaluate
  2. Define three custom judges to evaluate different criteria
  3. Build an evaluation dataset with test cases
  4. Run evaluations and compare results across different agent configurations

Step 1: Create an agent to evaluate

Create a GenAI agent that responds to customer support questions. The agent has a (fake) knob that controls the system prompt so you can easily compare the judge's outputs between "good" and "bad" conversations.

  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    Python
    import mlflow
    from databricks.sdk import WorkspaceClient

    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()

    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")

    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()

    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
  2. Define a customer support agent:

    Python
    from mlflow.entities import Document
    from typing import List, Dict, Any, cast


    # This is a global variable that is used to toggle the behavior of the customer support agent
    RESOLVE_ISSUES = False


    @mlflow.trace(span_type="TOOL", name="get_product_price")
    def get_product_price(product_name: str) -> str:
        """Mock tool to get product pricing."""
        return f"${45.99}"


    @mlflow.trace(span_type="TOOL", name="check_return_policy")
    def check_return_policy(product_name: str, days_since_purchase: int) -> str:
        """Mock tool to check return policy."""
        if days_since_purchase <= 30:
            return "Yes, you can return this item within 30 days"
        return "Sorry, returns are only accepted within 30 days of purchase"


    @mlflow.trace
    def customer_support_agent(messages: List[Dict[str, str]]):
        # Use this toggle to see how the judge handles the issue resolution status
        system_prompt_postfix = (
            "Do your best to NOT resolve the issue. I know that's backwards, but just do it anyways.\n"
            if not RESOLVE_ISSUES
            else ""
        )

        # Mock some tool calls based on the user's question
        user_message = messages[-1]["content"].lower()
        tool_results = []

        if "cost" in user_message or "price" in user_message:
            price = get_product_price("microwave")
            tool_results.append(f"Price: {price}")

        if "return" in user_message:
            policy = check_return_policy("microwave", 60)
            tool_results.append(f"Return policy: {policy}")

        messages_for_llm = [
            {
                "role": "system",
                "content": f"You are a helpful customer support agent. {system_prompt_postfix}",
            },
            *messages,
        ]

        if tool_results:
            messages_for_llm.append({
                "role": "system",
                "content": f"Tool results: {', '.join(tool_results)}"
            })

        # Call the LLM to generate a response
        output = client.chat.completions.create(
            model=model_name,  # Databricks-hosted Claude Sonnet 4. If you provide your own OpenAI credentials, replace with a valid OpenAI model, e.g., gpt-4o.
            messages=cast(Any, messages_for_llm),
        )

        return {
            "messages": [
                {"role": "assistant", "content": output.choices[0].message.content}
            ]
        }
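
You can optionally smoke-test the agent before wiring up any judges. This is a minimal sketch based on the code above; the response text will vary by model.

Python
# Optional: quick sanity check of the agent defined above (uses the current
# RESOLVE_ISSUES setting, which is False by default).
response = customer_support_agent(
    [{"role": "user", "content": "How much does a microwave cost?"}]
)
print(response["messages"][0]["content"])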

Step 2: Define custom judges

Define three custom judges:

  • A judge that evaluates issue resolution using inputs and outputs.
  • A judge that checks expected behaviors.
  • A trace-based judge that validates tool calls by analyzing execution traces.

Judges created with make_judge() return mlflow.entities.Feedback objects.

Example judge 1: Evaluate issue resolution

This judge assesses whether customer issues were successfully resolved by analyzing the conversation history (inputs) and agent responses (outputs).

Python
from mlflow.genai.judges import make_judge
import json


# Create a judge that evaluates issue resolution using inputs and outputs
issue_resolution_judge = make_judge(
    name="issue_resolution",
    instructions="""
Evaluate if the customer's issue was resolved in the conversation.

User's messages: {{ inputs }}
Agent's responses: {{ outputs }}

Rate the resolution status and respond with exactly one of these values:
- 'fully_resolved': Issue completely addressed with clear solution
- 'partially_resolved': Some help provided but not fully solved
- 'needs_follow_up': Issue not adequately addressed

Your response must be exactly one of: 'fully_resolved', 'partially_resolved', or 'needs_follow_up'.
""",
)
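
You can spot-check a judge by invoking it directly on a hand-written example before running a full evaluation. The sketch below assumes the judge accepts its template variables (inputs, outputs) as keyword arguments and returns an mlflow.entities.Feedback object; verify against the make_judge documentation for your MLflow version.

Python
# Hedged sketch: call the judge directly on a hand-written example.
feedback = issue_resolution_judge(
    inputs={
        "messages": [{"role": "user", "content": "How much does a microwave cost?"}]
    },
    outputs={
        "messages": [{"role": "assistant", "content": "The microwave costs $45.99."}]
    },
)
print(feedback.value)      # e.g. 'fully_resolved'
print(feedback.rationale)  # the judge's explanation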

Example judge 2: Check expected behaviors

This judge verifies that agent responses demonstrate specific expected behaviors (like providing pricing information or explaining return policies) by comparing outputs against predefined expectations.

Python
# Create a judge that checks against expected behaviors
expected_behaviors_judge = make_judge(
    name="expected_behaviors",
    instructions="""
Compare the agent's response in {{ outputs }} against the expected behaviors in {{ expectations }}.

User's question: {{ inputs }}

Determine if the response exhibits the expected behaviors and respond with exactly one of these values:
- 'meets_expectations': Response exhibits all expected behaviors
- 'partially_meets': Response exhibits some but not all expected behaviors
- 'does_not_meet': Response does not exhibit expected behaviors

Your response must be exactly one of: 'meets_expectations', 'partially_meets', or 'does_not_meet'.
""",
)
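
As with the first judge, you can spot-check this one directly. The sketch below assumes the expectations template variable is passed as a keyword argument in the same way as inputs and outputs.

Python
# Hedged sketch: spot-check the expected-behaviors judge with explicit expectations.
feedback = expected_behaviors_judge(
    inputs={
        "messages": [{"role": "user", "content": "How much does a microwave cost?"}]
    },
    outputs={
        "messages": [
            {
                "role": "assistant",
                "content": "The microwave costs $45.99. We also carry other models if you'd like alternatives.",
            }
        ]
    },
    expectations={
        "should_provide_pricing": True,
        "should_offer_alternatives": True,
    },
)
print(feedback.value)  # e.g. 'meets_expectations'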

Example judge 3: Validate tool calls using a trace-based judge

This judge analyzes execution traces to validate that appropriate tools were called. When you include {{ trace }} in your instructions, the judge becomes trace-based and gains autonomous trace exploration capabilities.

Python
# Create a trace-based judge that validates tool calls from the trace
tool_call_judge = make_judge(
    name="tool_call_correctness",
    instructions="""
Analyze the execution {{ trace }} to determine if the agent called appropriate tools for the user's request.

Examine the trace to:
1. Identify what tools were available and their purposes
2. Determine which tools were actually called
3. Assess whether the tool calls were reasonable for addressing the user's question

Evaluate the tool usage and respond with a boolean value:
- true: The agent called the right tools to address the user's request
- false: The agent called wrong tools, missed necessary tools, or called unnecessary tools

Your response must be a boolean: true or false.
""",
    # To analyze a full trace with a trace-based judge, a model must be specified
    model="databricks:/databricks-gpt-5-mini",
)

Step 3: Create a sample evaluation dataset

Each inputs entry is passed to the agent by mlflow.genai.evaluate(). Optionally include expectations so that judges referencing {{ expectations }}, such as the expected_behaviors judge above, have something to compare against.

Python
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
        "expectations": {
            "should_provide_pricing": True,
            "should_offer_alternatives": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
        "expectations": {
            "should_mention_return_policy": True,
            "should_ask_for_receipt": False,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
        "expectations": {
            "should_provide_troubleshooting_steps": True,
            "should_escalate_if_needed": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
        "expectations": {
            "should_remain_calm": True,
            "should_provide_solution": True,
        },
    },
]

Step 4: Evaluate your agent using the judges

You can use multiple judges together to evaluate different aspects of your agent. Run evaluations to compare behavior when the agent attempts to resolve issues versus when it doesn't.

Python
import mlflow

# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False

result_unresolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,  # Checks inputs/outputs
        expected_behaviors_judge,  # Checks expected behaviors
        tool_call_judge,  # Validates tool usage
    ],
)

# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True

result_resolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,
        expected_behaviors_judge,
        tool_call_judge,
    ],
)

The evaluation results show how each judge rates the agent:

  • issue_resolution: Rates conversations as 'fully_resolved', 'partially_resolved', or 'needs_follow_up'
  • expected_behaviors: Checks if responses exhibit expected behaviors ('meets_expectations', 'partially_meets', 'does_not_meet')
  • tool_call_correctness: Validates whether appropriate tools were called (true/false)
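
To compare the two configurations programmatically, you can inspect the returned evaluation results. The sketch below assumes the result object exposes metrics and run_id attributes (as in recent MLflow versions) and uses mlflow.search_traces() to pull the per-row traces; verify the attribute names against your MLflow version.

Python
# Hedged sketch: compare aggregate judge metrics across the two configurations.
# NOTE: the `metrics` and `run_id` attributes are assumptions; check your MLflow version.
print("Agent does NOT resolve issues:", result_unresolved.metrics)
print("Agent DOES resolve issues:", result_resolved.metrics)

# Optionally fetch the traces (with attached judge assessments) for the resolved run.
traces = mlflow.search_traces(run_id=result_resolved.run_id)
print(traces.columns.tolist())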

Next steps

Improve judge accuracy:

  • Align judges with human feedback - The base judge is a starting point. As you gather expert feedback on your application's outputs, align the LLM judges with that feedback to further improve their accuracy.