Custom judges

Custom LLM judges let you define complex and nuanced scoring guidelines for GenAI applications using natural language.

While MLflow's built-in LLM judges are excellent starting points for common quality dimensions, custom judges created with make_judge() give you full control over the evaluation criteria.

Prompts and template variables

To create a judge, you provide a prompt with natural language instructions on how to assess the quality of your agent. make_judge() accepts template variables to access the agent's inputs, outputs, expected outputs or behaviors, and even complete traces.

Your instructions must include at least one template variable, but you don't need to use all of them.

  • {{ inputs }} - Input data provided to the agent
  • {{ outputs }} - Output data generated by your agent
  • {{ expectations }} - Ground truths or expected outcomes
  • {{ trace }} - The complete execution trace of your agent

These are the only supported variables. Custom variables such as {{ question }} raise validation errors; this restriction ensures consistent behavior and prevents template injection issues.
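
The following sketch shows a minimal field-based judge that uses only {{ inputs }} and {{ outputs }}. The judge name and instructions are illustrative, and the model argument is omitted on the assumption that a default judge model is configured.

Python
from mlflow.genai.judges import make_judge

# Minimal field-based judge: compares the agent's output against its input
relevance_judge = make_judge(
    name="relevance",
    instructions=(
        "Evaluate whether the response in {{ outputs }} directly answers "
        "the question asked in {{ inputs }}.\n"
        "Rate as: 'relevant' or 'irrelevant'"
    ),
)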

Trace-based judges

Trace-based judges analyze execution traces to understand what happened during agent execution. They autonomously explore traces using Model Context Protocol (MCP) tools and can:

  • Validate tool usage patterns
  • Identify performance bottlenecks
  • Investigate execution failures
  • Verify multi-step workflows

The following example defines a judge that assesses tool calling correctness by analyzing traces:

Python
from mlflow.genai.judges import make_judge

# Agent judge for tool calling correctness
tool_usage_judge = make_judge(
    name="tool_usage_validator",
    instructions=(
        "Analyze the {{ trace }} to verify correct tool usage.\n\n"
        "Check that the agent selected appropriate tools for the user's request "
        "and called them with correct parameters.\n"
        "Rate as: 'correct' or 'incorrect'"
    ),
    model="databricks:/databricks-gpt-5-mini",  # Required for trace-based judges
)

For trace-based judges to analyze the full trace, the model argument must be specified in make_judge().
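
As a hedged illustration, the judge defined above can be used as a scorer over previously logged traces. Calling mlflow.search_traces() with no arguments (which searches the active experiment) is an assumption here; adjust it to however your traces are stored.

Python
import mlflow

# Fetch traces logged to the active experiment (assumed setup)
traces = mlflow.search_traces()

# Run the trace-based judge over those traces as an evaluation scorer
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[tool_usage_judge],
)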

For a complete tutorial, see Create a custom judge using make_judge().

Model requirements for trace-based judges

Trace-based judges require a model capable of analyzing execution traces. The following models are recommended:

  • databricks:/databricks-gpt-5-mini
  • databricks:/databricks-gpt-5
  • databricks:/databricks-gpt-oss-120b
  • databricks:/databricks-claude-opus-4-1

Best practices for writing judge instructions

Be specific about expected output format. Your instructions should clearly specify what format the judge should return (a categorical example is sketched after this list):

  • Categorical responses: List specific values (for example, 'fully_resolved', 'partially_resolved', 'needs_follow_up')
  • Boolean responses: Explicitly state the judge should return true or false
  • Numeric scores: Specify the scoring range and what each score means
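
For instance, a judge for the categorical case might enumerate the allowed values directly in its instructions. The judge name and categories below are illustrative only.

Python
from mlflow.genai.judges import make_judge

# Hypothetical judge whose instructions spell out the allowed categorical values
resolution_judge = make_judge(
    name="resolution_status",
    instructions=(
        "Review the conversation in {{ inputs }} and the agent's final reply "
        "in {{ outputs }}.\n"
        "Decide how completely the user's issue was handled.\n"
        "Respond with exactly one of: 'fully_resolved', 'partially_resolved', "
        "or 'needs_follow_up'."
    ),
)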

Break down complex evaluations. For complex evaluation tasks, structure your instructions into clear sections (a sketch follows the list below):

  • What to evaluate
  • What information to examine
  • How to make the judgment
  • What format to return
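
The following sketch organizes hypothetical instructions into those four sections; the judge name and criteria are examples, not a prescribed template.

Python
from mlflow.genai.judges import make_judge

# Hypothetical instructions broken into the four sections listed above
grounding_judge = make_judge(
    name="grounding",
    instructions=(
        "What to evaluate: whether the answer in {{ outputs }} is factually "
        "consistent with the context provided in {{ inputs }}.\n\n"
        "What to examine: each factual claim in the answer and the supporting "
        "passages in the input context.\n\n"
        "How to judge: treat a claim as unsupported if no passage states or "
        "implies it.\n\n"
        "Output format: return 'grounded' if every claim is supported, "
        "otherwise 'ungrounded'."
    ),
)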

Align judges with human experts

The base judge is a starting point. As you gather expert feedback on your application's outputs, you can align the judge to that feedback to improve its accuracy further. See Align judges with humans.

Next steps

See Create a custom judge for a hands-on tutorial that demonstrates both standard and trace-based judges.