Custom judges

Custom LLM judges let you define complex and nuanced scoring guidelines for GenAI applications using natural language.

While MLflow's built-in LLM judges are excellent starting points for common quality dimensions, custom judges created with make_judge() give you full control over the evaluation criteria.

Prompts and template variables

To create a judge, you provide a prompt with natural language instructions on how to assess the quality of your agent. make_judge() accepts template variables to access the agent's inputs, outputs, expected outputs or behaviors, and even complete traces.

Your instructions must include at least one template variable, but you don't need to use all of them.

  • {{ inputs }} - Input data provided to the agent
  • {{ outputs }} - Output data generated by your agent
  • {{ expectations }} - Ground truths or expected outcomes
  • {{ trace }} - The complete execution trace of your agent

These are the only supported variables. Custom variables such as {{ question }} raise validation errors; this restriction ensures consistent behavior and prevents template injection issues.
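
The following sketch shows a minimal field-based judge that uses only {{ inputs }} and {{ outputs }}. The judge name and instructions are illustrative, and the model argument is omitted on the assumption that a default judge model is configured.

Python
from mlflow.genai.judges import make_judge

# Minimal field-based judge: compares the agent's output against its input
relevance_judge = make_judge(
    name="relevance",
    instructions=(
        "Evaluate whether the response in {{ outputs }} directly answers "
        "the question asked in {{ inputs }}.\n"
        "Rate as: 'relevant' or 'irrelevant'"
    ),
)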

Trace-based judges

Trace-based judges analyze execution traces to understand what happened during agent execution. They autonomously explore traces using Model Context Protocol (MCP) tools and can:

  • Validate tool usage patterns
  • Identify performance bottlenecks
  • Investigate execution failures
  • Verify multi-step workflows

The following example defines a judge that assesses tool calling correctness by analyzing traces:

Python
from mlflow.genai.judges import make_judge

# Agent judge for tool calling correctness
tool_usage_judge = make_judge(
    name="tool_usage_validator",
    instructions=(
        "Analyze the {{ trace }} to verify correct tool usage.\n\n"
        "Check that the agent selected appropriate tools for the user's request "
        "and called them with correct parameters.\n"
        "Rate as: 'correct' or 'incorrect'"
    ),
    model="databricks:/databricks-gpt-5-mini",  # Required for trace-based judges
)

For trace-based judges to analyze the full trace, the model argument must be specified in make_judge().
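
As a hedged illustration, the judge defined above can be used as a scorer over previously logged traces. Calling mlflow.search_traces() with no arguments (which searches the active experiment) is an assumption here; adjust it to however your traces are stored.

Python
import mlflow

# Fetch traces logged to the active experiment (assumed setup)
traces = mlflow.search_traces()

# Run the trace-based judge over those traces as an evaluation scorer
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[tool_usage_judge],
)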

For a complete tutorial, see Create a custom judge using make_judge().

Model requirements for trace-based judges

Trace-based judges require a model capable of analyzing execution traces. The following models are recommended:

  • databricks:/databricks-gpt-5-mini
  • databricks:/databricks-gpt-5
  • databricks:/databricks-gpt-oss-120b
  • databricks:/databricks-claude-opus-4-1

Best practices for writing judge instructions

Be specific about expected output format. Your instructions should clearly specify what format the judge should return (a categorical example is sketched after this list):

  • Categorical responses: List specific values (for example, 'fully_resolved', 'partially_resolved', 'needs_follow_up')
  • Boolean responses: Explicitly state the judge should return true or false
  • Numeric scores: Specify the scoring range and what each score means
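
For instance, a judge for the categorical case might enumerate the allowed values directly in its instructions. The judge name and categories below are illustrative only.

Python
from mlflow.genai.judges import make_judge

# Hypothetical judge whose instructions spell out the allowed categorical values
resolution_judge = make_judge(
    name="resolution_status",
    instructions=(
        "Review the conversation in {{ inputs }} and the agent's final reply "
        "in {{ outputs }}.\n"
        "Decide how completely the user's issue was handled.\n"
        "Respond with exactly one of: 'fully_resolved', 'partially_resolved', "
        "or 'needs_follow_up'."
    ),
)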

Break down complex evaluations. For complex evaluation tasks, structure your instructions into clear sections (a sketch follows the list below):

  • What to evaluate
  • What information to examine
  • How to make the judgment
  • What format to return
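
The following sketch organizes hypothetical instructions into those four sections; the judge name and criteria are examples, not a prescribed template.

Python
from mlflow.genai.judges import make_judge

# Hypothetical instructions broken into the four sections listed above
grounding_judge = make_judge(
    name="grounding",
    instructions=(
        "What to evaluate: whether the answer in {{ outputs }} is factually "
        "consistent with the context provided in {{ inputs }}.\n\n"
        "What to examine: each factual claim in the answer and the supporting "
        "passages in the input context.\n\n"
        "How to judge: treat a claim as unsupported if no passage states or "
        "implies it.\n\n"
        "Output format: return 'grounded' if every claim is supported, "
        "otherwise 'ungrounded'."
    ),
)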

Align judges with human experts

The base judge is a starting point. As you gather expert feedback on your application's outputs, you can align the judge to that feedback to improve its accuracy further. See Align judges with humans.

Next steps

See Create a custom judge for a hands-on tutorial that demonstrates both standard and trace-based judges.