Custom judges
Custom LLM judges let you define complex and nuanced scoring guidelines for GenAI applications using natural language.
While MLflow's built-in LLM judges offer excellent starting points for common quality dimensions, custom judges created with make_judge() give you full control over the evaluation criteria.
Prompts and template variables
To create a judge, you provide a prompt with natural language instructions on how to assess the quality of your agent. make_judge() accepts template variables to access the agent's inputs, outputs, expected outputs or behaviors, and even complete traces.
Your instructions must include at least one template variable, but you don't need to use all of them.
- {{ inputs }}: Input data provided to the agent
- {{ outputs }}: Output data generated by your agent
- {{ expectations }}: Ground truths or expected outcomes
- {{ trace }}: The complete execution trace of your agent
These are the only supported variables. Custom variables such as {{ question }} raise validation errors, which ensures consistent behavior and prevents template injection issues.
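For example, a correctness judge can reference the agent's inputs, outputs, and expectations. The following sketch is illustrative: the judge name, instruction wording, and model URI are placeholder choices, and the direct call at the end assumes the judge object accepts these keyword arguments and returns a feedback object exposing value and rationale fields.

from mlflow.genai.judges import make_judge

# Field-based judge that compares the agent's answer against expectations
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Compare the {{ outputs }} produced for {{ inputs }} against the "
        "{{ expectations }}.\n"
        "Rate as: 'correct' or 'incorrect', and explain your reasoning."
    ),
    model="databricks:/databricks-gpt-5-mini",
)

# Invoke the judge on a single example (assumed call signature)
feedback = correctness_judge(
    inputs={"question": "What is MLflow?"},
    outputs="MLflow is an open source platform for the ML lifecycle.",
    expectations={"expected_answer": "MLflow is an open source MLOps platform."},
)
print(feedback.value, feedback.rationale)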
Trace-based judges
Trace-based judges analyze execution traces to understand what happened during agent execution. They autonomously explore traces using Model Context Protocol (MCP) tools and can:
- Validate tool usage patterns
- Identify performance bottlenecks
- Investigate execution failures
- Verify multi-step workflows
The following example defines a judge that assesses tool calling correctness by analyzing traces:
from mlflow.genai.judges import make_judge

# Agent judge for tool calling correctness
tool_usage_judge = make_judge(
    name="tool_usage_validator",
    instructions=(
        "Analyze the {{ trace }} to verify correct tool usage.\n\n"
        "Check that the agent selected appropriate tools for the user's request "
        "and called them with correct parameters.\n"
        "Rate as: 'correct' or 'incorrect'"
    ),
    model="databricks:/databricks-gpt-5-mini",  # Required for trace-based judges
)
For trace-based judges to analyze the full trace, the model argument must be specified in make_judge().
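The following sketch shows one way such a judge might be run against previously logged traces. It assumes traces already exist in the current experiment and that judges created with make_judge() can be passed as scorers to mlflow.genai.evaluate(); adjust the search arguments to your own setup.

import mlflow

# Retrieve recently logged traces from the current experiment
traces = mlflow.search_traces(max_results=20)

# Score each trace with the trace-based judge; results are logged as an evaluation run
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[tool_usage_judge],
)
print(results.metrics)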
For a complete tutorial, see Create a custom judge using make_judge().
Model requirements for trace-based judges
Trace-based judges require a model capable of autonomous trace analysis.
Recommended models:
- databricks:/databricks-gpt-5-mini
- databricks:/databricks-gpt-5
- databricks:/databricks-gpt-oss-120b
- databricks:/databricks-claude-opus-4-1
Best practices for writing judge instructions
Be specific about expected output format. Your instructions should clearly specify what format the judge should return (an example follows this list):
- Categorical responses: List specific values (for example, 'fully_resolved', 'partially_resolved', 'needs_follow_up')
- Boolean responses: Explicitly state that the judge should return true or false
- Numeric scores: Specify the scoring range and what each score means
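For example, a categorical judge might enumerate the allowed labels directly in its instructions. The judge name and wording below are illustrative assumptions:

from mlflow.genai.judges import make_judge

# Categorical judge whose instructions spell out the allowed labels
resolution_judge = make_judge(
    name="resolution_status",
    instructions=(
        "Review the user's request in {{ inputs }} and the agent's final {{ outputs }}.\n"
        "Determine how completely the request was handled.\n"
        "Respond with exactly one of: 'fully_resolved', 'partially_resolved', "
        "or 'needs_follow_up'."
    ),
    model="databricks:/databricks-gpt-5-mini",
)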
Break down complex evaluations. For complex evaluation tasks, structure your instructions into clear sections (a sketch follows this list):
- What to evaluate
- What information to examine
- How to make the judgment
- What format to return
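As an illustrative sketch, the instructions below follow that structure; the judge name, criteria, and 1-5 scale are assumptions chosen for the example:

from mlflow.genai.judges import make_judge

# Judge whose instructions are organized into the four sections above
groundedness_judge = make_judge(
    name="groundedness",
    instructions=(
        "What to evaluate: whether the {{ outputs }} are grounded in the context "
        "provided in {{ inputs }}.\n\n"
        "What to examine: every factual claim in the response, compared against "
        "the context passages the agent received.\n\n"
        "How to judge: a claim is grounded only if the context directly supports "
        "it; ignore style and formatting.\n\n"
        "Output format: return an integer from 1 (mostly unsupported) to 5 "
        "(fully supported), followed by a brief rationale."
    ),
    model="databricks:/databricks-gpt-5-mini",
)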
Align judges with human experts
The base judge is a starting point. As you gather expert feedback on your application's outputs, you can align the judge with that feedback to further improve its accuracy. See Align judges with humans.
Next steps
See Create a custom judge for a hands-on tutorial that demonstrates both standard and trace-based judges.