LLM-based scorers
Overview
Judges are MLflow's SDK/API building blocks for LLM-based quality assessment. Each judge uses a specially tuned, Databricks-hosted LLM designed to perform GenAI quality assessments.
Think of a judge as an AI assistant specialized in quality assessment: it reads your app's inputs and outputs and makes assessments based on criteria you define. For example, a judge can recognize that "give me healthy food options" and "food to keep me fit" are very similar queries.
While judges can be called as standalone APIs, they must be wrapped in Scorers before the Evaluation Harness and the production monitoring service can use them.
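For example, a predefined judge can be wrapped in a custom scorer and passed to the Evaluation Harness through `mlflow.genai.evaluate`. The sketch below is a minimal illustration: the `@scorer` decorator and the `is_safe` judge come from `mlflow.genai`, while the sample data and field names are assumptions to adapt to your app.

```python
import mlflow
from mlflow.genai.judges import is_safe
from mlflow.genai.scorers import scorer


# Wrap the predefined is_safe judge in a scorer so the
# Evaluation Harness can invoke it on each row.
@scorer
def safety(outputs):
    # The judge returns an MLflow Feedback; returning it directly
    # lets MLflow record the value and rationale for the row.
    return is_safe(content=str(outputs))


# Hypothetical evaluation data; replace with your app's records or traces.
eval_data = [
    {
        "inputs": {"query": "give me healthy food options"},
        "outputs": "Try salads, grilled fish, and fresh fruit.",
    },
]

results = mlflow.genai.evaluate(data=eval_data, scorers=[safety])
```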
When to use judges
Use judges when you need to evaluate plain language inputs or outputs:
- Semantic correctness: "Does this answer the question correctly?"
- Style and tone: "Is this appropriate for our brand voice?"
- Safety and compliance: "Does this follow our content guidelines?"
- Relative quality: "Which response is more helpful?"
Use custom, code-based scorers instead for:
- Exact matching: Checking for specific keywords
- Format validation: JSON structure, length limits
- Performance metrics: Latency, token usage
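In contrast, a deterministic code-based check needs no LLM call at all. A minimal sketch, assuming the `@scorer` decorator; the scorer name and JSON-format rule are illustrative:

```python
import json

from mlflow.genai.scorers import scorer


# Format validation implemented as plain code: no judge model involved.
@scorer
def is_valid_json(outputs):
    # Pass only if the output parses as a JSON object.
    try:
        return isinstance(json.loads(str(outputs)), dict)
    except json.JSONDecodeError:
        return False
```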
Deeper dive into judges
For detailed information about specific judges, see the following sections.
Predefined judges
MLflow provides research-validated judges for common use cases:
```python
from mlflow.genai.judges import (
    is_safe,                # Content safety
    is_relevant,            # Query relevance
    is_grounded,            # RAG grounding
    is_correct,             # Factual accuracy
    is_context_sufficient   # Retrieval quality
)
```
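Each predefined judge can also be called directly as a standalone API and returns an MLflow `Feedback` with a value and rationale. A minimal sketch using `is_grounded`; the argument values are illustrative, and you should verify the exact signatures against the predefined judges reference for your MLflow version:

```python
from mlflow.genai.judges import is_grounded

# Assess whether the response is grounded in the retrieved context.
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris is the capital of France.",
    context="France's capital city is Paris.",
)
print(feedback.value, feedback.rationale)
```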
See predefined judges reference for detailed documentation.
Custom judges
Build domain-specific judges using two approaches:
- Guidelines-based (recommended starting point) - Natural language pass/fail criteria that are easy to explain to stakeholders. Best for compliance checks, style guides, or information inclusion/exclusion.
- Prompt-based - Full prompt customization for complex evaluations. Use when you need multiple output values (e.g., "great", "ok", "bad") or criteria that can't be expressed as pass/fail guidelines.
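As a starting point, a guidelines-based check can usually be expressed in a single sentence. The sketch below assumes the built-in `Guidelines` scorer, which applies natural-language guidelines via a judge; the guideline text and sample data are illustrative:

```python
import mlflow
from mlflow.genai.scorers import Guidelines

# A natural-language pass/fail criterion evaluated by the judge model.
tone = Guidelines(
    name="professional_tone",
    guidelines="The response must be polite, professional, and free of slang.",
)

# Hypothetical data; replace with your app's records or traces.
eval_data = [
    {
        "inputs": {"query": "How do I reset my password?"},
        "outputs": "Go to Settings > Security and select 'Reset password'.",
    },
]

results = mlflow.genai.evaluate(data=eval_data, scorers=[tone])
```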
Judge accuracy
Databricks continuously improves judge quality through:
- Research validation against human expert judgment
- Metrics tracking: Cohen's Kappa, accuracy, F1 score
- Diverse testing on academic and real-world datasets
See Databricks blog on LLM judge improvements for details.
Information about the models powering LLM judges
- LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
- For Azure OpenAI, Databricks has opted out of Abuse Monitoring so no prompts or responses are stored with Azure OpenAI.
- For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
- Disabling partner-powered AI assistive features prevents the LLM judge from calling partner-powered models.
- LLM judges are intended to help customers evaluate their GenAI agents/applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.
Next steps
How-to guides
- Use predefined LLM scorers that wrap built-in judges
- Create guideline-based judges using natural language criteria
- Build custom prompt-based judges for complex evaluation
Concepts
- Predefined judges reference - Detailed documentation of all built-in judges
- Guidelines-based judges - How guideline evaluation works
- Prompt-based judges - Creating custom evaluation prompts
- Scorers - How judges integrate with the evaluation system