
LLM-based scorers

Overview

Judges are MLflow's SDK/API building blocks for LLM-based quality assessment. Each judge uses a specially tuned, Databricks-hosted LLM designed to perform GenAI quality assessments.

Think of a judge as an AI assistant specialized in quality assessment - it reads your app's outputs and makes assessments based on criteria you define. For example, a judge can recognize that "give me healthy food options" and "food to keep me fit" are very similar queries.

important

While judges can be used as standalone APIs, they must be wrapped in Scorers for use by the Evaluation Harness and the production monitoring service.
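
The sketch below contrasts the two usage modes, using the predefined is_safe judge (described later on this page) and the scorer decorator from mlflow.genai.scorers. Treat the argument names as assumptions; exact signatures can vary by MLflow version.

Python
from mlflow.genai.judges import is_safe
from mlflow.genai.scorers import scorer

# Standalone use: call the judge directly and inspect the Feedback it returns.
feedback = is_safe(content="Please restart the router and try again.")
print(feedback.value, feedback.rationale)

# Wrapped use: the scorer decorator makes the judge callable by the
# Evaluation Harness and the production monitoring service.
@scorer
def safety(outputs):
    return is_safe(content=str(outputs))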

When to use judges

Use judges when you need to evaluate plain language inputs or outputs:

  • Semantic correctness: "Does this answer the question correctly?"
  • Style and tone: "Is this appropriate for our brand voice?"
  • Safety and compliance: "Does this follow our content guidelines?"
  • Relative quality: "Which response is more helpful?"

Use custom, code-based scorers instead for (see the sketch after this list):

  • Exact matching: Checking for specific keywords
  • Format validation: JSON structure, length limits
  • Performance metrics: Latency, token usage
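
For contrast, here is a minimal sketch of a code-based scorer that performs format validation with no LLM call, assuming the scorer decorator from mlflow.genai.scorers:

Python
import json

from mlflow.genai.scorers import scorer

# Deterministic format check: no judge needed to validate JSON structure.
@scorer
def is_valid_json(outputs) -> bool:
    try:
        json.loads(str(outputs))
        return True
    except json.JSONDecodeError:
        return False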

Deeper dive into judges

For detailed information about specific judges, see the sections below.

Predefined judges

MLflow provides research-validated judges for common use cases:

Python
from mlflow.genai.judges import (
    is_safe,                # Content safety
    is_relevant,            # Query relevance
    is_grounded,            # RAG grounding
    is_correct,             # Factual accuracy
    is_context_sufficient,  # Retrieval quality
)

See predefined judges reference for detailed documentation.
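
As a rough illustration of how a predefined judge is called directly (the parameter names shown here are assumptions; consult the predefined judges reference for exact signatures):

Python
from mlflow.genai.judges import is_grounded

# Check that the response is supported by the retrieved context (RAG grounding).
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris is the capital of France.",
    context=["Paris is the capital and most populous city of France."],
)
print(feedback.value)      # e.g. "yes" or "no"
print(feedback.rationale)  # the judge's explanation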

Custom judges

Build domain-specific judges using two approaches:

  1. Guidelines-based (recommended starting point) - Natural language pass/fail criteria that are easy to explain to stakeholders. Best for compliance checks, style guides, or information inclusion/exclusion (see the sketch after this list).

  2. Prompt-based - Full prompt customization for complex evaluations. Use when you need multiple output values (e.g., "great", "ok", "bad") or criteria that can't be expressed as pass/fail guidelines.
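
As a sketch of the guidelines-based approach, assuming the meets_guidelines judge from mlflow.genai.judges (argument names here are assumptions; see the custom judge guides for exact usage):

Python
from mlflow.genai.judges import meets_guidelines

# Natural-language pass/fail criterion, evaluated by the judge.
feedback = meets_guidelines(
    guidelines="The response must not mention competitor products by name.",
    context={
        "request": "Which plan should I choose for a 10-person team?",
        "response": "Our Team plan covers groups of up to 25 people.",
    },
)
print(feedback.value, feedback.rationale)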

Judge accuracy

Databricks continuously improves judge quality through:

  • Research validation against human expert judgment
  • Metrics tracking: Cohen's Kappa, accuracy, F1 score
  • Diverse testing on academic and real-world datasets

See Databricks blog on LLM judge improvements for details.

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring, so no prompts or responses are stored with Azure OpenAI.
  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
  • Disabling partner-powered AI assistive features prevents the LLM judge from calling partner-powered models.
  • LLM judges are intended to help customers evaluate their GenAI agents/applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.

Next steps

How-to guides

Concepts