Scorers and LLM judges

Scorers are a key component of the MLflow GenAI evaluation framework. They provide a unified interface for defining evaluation criteria for your models, agents, and applications. As the name suggests, scorers score how well your application performed against those criteria. A score can be a pass/fail, true/false, numerical, or categorical value.

You can use the same scorer for evaluation in development and monitoring in production to keep evaluation consistent throughout the application lifecycle.

Choose the right type of scorer depending on how much customization and control you need. Each approach builds on the previous one, adding more complexity and control.

Start with built-in judges for quick evaluation. As your needs evolve, build custom LLM judges for domain-specific criteria and create custom code-based scorers for deterministic business logic.

| Approach | Level of customization | Use cases |
|---|---|---|
| Built-in judges | Minimal | Quickly try LLM evaluation with built-in scorers such as Correctness and RetrievalGroundedness. |
| Guidelines judges | Moderate | A built-in judge that checks whether responses pass or fail custom natural-language rules, such as style or factuality guidelines. |
| Custom judges | Full | Create fully customized LLM judges with detailed evaluation criteria and feedback optimization. Capable of returning numerical scores, categories, or boolean values. |
| Code-based scorers | Full | Programmatic and deterministic scorers that evaluate things like exact matching, format validation, and performance metrics. |

The following screenshot shows the results from the built-in LLM judge Safety and a custom scorer exact_match:

Example metrics from scorers

How scorers work

A scorer receives a Trace from either mlflow.genai.evaluate() or the monitoring service. It then does the following:

  1. Parses the trace to extract specific fields and data that are used to assess quality
  2. Runs the scorer to perform the quality assessment based on the extracted fields and data
  3. Returns the quality assessment as Feedback to attach to the trace, as in the sketch below
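
For example, a code-based scorer that follows these three steps might look like the following sketch. The scorer name (has_tool_calls) and the pass criterion are illustrative only; the sketch assumes the @scorer decorator, the Feedback entity, and Trace.search_spans() available in recent MLflow versions.

Python
from mlflow.entities import Feedback, SpanType
from mlflow.genai.scorers import scorer


@scorer
def has_tool_calls(trace) -> Feedback:
    # 1. Parse the trace to extract the data needed for the assessment
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)

    # 2. Perform the quality assessment based on the extracted data
    passed = len(tool_spans) > 0

    # 3. Return the assessment as Feedback, which is attached to the trace
    return Feedback(
        value=passed,
        rationale=f"Found {len(tool_spans)} tool-call span(s) in the trace.",
    )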

LLMs as judges

LLM judges are a type of MLflow Scorer that uses Large Language Models for quality assessment.

Think of a judge as an AI assistant specialized in quality assessment. It can evaluate your app's inputs, outputs, and even explore the entire execution trace to make assessments based on criteria you define. For example, a judge can understand that "give me healthy food options" and "food to keep me fit" are similar queries.

note

Judges are a type of scorer that uses LLMs for evaluation. Use them directly with mlflow.genai.evaluate() or wrap them in custom scorers for advanced scoring logic.

Built-in LLM judges

MLflow provides research-validated judges for common use cases:

| Judge | Arguments | Requires ground truth | What it evaluates |
|---|---|---|---|
| RelevanceToQuery | inputs, outputs | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | inputs, outputs | No | Is the retrieved context directly relevant to the user's request? |
| Safety | inputs, outputs | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | inputs, outputs | No | Is the response grounded in the information provided in the context? Is the agent hallucinating? |
| Correctness | inputs, outputs, expectations | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | inputs, outputs, expectations | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
| Guidelines | inputs, outputs | No | Does the response meet specified natural-language criteria? |
| ExpectationsGuidelines | inputs, outputs, expectations | No (but needs guidelines in expectations) | Does the response meet per-example natural-language criteria? |
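
Built-in judges can be passed directly to mlflow.genai.evaluate(). The following is a minimal sketch; eval_dataset and my_app are placeholders for your own evaluation data and application.

Python
import mlflow
from mlflow.genai.scorers import Correctness, Guidelines, RelevanceToQuery, Safety


def my_app(question: str) -> str:
    # Placeholder application; replace with a call to your model, agent, or app
    return "MLflow Tracing provides observability for GenAI applications."


# Placeholder dataset; Correctness requires ground truth in `expectations`
eval_dataset = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "expectations": {
            "expected_facts": ["MLflow Tracing adds observability to GenAI applications."]
        },
    },
]

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[
        RelevanceToQuery(),
        Safety(),
        Correctness(),
        Guidelines(name="tone", guidelines="The response must be professional and concise."),
    ],
)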

Custom LLM judges

In addition to the built-in judges, MLflow makes it easy to create your own judges with custom prompts and instructions.

Use custom LLM judges when you need to define specialized evaluation tasks, need more control over grades or scores (not just pass/fail), or need to validate that your agent made appropriate decisions and performed operations correctly for your specific use case.

See Custom judges.

Once you've created custom judges, you can further improve their accuracy by aligning them with human feedback.
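
As an illustrative sketch only (it assumes the mlflow.genai.judges.make_judge API available in recent MLflow versions), a custom judge can be defined from natural-language instructions that reference template variables such as {{ inputs }} and {{ outputs }}:

Python
from mlflow.genai.judges import make_judge

# Hypothetical judge that grades tone into one of three categories
tone_judge = make_judge(
    name="tone",
    instructions=(
        "Rate the tone of {{ outputs }} for the request {{ inputs }} as one of: "
        "'professional', 'neutral', or 'inappropriate'."
    ),
)

The resulting judge can be passed to mlflow.genai.evaluate() in the scorers list like any other scorer.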

Select the LLM that powers the judge

By default, each judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the model argument in the judge definition. Specify the model in the format <provider>:/<model-name>. For example:

Python
from mlflow.genai.scorers import Correctness

# Override the default judge model with a specific Databricks-hosted model
correctness_judge = Correctness(model="databricks:/databricks-gpt-5-mini")

For a list of supported models, see the MLflow documentation.

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring, so no prompts or responses are stored with Azure OpenAI.
  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
  • Disabling Partner-powered AI features prevents the LLM judge from calling partner-powered models. You can still use LLM judges by providing your own model.
  • LLM judges are intended to help customers evaluate their GenAI agents/applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.

Judge accuracy

Databricks continuously improves judge quality through:

  • Research validation against human expert judgment
  • Metrics tracking: Cohen's Kappa, accuracy, F1 score
  • Diverse testing on academic and real-world datasets

See Databricks blog on LLM judge improvements for details.

Code-based scorers

Custom code-based scorers offer the ultimate flexibility to define precisely how your GenAI application's quality is measured. You can define evaluation metrics tailored to your specific business use case, whether based on simple heuristics, advanced logic, or programmatic evaluations.

Use custom scorers for the following scenarios:

  1. Defining a custom heuristic or code-based evaluation metric.
  2. Customizing how the data from your app's trace is mapped to built-in LLM judges.
  3. Using your own LLM (rather than a Databricks-hosted LLM judge) for evaluation.
  4. Any other use cases where you need more flexibility and control than provided by custom LLM judges.

See Create custom code-based scorers.
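
For example, a deterministic exact-match scorer (like the exact_match scorer mentioned near the top of this page) takes only a few lines. This is a sketch; the expected_response key is a placeholder for whatever field your dataset's expectations use.

Python
from mlflow.genai.scorers import scorer


@scorer
def exact_match(outputs, expectations) -> bool:
    # Deterministic check: pass only if the output matches the expected response exactly
    return outputs == expectations["expected_response"]

Pass the resulting scorer to mlflow.genai.evaluate() in the scorers list, alongside any LLM judges.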