Built-in LLM Judges

Overview

MLflow provides research-backed LLM judges for common quality checks. These judges are Scorers that leverage Large Language Models to assess your application's outputs against quality criteria like safety, relevance, and correctness.

important

LLM Judges are a type of MLflow Scorer that uses Large Language Models for evaluation. They can be used directly with the Evaluation Harness and production monitoring service.

| Judge | Arguments | Requires ground truth | What it evaluates |
| --- | --- | --- | --- |
| RelevanceToQuery | inputs, outputs | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | inputs, outputs | No | Is the retrieved context directly relevant to the user's request? |
| Safety | inputs, outputs | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | inputs, outputs | No | Is the response grounded in the information provided in the context (e.g., the app is not hallucinating)? |
| Guidelines | inputs, outputs | No | Does the response meet specified natural language criteria? |
| ExpectationsGuidelines | inputs, outputs, expectations | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| Correctness | inputs, outputs, expectations | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | inputs, outputs, expectations | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
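To make the argument requirements concrete, here is a minimal sketch that calls one judge that needs no ground truth (Safety) and one that does (Correctness). The exact keys used inside inputs, outputs, and expectations are illustrative assumptions based on the examples later on this page.

Python
from mlflow.genai.scorers import Correctness, Safety

# No ground truth required: Safety only inspects the request and the response.
safety_feedback = Safety()(
    inputs={"request": "Tell me about Paris."},
    outputs={"response": "Paris is the capital of France and home to the Louvre."},
)

# Ground truth required: Correctness compares the response against the expected facts.
correctness_feedback = Correctness()(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris"},
    expectations={"expected_facts": ["Paris is the capital of France."]},
)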

Prerequisites for running the examples

  1. Install MLflow and required packages

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0"
  2. Create an MLflow experiment by following the set up your environment quickstart; a minimal setup sketch is shown below.
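If you are running outside a Databricks notebook, pointing MLflow at your workspace and experiment typically looks like the following sketch; the tracking URI value and the experiment path are placeholders you should replace with your own.

Python
import mlflow

# Assumes Databricks credentials are already configured (e.g., via the Databricks CLI).
mlflow.set_tracking_uri("databricks")

# Placeholder experiment path; MLflow creates the experiment if it does not exist.
mlflow.set_experiment("/Shared/llm-judge-examples")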

How to use prebuilt judges

1. Directly via the SDK

You can use judges directly in your evaluation workflow. Below is an example using the RetrievalGroundedness judge:

Python
from mlflow.genai.scorers import RetrievalGroundedness

groundedness_judge = RetrievalGroundedness()

# Grounded: the response is supported by the supplied context.
feedback = groundedness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris", "context": "Paris is the capital of France."},
)

# Not grounded: the context never states that Paris is the capital.
feedback = groundedness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris", "context": "Paris is known for its Eiffel Tower."},
)

2. Usage with mlflow.genai.evaluate()

You can use judges directly with MLflow's evaluation framework.

Python
import mlflow
from mlflow.genai.scorers import Correctness

eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, a stunning metropolis known worldwide for its iconic Eiffel Tower, rich cultural heritage, beautiful architecture, world-class museums like the Louvre, and its status as one of Europe's most important political and economic centers. As the capital city, Paris serves as the seat of France's government and is home to numerous important national institutions."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."],
        },
    },
]

# Pass scorers as instances; Correctness compares the response against expected_facts.
eval_results = mlflow.genai.evaluate(data=eval_dataset, scorers=[Correctness()])
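Multiple judges can be combined in a single evaluation. The sketch below reuses the dataset above and adds RelevanceToQuery, Safety, and a Guidelines judge with a custom criterion; the judge name and guideline text are illustrative assumptions.

Python
from mlflow.genai.scorers import Correctness, Guidelines, RelevanceToQuery, Safety

# A hypothetical natural-language guideline; phrase the criterion as plain instructions.
conciseness = Guidelines(
    name="conciseness",
    guidelines="The response must answer the question in three sentences or fewer.",
)

eval_results = mlflow.genai.evaluate(
    data=eval_dataset,  # the dataset defined above
    scorers=[Correctness(), RelevanceToQuery(), Safety(), conciseness],
)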

Next Steps