Built-in LLM Judges
Overview
MLflow provides research-backed LLM judges for common quality checks. These judges are a type of MLflow Scorer that uses a Large Language Model to assess your application's outputs against quality criteria such as safety, relevance, and correctness. They can be used directly with the Evaluation Harness and the production monitoring service.
| Judge | Requires ground truth | What it evaluates |
|---|---|---|
| RelevanceToQuery | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | No | Is the retrieved context directly relevant to the user's request? |
| Safety | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | No | Is the response grounded in the information provided in the context (i.e., the app is not hallucinating)? |
| Guidelines | No | Does the response meet specified natural language criteria? |
| ExpectationsGuidelines | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| Correctness | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | Yes | Does the retrieved context provide all necessary information to generate a response that includes the ground truth facts? |
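To make the ground-truth distinction concrete, the sketch below instantiates one judge from each group. It is a minimal illustration, assuming the built-in scorers can be called directly with inputs/outputs (and, for Correctness, expectations) keyword arguments in the same style as the SDK example later on this page.

```python
from mlflow.genai.scorers import Correctness, Safety

# Safety does not need ground truth: it only inspects the request and response.
safety_judge = Safety()
feedback = safety_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris is the capital of France."},
)

# Correctness needs ground truth, supplied via `expectations`.
correctness_judge = Correctness()
feedback = correctness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris is the capital of France."},
    expectations={"expected_facts": ["Paris is the capital of France."]},
)
```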
Prerequisites for running the examples
- Install MLflow and the required packages: pip install --upgrade "mlflow[databricks]>=3.1.0"
- Create an MLflow experiment by following the setup your environment quickstart; a minimal setup sketch follows this list.
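For reference, here is a minimal sketch of that setup, assuming a Databricks workspace; the experiment path below is a placeholder, not a required name.

```python
import mlflow

# Point MLflow at your Databricks workspace (uses your configured credentials).
mlflow.set_tracking_uri("databricks")

# Create (or reuse) the experiment that evaluation runs will be logged to.
# The path is illustrative; use any experiment path you like.
mlflow.set_experiment("/Shared/llm-judge-examples")
```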
How to use prebuilt judges
1. Directly via the SDK
You can use judges directly in your evaluation workflow. Below is an example using the RetrievalGroundedness judge:
from mlflow.genai.scorers import RetrievalGroundedness

groundedness_judge = RetrievalGroundedness()

# The response is supported by the provided context, so the judge should
# assess it as grounded.
feedback = groundedness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris", "context": "Paris is the capital of France."},
)

# The context never states that Paris is the capital, so the judge should
# flag the response as not grounded in the retrieved context.
feedback = groundedness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris", "context": "Paris is known for its Eiffel Tower."},
)
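Each call returns an MLflow Feedback object. A quick way to inspect it, assuming the value and rationale fields exposed by mlflow.entities.Feedback:

```python
# The judge's verdict and its natural-language justification.
print(feedback.value)
print(feedback.rationale)
```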
2. Usage with mlflow.genai.evaluate()
You can also pass judges directly to MLflow's evaluation framework, mlflow.genai.evaluate():
import mlflow
from mlflow.genai.scorers import Correctness

eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, a stunning metropolis known worldwide for its iconic Eiffel Tower, rich cultural heritage, beautiful architecture, world-class museums like the Louvre, and its status as one of Europe's most important political and economic centers. As the capital city, Paris serves as the seat of France's government and is home to numerous important national institutions."
        },
        # Ground truth that the Correctness judge compares the response against.
        "expectations": {
            "expected_facts": ["Paris is the capital of France."],
        },
    },
]

# Scorers are passed as instances, not classes.
eval_results = mlflow.genai.evaluate(data=eval_dataset, scorers=[Correctness()])
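Multiple judges can be applied in a single run, and the harness can call your application for you. The sketch below assumes mlflow.genai.evaluate accepts a predict_fn callable that is invoked with each example's inputs; toy_app is a purely illustrative stand-in for your real app entry point.

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# Hypothetical application under test; replace with your real app entry point.
def toy_app(query: str) -> str:
    return "Paris is the capital of France."

eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {"expected_facts": ["Paris is the capital of France."]},
    },
]

# The harness calls predict_fn on each example, then applies every scorer
# to the resulting inputs, outputs, and expectations.
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=toy_app,
    scorers=[Correctness(), RelevanceToQuery(), Safety()],
)
```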
Next Steps
- Use built-in LLM judges in evaluation - Get started with built-in LLM judges
- Create custom LLM judges - Build judges tailored to your specific needs
- Run evaluations - Apply judges to systematically assess your application