Correctness judge & scorer
The `judges.is_correct()` predefined judge assesses whether your GenAI application's response is factually correct by comparing it against provided ground truth information (`expected_facts` or `expected_response`).

This judge is available through the predefined `Correctness` scorer for evaluating application responses against known correct answers.
API Signature
For details, see `mlflow.genai.judges.is_correct()`.
```python
from mlflow.genai.judges import is_correct

def is_correct(
    *,
    request: str,                                  # User's question or query
    response: str,                                 # Application's response to evaluate
    expected_facts: Optional[list[str]] = None,    # List of expected facts (provide either expected_response or expected_facts)
    expected_response: Optional[str] = None,       # Ground truth response (provide either expected_response or expected_facts)
    name: Optional[str] = None,                    # Optional custom name for display in the MLflow UIs
    model: Optional[str] = None,                   # Optional LiteLLM-compatible custom judge model
) -> mlflow.entities.Feedback:
    """Returns Feedback with a 'yes' or 'no' value and a rationale"""
```
By default, this judge uses a specially tuned, Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the `model` argument in the scorer definition. The model must be specified in the format `<provider>:/<model-name>`, where the provider is a LiteLLM-compatible model provider. If you use `databricks` as the model provider, the model name is the same as the serving endpoint name.
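For example, here is a sketch of pointing the judge at a model from another LiteLLM-compatible provider. The `openai:/gpt-4o-mini` model name is illustrative and assumes OpenAI credentials are already configured in your environment:

```python
from mlflow.genai.judges import is_correct

# Illustrative: any LiteLLM-compatible provider can serve as the judge model.
feedback = is_correct(
    request="What is MLflow?",
    response="MLflow is an open-source platform for managing the ML lifecycle.",
    expected_facts=["MLflow is open-source"],
    model="openai:/gpt-4o-mini",  # assumes OPENAI_API_KEY is set
)
```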
Prerequisites for running the examples

- Install MLflow and required packages:

  ```bash
  pip install --upgrade "mlflow[databricks]>=3.4.0"
  ```

- Create an MLflow experiment by following the setup your environment quickstart.
Direct SDK Usage
```python
from mlflow.genai.judges import is_correct

# Example 1: Response contains expected facts
feedback = is_correct(
    request="What is MLflow?",
    response="MLflow is an open-source platform for managing the ML lifecycle.",
    expected_facts=[
        "MLflow is open-source",
        "MLflow is a platform for ML lifecycle",
    ],
)

print(feedback.value)      # "yes"
print(feedback.rationale)  # Explanation of correctness

# Example 2: Response missing or contradicting facts
feedback = is_correct(
    request="When was MLflow released?",
    response="MLflow was released in 2017.",
    expected_facts=["MLflow was released in June 2018"],
)

print(feedback.value)      # "no"
print(feedback.rationale)  # Explanation of what's incorrect

# Example 3: Custom judge model
feedback = is_correct(
    request="When was MLflow released?",
    response="MLflow was released in 2017.",
    expected_facts=["MLflow was released in June 2018"],
    model="databricks:/databricks-gpt-oss-120b",
)
```
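Because the judge returns a plain `Feedback` object, you can batch direct SDK calls yourself. The following is a minimal sketch that computes a pass rate over a hand-rolled test set; the test cases are illustrative:

```python
from mlflow.genai.judges import is_correct

# Illustrative test cases: (request, response, expected facts)
test_cases = [
    (
        "What is MLflow?",
        "MLflow is an open-source platform for managing the ML lifecycle.",
        ["MLflow is open-source"],
    ),
    (
        "When was MLflow released?",
        "MLflow was released in 2017.",
        ["MLflow was released in June 2018"],
    ),
]

feedbacks = [
    is_correct(request=req, response=resp, expected_facts=facts)
    for req, resp, facts in test_cases
]

# Feedback.value is "yes" or "no", so a simple pass rate is easy to compute.
pass_rate = sum(f.value == "yes" for f in feedbacks) / len(feedbacks)
print(f"Pass rate: {pass_rate:.0%}")
for f in feedbacks:
    print(f.value, "-", f.rationale)
```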
Using the prebuilt scorer
The `is_correct` judge is available through the `Correctness` prebuilt scorer.
Requirements:

- Trace requirements: `inputs` and `outputs` must be on the Trace's root span
- Ground-truth labels: Required - must provide either `expected_facts` or `expected_response` in the `expectations` dictionary
```python
import mlflow
from mlflow.genai.scorers import Correctness

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        },
    },
    {
        "inputs": {"query": "What are the main components of MLflow?"},
        "outputs": {
            "response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry",
            ]
        },
    },
    {
        "inputs": {"query": "When was MLflow released?"},
        "outputs": {
            "response": "MLflow was released in 2017 by Databricks."
        },
        "expectations": {
            "expected_facts": ["MLflow was released in June 2018"]
        },
    },
]

# Run evaluation with the Correctness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Correctness(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to the Databricks-hosted judge model.
        )
    ],
)
```
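The example above scores pre-computed outputs. To satisfy the trace requirements when evaluating a live application, `mlflow.genai.evaluate` can also call and trace your app through its `predict_fn` argument, so `inputs` and `outputs` are recorded on the root span for you. A minimal sketch, assuming a hypothetical `my_app` function:

```python
import mlflow
from mlflow.genai.scorers import Correctness

# Hypothetical application; evaluate calls it with the keys of each item's
# "inputs" dict as keyword arguments and traces the call.
def my_app(query: str) -> str:
    return f"Answer to: {query}"  # placeholder for your real application logic

eval_results = mlflow.genai.evaluate(
    data=[
        {
            "inputs": {"query": "What is the capital of France?"},
            "expectations": {"expected_facts": ["Paris is the capital of France."]},
        }
    ],
    predict_fn=my_app,
    scorers=[Correctness()],
)
```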
Alternative: Using expected_response

You can also use `expected_response` instead of `expected_facts`:
```python
eval_dataset_with_response = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {
            "response": "MLflow is an open-source platform for managing the ML lifecycle."
        },
        "expectations": {
            "expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
        },
    }
]

# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
    data=eval_dataset_with_response,
    scorers=[Correctness()],
)
```
Using `expected_facts` is recommended over `expected_response` because it allows for more flexible evaluation: the response doesn't need to match word-for-word, it just needs to contain the key facts.
Using in a custom scorer

When evaluating applications whose data structures differ from the requirements of the predefined scorer, wrap the judge in a custom scorer:
```python
import mlflow
from mlflow.genai.judges import is_correct
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"question": "What are the main components of MLflow?"},
        "outputs": {
            "answer": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry",
            ]
        },
    },
    {
        "inputs": {"question": "What is MLflow used for?"},
        "outputs": {
            "answer": "MLflow is used for building websites."
        },
        "expectations": {
            "facts": [
                "MLflow is used for managing ML lifecycle",
                "MLflow helps with experiment tracking",
            ]
        },
    },
]

@scorer
def correctness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    return is_correct(
        request=inputs["question"],
        response=outputs["answer"],
        expected_facts=expectations["facts"],
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[correctness_scorer],
)
```
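The expectation keys inside a custom scorer are whatever your dataset uses. As an illustrative variation (the `question`, `answer`, `facts`, and `reference_answer` keys below are hypothetical and mirror the sketch above), a custom scorer can fall back to a full reference answer when itemized facts are not available:

```python
from mlflow.genai.judges import is_correct
from mlflow.genai.scorers import scorer
from typing import Dict, Any

@scorer
def flexible_correctness(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    # Prefer itemized facts; otherwise fall back to a full reference answer.
    if expectations.get("facts"):
        return is_correct(
            request=inputs["question"],
            response=outputs["answer"],
            expected_facts=expectations["facts"],
        )
    return is_correct(
        request=inputs["question"],
        response=outputs["answer"],
        expected_response=expectations["reference_answer"],
    )
```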
Interpreting Results

The judge returns a `Feedback` object with:

- `value`: "yes" if the response is correct, "no" if incorrect
- `rationale`: Detailed explanation of which facts are supported or missing
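Because `value` is a simple "yes"/"no" string and `rationale` is plain text, the feedback is easy to wire into automated checks. A minimal pytest-style sketch (the test case is illustrative):

```python
from mlflow.genai.judges import is_correct

def test_mlflow_answer_is_correct():
    feedback = is_correct(
        request="What is MLflow?",
        response="MLflow is an open-source platform for managing the ML lifecycle.",
        expected_facts=["MLflow is open-source"],
    )
    # Surface the judge's rationale if the assertion fails.
    assert feedback.value == "yes", feedback.rationale
```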
Next Steps
- Explore other predefined judges - Learn about other built-in quality evaluation judges
- Create custom judges - Build domain-specific evaluation judges
- Run evaluations - Use judges in comprehensive application evaluation