Built-in LLM Judges

Overview

MLflow provides research-backed LLM judges for common quality checks. These judges are Scorers that leverage Large Language Models to assess your application's outputs against quality criteria like safety, relevance, and correctness.

important

LLM Judges are a type of MLflow Scorer that uses Large Language Models for evaluation. They can be used directly with the Evaluation Harness and production monitoring service.

| Judge | Arguments | Requires ground truth | What it evaluates |
| --- | --- | --- | --- |
| RelevanceToQuery | inputs, outputs | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | inputs, outputs | No | Is the retrieved context directly relevant to the user's request? |
| Safety | inputs, outputs | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | inputs, outputs | No | Is the response grounded in the information provided in the context (e.g., the app is not hallucinating)? |
| Guidelines | inputs, outputs | No | Does the response meet specified natural language criteria? |
| ExpectationsGuidelines | inputs, outputs, expectations | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| Correctness | inputs, outputs, expectations | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | inputs, outputs, expectations | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
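To make the argument requirements concrete, here is a minimal sketch that calls one judge that needs no ground truth (Safety) and one that does (Correctness). The exact keys used inside inputs, outputs, and expectations are illustrative assumptions based on the examples later on this page.

Python
from mlflow.genai.scorers import Correctness, Safety

# No ground truth required: Safety only inspects the request and the response.
safety_feedback = Safety()(
    inputs={"request": "Tell me about Paris."},
    outputs={"response": "Paris is the capital of France and home to the Louvre."},
)

# Ground truth required: Correctness compares the response against the expected facts.
correctness_feedback = Correctness()(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris"},
    expectations={"expected_facts": ["Paris is the capital of France."]},
)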

Prerequisites for running the examples

  1. Install MLflow and required packages

    Bash
    pip install --upgrade "mlflow[databricks]>=3.1.0"
  2. Create an MLflow experiment by following the set up your environment quickstart; a minimal setup sketch is shown below.
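If you are running outside a Databricks notebook, pointing MLflow at your workspace and experiment typically looks like the following sketch; the tracking URI value and the experiment path are placeholders you should replace with your own.

Python
import mlflow

# Assumes Databricks credentials are already configured (e.g., via the Databricks CLI).
mlflow.set_tracking_uri("databricks")

# Placeholder experiment path; MLflow creates the experiment if it does not exist.
mlflow.set_experiment("/Shared/llm-judge-examples")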

How to use prebuilt judges

1. Directly via the SDK

You can use judges directly in your evaluation workflow. Below is an example using the RetrievalGroundedness judge:

Python
from mlflow.genai.scorers import RetrievalGroundedness

groundedness_judge = RetrievalGroundedness()

# Grounded: the response is supported by the supplied context.
feedback = groundedness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris", "context": "Paris is the capital of France."},
)

# Not grounded: the context never states that Paris is the capital.
feedback = groundedness_judge(
    inputs={"request": "What is the capital of France?"},
    outputs={"response": "Paris", "context": "Paris is known for its Eiffel Tower."},
)

2. Usage with mlflow.genai.evaluate()

You can use judges directly with MLflow's evaluation framework.

Python
import mlflow
from mlflow.genai.scorers import Correctness

eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, a stunning metropolis known worldwide for its iconic Eiffel Tower, rich cultural heritage, beautiful architecture, world-class museums like the Louvre, and its status as one of Europe's most important political and economic centers. As the capital city, Paris serves as the seat of France's government and is home to numerous important national institutions."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."],
        },
    },
]

# Pass scorers as instances; Correctness compares the response against expected_facts.
eval_results = mlflow.genai.evaluate(data=eval_dataset, scorers=[Correctness()])
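Multiple judges can be combined in a single evaluation. The sketch below reuses the dataset above and adds RelevanceToQuery, Safety, and a Guidelines judge with a custom criterion; the judge name and guideline text are illustrative assumptions.

Python
from mlflow.genai.scorers import Correctness, Guidelines, RelevanceToQuery, Safety

# A hypothetical natural-language guideline; phrase the criterion as plain instructions.
conciseness = Guidelines(
    name="conciseness",
    guidelines="The response must answer the question in three sentences or fewer.",
)

eval_results = mlflow.genai.evaluate(
    data=eval_dataset,  # the dataset defined above
    scorers=[Correctness(), RelevanceToQuery(), Safety(), conciseness],
)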

Next Steps