Scorers
Scorers evaluate GenAI app quality by analyzing outputs and producing structured feedback. The same scorer can be used for evaluation during development and reused for monitoring in production. Scorers include built-in LLM-as-a-judge scorers, custom LLM-as-a-judge scorers, and custom code-based scorers.
The MLflow UI screenshot below illustrates outputs from the built-in scorer Safety and a custom scorer exact_match:
The code snippet below computes these metrics using mlflow.genai.evaluate() and then registers the same scorers for production monitoring:
Python
import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer
from typing import Any


@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Example of a custom code-based scorer
    return outputs == expectations["expected_response"]


# Evaluation during development.
# eval_dataset (evaluation records) and my_app (the app under test) are
# assumed to be defined elsewhere.
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[Safety(), exact_match],
)

# Production monitoring - same scorers!
registered_scorers = [
    Safety().register(),
    exact_match.register(),
]

# Activate each registered scorer, sampling roughly 10% of production traces.
registered_scorers = [
    reg_scorer.start(
        sampling_config=ScorerSamplingConfig(sample_rate=0.1)
    )
    for reg_scorer in registered_scorers
]
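Custom code-based scorers are not limited to a bare boolean: to surface the structured feedback mentioned above, a scorer can return a Feedback object carrying a value and a rationale, and it can handle missing data explicitly. The following is a minimal sketch, assuming mlflow.entities.Feedback (as in recent MLflow 3 releases) and the expected_response expectation key from the snippet above; treat it as illustrative rather than canonical.
Python
from typing import Any

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer


@scorer
def exact_match_with_rationale(outputs: str, expectations: dict[str, Any]) -> Feedback:
    # Return structured feedback (value + rationale) instead of a bare bool.
    expected = expectations.get("expected_response")
    if expected is None:
        # Surface missing expectations as feedback rather than failing the run.
        return Feedback(value=False, rationale="No expected_response provided for this row.")
    matched = outputs.strip() == expected.strip()
    return Feedback(
        value=matched,
        rationale="Output matches the expected response exactly."
        if matched
        else "Output differs from the expected response.",
    )
A scorer like this can be passed to mlflow.genai.evaluate() or registered for monitoring exactly like exact_match above.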
Next steps
- Use built-in LLM scorers - Start evaluating your app quickly with built-in LLM-as-a-judge scorers
- Create custom LLM scorers - Customize LLM-as-a-judge scorers for your specific application
- Create custom code-based scorers - Build code-based scorers, including possible inputs, outputs, and error handling
- Evaluation harness - Understand how mlflow.genai.evaluate() uses your scorers (see the sketch after this list)
- Production monitoring for GenAI - Deploy your scorers for continuous monitoring
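For a sense of how the evaluation harness wires data into scorers, here is a minimal sketch assuming mlflow.genai.evaluate() passes each dataset row's fields to scorer parameters with matching names (inputs, outputs, expectations); the question key and the relevance check itself are hypothetical, not part of the built-in scorers.
Python
from typing import Any

from mlflow.genai.scorers import scorer


@scorer
def mentions_question(inputs: dict[str, Any], outputs: str) -> bool:
    # The harness maps each row's fields to scorer parameters by name,
    # so this scorer sees the same inputs/outputs given to predict_fn.
    question = str(inputs.get("question", ""))
    return bool(question) and question.lower() in outputs.lower()
Adding mentions_question to the scorers list in the snippet above would evaluate it alongside Safety and exact_match.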