Scorers
Scorers evaluate GenAI app quality by analyzing outputs and producing structured feedback. Write once, use everywhere - in development and production.
Quick reference
Return Type | UI Display | Use Case
---|---|---
"yes" / "no" string | Pass/Fail | Binary evaluation
True / False | True/False | Boolean checks
int / float | Numeric value | Scores, counts
Feedback | Value + rationale | Detailed assessment
list[Feedback] | Multiple metrics | Multi-aspect evaluation
Write once, use everywhere
A key design principle of MLflow scorers is write once, use everywhere. The same scorer function works seamlessly in:
- Development: Evaluate different versions of your app using mlflow.genai.evaluate()
- Production: Monitor live traffic quality with MLflow's production monitoring service
This unified approach means you can develop and test your quality metrics locally, then deploy the exact same logic to production without modification.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

# Define your scorer once
@scorer
def response_completeness(outputs: str) -> Feedback:
    # Outputs is the return value of your app. Here we assume it's a string.
    if len(outputs.strip()) < 10:
        return Feedback(
            value=False,
            rationale="Response too short to be meaningful"
        )

    if outputs.lower().endswith(("...", "etc", "and so on")):
        return Feedback(
            value=False,
            rationale="Response appears incomplete"
        )

    return Feedback(
        value=True,
        rationale="Response appears complete"
    )

# Directly call the scorer function for spot testing
response_completeness(outputs="This is a test response...")

# Use in development evaluation
mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[response_completeness]
)
How scorers work
Scorers analyze traces from your GenAI application and produce quality assessments. Here's the flow:
- Your app runs and produces a trace capturing its execution
- MLflow passes the trace to your scorer function
- Scorers analyze the trace's inputs, outputs, and intermediate execution steps using custom logic
- Feedback is produced with scores and explanations
- Feedbacks are attached to the trace for analysis
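To make this flow concrete, here is a minimal, hypothetical sketch: my_app, the question, and the is_non_empty scorer are illustrative placeholders, and the data format mirrors the evaluate() examples later on this page.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

# A hypothetical app; @mlflow.trace captures its execution as a trace (step 1)
@mlflow.trace
def my_app(question: str) -> str:
    return f"Answer to: {question}"

# MLflow extracts the trace's outputs and passes them to the scorer (steps 2-3)
@scorer
def is_non_empty(outputs: str) -> Feedback:
    return Feedback(value=bool(outputs.strip()), rationale="Checked for non-empty output")

# Feedback is produced and attached to each trace for analysis (steps 4-5)
mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_app,
    scorers=[is_non_empty],
)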
Inputs
Scorers receive the complete MLflow trace containing all spans, attributes, and outputs. As a convenience, MLflow also extracts commonly needed data and passes it as named arguments:
from typing import Any, List, Optional, Union

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def my_custom_scorer(
    *,  # All arguments are keyword-only
    inputs: Optional[dict[str, Any]],        # App's raw input, a dictionary of input argument names and values
    outputs: Optional[Any],                  # App's raw output
    expectations: Optional[dict[str, Any]],  # Ground truth, a dictionary of label names and values
    trace: Optional[mlflow.entities.Trace],  # Complete trace with all metadata
) -> Union[int, float, bool, str, Feedback, List[Feedback]]:
    # Your evaluation logic here
    ...
All parameters are optional—declare only what your scorer needs:
- inputs: The request sent to your app (e.g., user query, context).
- outputs: The response from your app (e.g., generated text, tool calls)
- expectations: Ground truth or labels (e.g., expected response, guidelines, etc.)
- trace: The complete execution trace with all spans, allowing analysis of intermediate steps, latency, tool usage, etc.
When running mlflow.genai.evaluate(), the inputs, outputs, and expectations parameters can be specified in the data argument or parsed from the trace.

When running mlflow.genai.create_monitor(), the inputs and outputs parameters are always parsed from the trace; expectations is not available.
Outputs
Scorers can return different types depending on your evaluation needs:
Simple values
Return primitive values for straightforward pass/fail or numeric assessments.
- Pass/fail strings: "yes" or "no" render as "Pass" or "Fail" in the UI
- Boolean values: True or False for binary evaluations
- Numeric values: Integers or floats for scores, counts, or measurements
from mlflow.genai.scorers import scorer

# These examples assume your app returns a string as a response.
@scorer
def response_length(outputs: str) -> int:
    # Return a numeric metric
    return len(outputs.split())

@scorer
def contains_citation(outputs: str) -> str:
    # Return pass/fail string
    return "yes" if "[source]" in outputs else "no"
Rich feedback
Return Feedback objects for detailed assessments with explanations:
from mlflow.entities import Feedback, AssessmentSource

@scorer
def content_quality(outputs):
    return Feedback(
        value=0.85,  # Can be numeric, boolean, or string
        rationale="Clear and accurate, minor grammar issues",
        # Optional: source of the assessment. Several source types are supported,
        # such as "HUMAN", "CODE", "LLM_JUDGE".
        source=AssessmentSource(
            source_type="HUMAN",
            source_id="grammar_checker_v1"
        ),
        # Optional: additional metadata about the assessment.
        metadata={
            "annotator": "me@example.com",
        },
    )
Multiple feedback objects can be returned as a list. Each feedback will be displayed as a separate metric in the evaluation results.
@scorer
def comprehensive_check(inputs, outputs):
    return [
        Feedback(name="relevance", value=True, rationale="Directly addresses query"),
        Feedback(name="tone", value="professional", rationale="Appropriate for audience"),
        Feedback(name="length", value=150, rationale="Word count within limits")
    ]
Metric naming behavior
When using the @scorer decorator, the metric names in the evaluation results follow these rules:
- Primitive value or single feedback without a name: The scorer function name becomes the metric name.

@scorer
def word_count(outputs: str) -> int:
    # "word_count" will be used as the metric name
    return len(outputs.split())

@scorer
def response_quality(outputs: Any) -> Feedback:
    # "response_quality" will be used as the metric name
    return Feedback(value=True, rationale="Good quality")

- Single feedback with an explicit name: The name specified in the Feedback object is used as the metric name.

@scorer
def assess_factualness(outputs: Any) -> Feedback:
    # The name "factual_accuracy" is explicitly specified, so it will be used as the metric name
    return Feedback(name="factual_accuracy", value=True, rationale="Factual accuracy is high")

- Multiple feedbacks: The names specified in each Feedback object are preserved. You must specify a unique name for each feedback.

@scorer
def multi_aspect_check(outputs) -> list[Feedback]:
    # These names ARE used since multiple feedbacks are returned
    return [
        Feedback(name="grammar", value=True, rationale="No errors"),
        Feedback(name="clarity", value=0.9, rationale="Very clear"),
        Feedback(name="completeness", value="yes", rationale="All points addressed")
    ]
This naming behavior ensures consistent metric names in your evaluation results and dashboards.
Error handling
When a scorer encounters an error, MLflow provides two approaches:
Let exceptions propagate (recommended)
The simplest approach is to let exceptions propagate naturally. MLflow automatically captures the exception and creates a Feedback object with the error details:
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def is_valid_response(outputs: str) -> Feedback:
    import json

    # Let json.JSONDecodeError propagate if response isn't valid JSON
    data = json.loads(outputs)

    # Let KeyError propagate if required fields are missing
    summary = data["summary"]
    confidence = data["confidence"]

    return Feedback(
        value=True,
        rationale=f"Valid JSON with confidence: {confidence}"
    )

# Run the scorer on invalid data that triggers exceptions
invalid_data = [
    {
        # Valid JSON
        "outputs": '{"summary": "this is a summary", "confidence": 0.95}'
    },
    {
        # Invalid JSON
        "outputs": "invalid json",
    },
    {
        # Missing required fields
        "outputs": '{"summary": "this is a summary"}'
    },
]

mlflow.genai.evaluate(
    data=invalid_data,
    scorers=[is_valid_response],
)
When an exception occurs, MLflow creates a Feedback with:
- value: None
- error: The exception details, such as the exception object, error message, and stack trace

The error information will be displayed in the evaluation results. Open the corresponding row to see the error details.
Handle exceptions explicitly
For custom error handling or to provide specific error messages, catch exceptions and return a Feedback with a None value and error details:
from mlflow.entities import AssessmentError, Feedback
from mlflow.genai.scorers import scorer

@scorer
def is_valid_response(outputs):
    import json

    try:
        data = json.loads(outputs)
        required_fields = ["summary", "confidence", "sources"]
        missing = [f for f in required_fields if f not in data]

        if missing:
            return Feedback(
                error=AssessmentError(
                    error_code="MISSING_REQUIRED_FIELDS",
                    error_message=f"Missing required fields: {missing}",
                ),
            )

        return Feedback(
            value=True,
            rationale="Valid JSON with all required fields"
        )
    except json.JSONDecodeError as e:
        return Feedback(error=e)  # Can pass an exception object directly to the error parameter
The error parameter accepts:
- Python Exception: Pass the exception object directly
- AssessmentError: For structured error reporting with error codes
When expectations are available
Expectations (ground truth or labels) are typically important for offline evaluation. You can specify them in two ways when running mlflow.genai.evaluate(), as shown in the examples below:
- Include an expectations column (or field) in the data argument.
- Associate Expectation objects with traces and pass those traces to the data argument.
from typing import Any

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected = expectations.get("expected_response")
    is_correct = outputs == expected

    return Feedback(
        value=is_correct,
        rationale=f"Response {'matches' if is_correct else 'differs from'} expected"
    )

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "Paris",
        # Specify expected response in the expectations field
        "expectations": {
            "expected_response": "Paris"
        },
    },
]

mlflow.genai.evaluate(
    data=data,
    scorers=[exact_match],
)
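The second approach attaches expectations to existing traces. Below is a minimal sketch, assuming MLflow 3's mlflow.log_expectation() API and a trace_id column in the mlflow.search_traces() result; previous_run_id is a placeholder:

import mlflow

# Fetch previously logged traces (e.g., from an earlier evaluation run)
traces = mlflow.search_traces(run_id=previous_run_id)

# Attach ground truth to a trace; scorers receive it via the `expectations` argument
mlflow.log_expectation(
    trace_id=traces.iloc[0]["trace_id"],
    name="expected_response",
    value="Paris",
)

# Pass the annotated traces as the evaluation data
mlflow.genai.evaluate(
    data=traces,
    scorers=[exact_match],
)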
Production monitoring typically doesn't have expectations since you're evaluating live traffic without ground truth. If you intend to use the same scorer for both offline and online evaluation, design it to handle missing expectations gracefully.
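For example, here is a sketch of a scorer that degrades gracefully when no ground truth is attached; the name and logic are illustrative:

from typing import Any, Optional

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def matches_expected_if_available(outputs: str, expectations: Optional[dict[str, Any]] = None) -> Feedback:
    # In production monitoring, expectations are not provided
    if not expectations or "expected_response" not in expectations:
        return Feedback(value=None, rationale="No expectations available; skipping comparison")

    is_correct = outputs == expectations["expected_response"]
    return Feedback(value=is_correct, rationale="Compared against expected_response")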
Using trace data
Scorers can access the full trace to evaluate complex application behavior:
from mlflow.entities import Feedback, Trace
from mlflow.genai.scorers import scorer

@scorer
def tool_call_efficiency(trace: Trace) -> Feedback:
    """Evaluate how effectively the app uses tools"""
    # Retrieve all tool call spans from the trace
    tool_calls = trace.search_spans(span_type="TOOL")

    if not tool_calls:
        return Feedback(
            value=None,
            rationale="No tool usage to evaluate"
        )

    # Check for redundant calls
    tool_names = [span.name for span in tool_calls]
    if len(tool_names) != len(set(tool_names)):
        return Feedback(
            value=False,
            rationale=f"Redundant tool calls detected: {tool_names}"
        )

    # Check for errors
    failed_calls = [s for s in tool_calls if s.status.status_code != "OK"]
    if failed_calls:
        return Feedback(
            value=False,
            rationale=f"{len(failed_calls)} tool calls failed"
        )

    return Feedback(
        value=True,
        rationale=f"Efficient tool usage: {len(tool_calls)} successful calls"
    )
When running offline evaluation with mlflow.genai.evaluate(), the traces are:
- specified in the data argument if they are already available, or
- generated by running predict_fn against the inputs in the data argument.
When running production monitoring with mlflow.genai.create_monitor()
, traces collected by the monitor are passed directly to the scorer function, with the specified sampling and filtering criteria.
Scorer implementation approaches
MLflow provides two ways to implement scorers:
Decorator approach (recommended)
Use the @scorer decorator for simple, function-based scorers:
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def response_tone(outputs: str) -> Feedback:
    """Check if response maintains professional tone"""
    informal_phrases = ["hey", "gonna", "wanna", "lol", "btw"]
    found = [p for p in informal_phrases if p in outputs.lower()]

    if found:
        return Feedback(
            value=False,
            rationale=f"Informal language detected: {', '.join(found)}"
        )

    return Feedback(
        value=True,
        rationale="Professional tone maintained"
    )
Class-based approach
Use the Scorer base class for more complex scorers that require state. The Scorer class is a Pydantic object, so you can define additional fields and use them in the __call__ method.
from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional

# Scorer class is a Pydantic object
class ResponseQualityScorer(Scorer):
    # The `name` field is mandatory
    name: str = "response_quality"
    # Define additional fields
    min_length: int = 50
    required_sections: Optional[list[str]] = None

    # Override the __call__ method to implement the scorer logic
    def __call__(self, outputs: str) -> Feedback:
        issues = []

        # Check length
        if len(outputs.split()) < self.min_length:
            issues.append(f"Too short (minimum {self.min_length} words)")

        # Check required sections, if any are configured
        missing = [s for s in (self.required_sections or []) if s not in outputs]
        if missing:
            issues.append(f"Missing sections: {', '.join(missing)}")

        if issues:
            return Feedback(
                value=False,
                rationale="; ".join(issues)
            )

        return Feedback(
            value=True,
            rationale="Response meets all quality criteria"
        )
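A usage sketch follows; it assumes import mlflow, and the field values, test_dataset, and my_app are placeholders, as in the earlier examples:

# Configure the scorer via its Pydantic fields, then use it like any other scorer
quality_scorer = ResponseQualityScorer(
    min_length=100,
    required_sections=["Summary", "Conclusion"],
)

mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[quality_scorer],
)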
Custom scorer development workflow
When developing custom scorers, you often need to iterate quickly without re-running your application each time. MLflow supports an efficient workflow:
- Generate traces once by running your app with mlflow.genai.evaluate()
- Store the traces using mlflow.search_traces()
- Iterate on scorers by passing stored traces to evaluate() without re-running your app
This approach saves time and resources during scorer development:
import mlflow
from mlflow.genai.scorers import scorer

# Step 1: Generate traces with a placeholder scorer
initial_results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[lambda **kwargs: 1]  # Placeholder scorer
)

# Step 2: Store traces for reuse
traces = mlflow.search_traces(run_id=initial_results.run_id)

# Step 3: Iterate on your scorer without re-running the app
@scorer
def my_custom_scorer(outputs):
    # Your evaluation logic here
    pass

# Test scorer on stored traces (no predict_fn needed)
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[my_custom_scorer]
)
Common gotchas
Scorer naming with decorators
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

# GOTCHA: Function name becomes feedback name for single returns
@scorer
def quality_check(outputs):
    # This 'name' parameter is IGNORED
    return Feedback(name="ignored", value=True)

# Feedback will be named "quality_check"

# CORRECT: Use function name meaningfully
@scorer
def response_quality(outputs):
    return Feedback(value=True, rationale="Good quality")

# Feedback will be named "response_quality"

# EXCEPTION: Multiple feedbacks preserve their names
@scorer
def multi_check(outputs):
    return [
        Feedback(name="grammar", value=True),   # Name preserved
        Feedback(name="spelling", value=True),  # Name preserved
        Feedback(name="clarity", value=0.9)     # Name preserved
    ]
State management in scorers
from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback

# WRONG: Don't use mutable class attributes
class BadScorer(Scorer):
    results = []  # Shared across all instances!

    def __call__(self, outputs, **kwargs):
        self.results.append(outputs)  # Causes issues
        return Feedback(value=True)

# CORRECT: Use instance attributes
class GoodScorer(Scorer):
    def __init__(self):
        super().__init__(name="good_scorer")
        self.results = []  # Per-instance state

    def __call__(self, outputs, **kwargs):
        self.results.append(outputs)  # Safe
        return Feedback(value=True)
Next Steps
- Create custom code-based scorers - Build domain-specific scorers using Python functions
- Evaluate with LLM judges - Use pre-built LLM-based scorers for common quality metrics
- Run scorers in production - Apply the same scorers to monitor production traffic