Backfill historical traces with scorers
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
You can retroactively apply new or updated scorers to historical traces. This is useful when you add a new scorer and want to evaluate past production data, or when you update an existing scorer and want to re-evaluate previous traces with the new configuration.
Prerequisites
- Scorers must be registered and started before they can be used for backfill.
- You need the scorer names or `BackfillScorerConfig` objects to specify which scorers to apply.
Backfill recent data
To backfill only recent traces, specify a `start_time` relative to the current date:
```python
from datetime import datetime, timedelta

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig

# safety_judge and response_length are scorers that have already been
# registered and started (see the registration example below).

# Backfill the last week's data with higher sample rates
one_week_ago = datetime.now() - timedelta(days=7)

job_id = backfill_scorers(
    scorers=[
        BackfillScorerConfig(scorer=safety_judge, sample_rate=0.8),
        BackfillScorerConfig(scorer=response_length, sample_rate=0.9),
    ],
    start_time=one_week_ago,
)
```
Backfill with custom sample rates and time range
To apply scorers with different sample rates than their current configuration, or to limit the backfill to a specific time range, use `BackfillScorerConfig`:
```python
from datetime import datetime

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from mlflow.genai.scorers import Safety, scorer, ScorerSamplingConfig

# Register and start the built-in Safety judge
safety_judge = Safety()
safety_judge = safety_judge.register(name="safety_check")
safety_judge = safety_judge.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Register and start a custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Define custom sample rates for backfill
custom_scorers = [
    BackfillScorerConfig(scorer=safety_judge, sample_rate=0.8),
    BackfillScorerConfig(scorer=response_length, sample_rate=0.9),
]

job_id = backfill_scorers(
    experiment_id=YOUR_EXPERIMENT_ID,
    scorers=custom_scorers,
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30),
)
```
Backfill using current sample rates
To apply registered scorers to historical traces using their current sample rate configuration:
```python
from databricks.agents.scorers import backfill_scorers
from mlflow.genai.scorers import Safety, scorer, ScorerSamplingConfig

safety_judge = Safety()
safety_judge = safety_judge.register(name="safety_check")
safety_judge = safety_judge.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Use existing sample rates for the specified scorers
job_id = backfill_scorers(
    scorers=["safety_check", "response_length"]
)
```
Best practices
- Start small. Begin with smaller time ranges to estimate job duration and resource usage.
- Use appropriate sample rates. Consider the cost and time implications of using high sample rates on large historical datasets.
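One way to start small is to split a large backfill into consecutive time windows and submit them one at a time, checking cost and duration after each. The helper below is a minimal, stdlib-only sketch of that chunking step; the `weekly_windows` function and its names are illustrative, not part of the backfill API:

```python
from datetime import datetime, timedelta

def weekly_windows(start, end, days=7):
    """Split [start, end) into consecutive windows of at most `days` days."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=days), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows

# June 2024 split into week-long chunks; each (start_time, end_time) pair
# could then be passed to backfill_scorers() as a separate, smaller job.
chunks = weekly_windows(datetime(2024, 6, 1), datetime(2024, 7, 1))
```

Running one window first gives you a per-week cost and runtime estimate before you commit to the full range.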
Troubleshooting
"Scheduled scorer 'X' not found in experiment"
- Ensure the scorer name matches a registered scorer in your experiment.
- Check available scorers using the `list_scorers()` method.
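A common cause of this error is a small mismatch between the name you pass and the registered name (for example, a hyphen instead of an underscore). The sketch below validates a requested name against the registered names before submitting a backfill; the `registered` list is a hypothetical stand-in for what `list_scorers()` would return in your experiment, and `check_scorer_name` is an illustrative helper, not part of the API:

```python
from difflib import get_close_matches

# Hypothetical registered scorer names; replace with the names
# returned by list_scorers() in your experiment.
registered = ["safety_check", "response_length"]

def check_scorer_name(name, registered):
    """Return `name` if it is registered, else raise with a close-match hint."""
    if name in registered:
        return name
    hint = get_close_matches(name, registered, n=1)
    suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
    raise ValueError(f"Scorer '{name}' not found.{suggestion}")
```

This catches typos locally instead of waiting for the backfill job to fail.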
Next steps
- Monitor GenAI apps in production - Set up production monitoring.
- Manage production scorers - Manage the lifecycle of your production scorers.
- Scorer lifecycle management API reference - Full API reference including `backfill_scorers()` parameters.