Backfill historical traces with scorers

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

You can retroactively apply new or updated scorers to historical traces. This is useful when you add a new scorer and want to evaluate past production data, or when you update an existing scorer and want to re-evaluate previous traces with the new configuration.

Prerequisites

Scorers must be registered and started before they can be used for backfill.
You need the scorer names or BackfillScorerConfig objects to specify which scorers to apply.

Backfill recent data

To backfill only recent traces, specify a start_time relative to the current date:

Python
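from datetime import datetime, timedelta

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig

# safety_judge and response_length must already be registered and started
# (see Prerequisites).

# Backfill the last seven days with sample rates higher than the
# scorers' current configuration
one_week_ago = datetime.now() - timedelta(days=7)

job_id = backfill_scorers(
    scorers=[
        BackfillScorerConfig(scorer=safety_judge, sample_rate=0.8),
        BackfillScorerConfig(scorer=response_length, sample_rate=0.9)
    ],
    start_time=one_week_ago
)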

Backfill with custom sample rates and time range

To apply scorers with different sample rates than their current configuration, or to limit the backfill to a specific time range, use BackfillScorerConfig:

Python
from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from datetime import datetime
from mlflow.genai.scorers import Safety, scorer, ScorerSamplingConfig

safety_judge = Safety()
safety_judge = safety_judge.register(name="safety_check")
safety_judge = safety_judge.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Define custom sample rates for backfill
custom_scorers = [
    BackfillScorerConfig(scorer=safety_judge, sample_rate=0.8),
    BackfillScorerConfig(scorer=response_length, sample_rate=0.9)
]

job_id = backfill_scorers(
    experiment_id=YOUR_EXPERIMENT_ID,
    scorers=custom_scorers,
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30)
)
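
Before submitting a backfill over an explicit time range, it can help to sanity-check the inputs locally. The following is a minimal sketch and not part of the Databricks API: check_backfill_inputs is a hypothetical helper, and it assumes that sample rates are fractions between 0 and 1 and that start_time must be earlier than end_time.

```python
from datetime import datetime


def check_backfill_inputs(sample_rates, start_time, end_time):
    """Return a list of human-readable problems; an empty list means no issues found.

    Hypothetical helper. Assumes sample rates are fractions in [0, 1]
    and that the time range must be non-empty.
    """
    errors = []
    for name, rate in sample_rates.items():
        if not 0.0 <= rate <= 1.0:
            errors.append(f"{name}: sample_rate {rate} is outside [0, 1]")
    if start_time >= end_time:
        errors.append("start_time must be earlier than end_time")
    return errors


# Illustrative values only: the second rate is deliberately invalid.
errors = check_backfill_inputs(
    {"safety_check": 0.8, "response_length": 1.5},
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30),
)
```

If the returned list is empty, the same values can be passed on to BackfillScorerConfig and backfill_scorers.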

Backfill using current sample rates

To apply registered scorers to historical traces using their current sample rate configuration:

Python
from databricks.agents.scorers import backfill_scorers
from mlflow.genai.scorers import Safety, scorer, ScorerSamplingConfig

safety_judge = Safety()
safety_judge = safety_judge.register(name="safety_check")
safety_judge = safety_judge.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Use existing sample rates for specified scorers
job_id = backfill_scorers(
    scorers=["safety_check", "response_length"]
)

Best practices

Start small. Begin with smaller time ranges to estimate job duration and resource usage.
Use appropriate sample rates. Consider the cost and time implications of using high sample rates on large historical datasets.
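
One way to apply both practices is to split a long backfill range into shorter windows and to estimate how many traces each scorer would evaluate before submitting anything. The sketch below uses only the standard library. split_time_range and expected_evaluations are hypothetical helpers, not part of the Databricks API, the trace count is a made-up figure, and the estimate assumes a sample rate is the fraction of traces a scorer evaluates.

```python
from datetime import datetime, timedelta


def split_time_range(start, end, window):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    cursor = start
    while cursor < end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(cursor + window, end)
        yield cursor, window_end
        cursor = window_end


def expected_evaluations(trace_count, sample_rate):
    """Rough number of traces one scorer would evaluate at this sample rate."""
    return round(trace_count * sample_rate)


# Split June 2024 (the range used earlier on this page) into weekly windows.
windows = list(
    split_time_range(datetime(2024, 6, 1), datetime(2024, 6, 30), timedelta(days=7))
)

# Hypothetical volume: 10,000 traces in the range, sampled at 0.8.
estimate = expected_evaluations(10_000, 0.8)
```

Each window could then be passed as start_time and end_time to backfill_scorers, starting with a single window to gauge job duration before submitting the rest.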

Troubleshooting

"Scheduled scorer 'X' not found in experiment"

Ensure the scorer name matches a registered scorer in your experiment.
Check which scorers are available by calling the list_scorers() function.
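
To catch this error before submitting a job, you can compare the names you plan to backfill against the names registered in the experiment. The sketch below assumes you have already collected the registered names as plain strings (for example from the results of list_scorers(), assuming each returned scorer exposes a name attribute). find_unregistered is a hypothetical helper, and the spelling suggestions come from the standard library's difflib module.

```python
from difflib import get_close_matches


def find_unregistered(requested, registered):
    """Map each requested name that is not registered to its closest registered name(s)."""
    missing = {}
    for name in requested:
        if name not in registered:
            # Suggest at most one likely intended name to help spot typos.
            missing[name] = get_close_matches(name, registered, n=1)
    return missing


# Names registered earlier on this page; in practice, build this list
# from your experiment's scorers.
registered_names = ["safety_check", "response_length"]

problems = find_unregistered(["safety_check", "response_len"], registered_names)
```

An empty result means every requested name matches a registered scorer. A non-empty result lists each unknown name alongside a probable correction.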

Next steps

Monitor GenAI apps in production - Set up production monitoring.
Manage production scorers - Manage the lifecycle of your production scorers.
Scorer lifecycle management API reference - Full API reference including backfill_scorers() parameters.
