Use traces to evaluate and improve quality
Traces are not just for debugging—they contain valuable information that can drive systematic quality improvements in your GenAI applications. This guide shows you how to analyze traces to identify quality issues, create evaluation datasets from trace data, implement targeted improvements, and measure the impact of your changes.
Analyzing traces to identify quality issues
Traces provide detailed insights into how your application processes user requests. By analyzing these traces, you can identify patterns of quality issues:
Quantitative analysis
- Use the MLflow UI to filter and group traces with similar characteristics:
  - Filter by specific tags (e.g., `tag.issue_type = "hallucination"`)
  - Search for traces containing specific inputs or outputs
  - Sort by metadata like latency or token usage
- Query traces programmatically to perform more advanced analysis:
```python
import mlflow
import pandas as pd

# Search for traces with potential quality issues
traces_df = mlflow.search_traces(
    filter_string="tag.quality_score < 0.7",
    max_results=100,
    extract_fields=["span.end_time", "span.inputs.messages", "span.outputs.choices", "span.attributes.usage.total_tokens"],
)

# Analyze patterns. For example, check whether quality issues correlate with
# token usage. Tag values are stored as strings, so convert the score first.
quality_scores = pd.to_numeric(
    traces_df["tags"].apply(lambda tags: tags.get("quality_score")), errors="coerce"
)
correlation = traces_df["span.attributes.usage.total_tokens"].corr(quality_scores)
print(f"Correlation between token usage and quality: {correlation}")
```
Qualitative analysis
- Review individual traces that represent common failure modes:
  - Examine the inputs that led to low-quality outputs
  - Look for patterns in how the application handled these cases
  - Identify missing context or faulty reasoning
- Compare high-quality vs. low-quality traces (see the sketch after this list):
  - What differs in how your application processes these different inputs?
  - Are there specific types of queries that consistently lead to quality issues?
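To put examples from each category side by side, you can pull a few traces from both ends of the quality range and print their requests and responses. This is a minimal sketch that assumes the `quality_score` tag convention from the quantitative example and the `request`/`response` columns returned by `search_traces`:

```python
import mlflow

# Fetch a handful of traces from each end of the quality range
low_df = mlflow.search_traces(filter_string="tag.quality_score < 0.5", max_results=5)
high_df = mlflow.search_traces(filter_string="tag.quality_score > 0.9", max_results=5)

# Print request/response pairs to spot patterns that separate good outputs from bad ones
for label, df in [("LOW", low_df), ("HIGH", high_df)]:
    for _, row in df.iterrows():
        print(f"[{label}] request:  {str(row['request'])[:200]}")
        print(f"[{label}] response: {str(row['response'])[:200]}\n")
```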
Creating evaluation datasets from trace data
Once you've identified representative traces, you can curate them into evaluation datasets for systematic testing:
- Export traces to a dataset:
```python
import mlflow
import pandas as pd

# Query traces that represent important test cases
traces_df = mlflow.search_traces(
    filter_string="trace.timestamp > '2023-07-01'",
    max_results=500,
    extract_fields=["span.inputs.messages", "span.outputs.choices"],
)

# Prepare dataset format
eval_data = []
for _, row in traces_df.iterrows():
    # Extract the user query from the chat messages
    messages = row["span.inputs.messages"]
    user_query = next((msg["content"] for msg in messages if msg["role"] == "user"), None)

    # Extract the model response from the first choice
    choices = row["span.outputs.choices"]
    response = choices[0]["message"]["content"] if choices else None

    if user_query and response:
        eval_data.append({"input": user_query, "output": response})

# Create the evaluation dataset
eval_df = pd.DataFrame(eval_data)
eval_df.to_csv("evaluation_dataset.csv", index=False)
```
- Add ground truth or expected outputs (a sketch of merging annotations follows this list):
  - For each trace, add the correct or expected output
  - Include quality indicators or specific aspects to evaluate
  - Consider having domain experts review and annotate the dataset
- Register the dataset with MLflow:
```python
import mlflow

# Log the evaluation dataset as a run artifact
with mlflow.start_run() as run:
    mlflow.log_artifact("evaluation_dataset.csv", "evaluation_datasets")
```
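The annotation work itself usually happens in a spreadsheet or labeling tool. As a minimal sketch, assuming reviewers return a hypothetical `annotations.csv` keyed by the `input` column and containing an `expected_output` column, you could merge their work back into the dataset before logging it:

```python
import pandas as pd

# Hypothetical annotation file from domain experts; the file name and the
# "expected_output" column are assumptions for illustration.
eval_df = pd.read_csv("evaluation_dataset.csv")
annotations = pd.read_csv("annotations.csv")

# Attach the expected output to each example; unmatched rows remain for manual review
eval_df = eval_df.merge(annotations[["input", "expected_output"]], on="input", how="left")
eval_df.to_csv("evaluation_dataset.csv", index=False)
```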
Implementing targeted improvements
With identified issues and evaluation datasets in hand, you can make targeted improvements:
Prompt engineering
- Refine system prompts to address specific failure patterns (see the prompt sketch after this list):
  - Add more explicit guidelines for handling edge cases
  - Include examples that demonstrate how to handle problematic inputs
  - Adjust the tone or style to better meet user expectations
- Add guardrails to prevent common quality issues (see the validation sketch after this list):
  - Implement validation steps in your application logic
  - Add post-processing to check outputs before presenting them to users
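For instance, a refined system prompt can state edge-case rules explicitly and demonstrate the desired behavior with a short example. The prompt below is purely illustrative (the product name and rules are invented); adapt it to the failure patterns you found in your traces:

```python
# Illustrative only: a revised system prompt with explicit edge-case guidance
# and a short demonstration example. Adapt the wording to your application.
SYSTEM_PROMPT = """You are a support assistant for ACME Cloud.

Guidelines:
- If the question is outside ACME Cloud's documentation, say you don't know.
- Never invent product names, prices, or version numbers.
- Ask a clarifying question when the request is ambiguous.

Example:
User: How do I reset my password on the mobile app?
Assistant: Open the app, tap Settings > Account > Reset password, then follow the emailed link.
"""
```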
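A post-processing guardrail can be a small validation function that runs on every response before it is shown to the user. The checks below are hypothetical placeholders; derive real rules from the failure modes you identified in your traces:

```python
# A minimal post-processing guardrail sketch. The specific checks are
# placeholders; replace them with rules derived from your trace analysis.
def validate_response(response: str, retrieved_docs: list[str]) -> tuple[bool, str]:
    if not response.strip():
        return False, "empty response"
    if len(response) > 4000:
        return False, "response exceeds length limit"
    # Flag answers that ignore the retrieved context entirely (a crude heuristic)
    if retrieved_docs and not any(doc[:40] in response for doc in retrieved_docs):
        return False, "response does not reference retrieved context"
    return True, "ok"

# Example usage with placeholder values
ok, reason = validate_response(
    "Open Settings > Account > Reset password.",
    ["Open Settings > Account > Reset password, then follow the emailed link."],
)
print(ok, reason)
```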
Application architecture improvements
- Enhance retrieval mechanisms if relevant documents aren't being found:
  - Examine retrieval spans in traces to see what's being retrieved (see the retrieval sketch after this list)
  - Improve embedding models or retrieval algorithms
  - Consider chunking strategies if document segments are suboptimal
- Add reasoning steps to complex decision processes:
  - Break down complex tasks into multiple spans (see the decorator sketch after this list)
  - Implement chain-of-thought or other reasoning techniques
  - Add verification steps for critical outputs
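To examine what a retriever returned for a problematic request, you can fetch the trace and inspect its retriever spans. A minimal sketch using MLflow's tracing APIs; the trace ID is a placeholder taken from the UI or from `search_traces`:

```python
import mlflow
from mlflow.entities import SpanType

# "tr-1234567890abcdef" is a placeholder; use a real trace ID from the UI or search_traces
trace = mlflow.get_trace("tr-1234567890abcdef")

# Print what each retriever span returned so you can judge relevance
for span in trace.data.spans:
    if span.span_type == SpanType.RETRIEVER:
        print(f"Query: {span.inputs}")
        print(f"Retrieved: {span.outputs}")
```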
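To break a complex task into multiple spans, you can wrap each step with the `@mlflow.trace` decorator so it appears as its own span in the trace. The step functions below are hypothetical placeholders for your own logic:

```python
import mlflow

# Hypothetical pipeline: each step becomes its own span in the trace,
# which makes it easier to see where quality breaks down.
@mlflow.trace(span_type="CHAIN")
def answer_question(question: str) -> str:
    plan = plan_steps(question)
    draft = draft_answer(question, plan)
    return verify_answer(question, draft)

@mlflow.trace
def plan_steps(question: str) -> str:
    return f"1. Look up docs relevant to: {question}\n2. Draft answer\n3. Verify"

@mlflow.trace
def draft_answer(question: str, plan: str) -> str:
    return f"Draft answer for '{question}' following the plan."  # call your LLM here

@mlflow.trace
def verify_answer(question: str, draft: str) -> str:
    return draft  # add a verification or critique step here

print(answer_question("How do I rotate my API key?"))
```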
Measuring quality improvements
After implementing changes, use MLflow to measure their impact:
- Run systematic evaluations using your curated dataset:
```python
import mlflow
import pandas as pd

# Load the curated evaluation dataset
eval_data = pd.read_csv("evaluation_dataset.csv")

# Compare the original and improved versions on your dataset
results = mlflow.evaluate(
    model=improved_model,              # Your improved model/application
    data=eval_data,
    targets="expected_output",         # Ground-truth column added during annotation (adjust to your schema)
    model_type="question-answering",   # Or another model type that matches your app
    baseline_model=original_model,     # The original version for comparison
    extra_metrics=[
        # Define your quality metrics here
    ],
)

# View the results
print(results.metrics)
```
- Monitor production traces after deploying improvements (see the monitoring sketch after this list):
  - Set up dashboards to track quality metrics over time
  - Monitor for regressions or unexpected behavior
  - Continuously collect new traces to identify emerging issues
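A lightweight starting point is a scheduled job that pulls recent traces and computes a few health indicators. A minimal sketch that reuses the `quality_score` tag convention from earlier; the exact filter field name for timestamps can vary across MLflow versions:

```python
import time
import mlflow

# Pull traces from the last 24 hours (trace timestamps are in milliseconds)
since_ms = int((time.time() - 24 * 60 * 60) * 1000)
recent = mlflow.search_traces(filter_string=f"timestamp_ms > {since_ms}", max_results=1000)

# Simple health indicators to chart or alert on
p95_latency_ms = recent["execution_time_ms"].quantile(0.95)
low_quality_rate = recent["tags"].apply(
    lambda tags: float(tags.get("quality_score", 1)) < 0.7
).mean()
print(f"traces={len(recent)}  p95_latency={p95_latency_ms:.0f} ms  low_quality={low_quality_rate:.1%}")
```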
Next steps
Continue your journey with these recommended actions and tutorials.
- Collect user feedback - Add structured quality feedback to traces
- Build evaluation datasets - Create comprehensive test sets from production traces
- Set up production monitoring - Track quality metrics in real-time
Reference guides
Explore detailed documentation for concepts and features mentioned in this guide.
- Query traces via SDK - Learn programmatic trace analysis techniques
- Evaluation concepts - Understand scorers, judges, and evaluation methodology
- Tracing data model - Explore trace structure and attributes
Quality improvement is an iterative process. Start with your most critical quality issues, implement targeted improvements, measure their impact, and repeat.