Building MLflow evaluation datasets

To systematically test and improve a GenAI application, you use an evaluation dataset. An evaluation dataset is a selected set of example inputs — either labeled (with known expected outputs) or unlabeled (without ground-truth answers). Evaluation datasets help you improve your app's performance in the following ways:

  • Improve quality. Test fixes against known problematic examples from production.
  • Prevent regressions. Create a "golden set" of examples that must always work correctly.
  • Compare app versions. Test different prompts, models, or app logic against the same data.
  • Target specific features. Build specialized datasets for safety, domain knowledge, or edge cases.
  • Validate the app across different environments as part of LLMOps.

MLflow evaluation datasets are stored in Unity Catalog, which provides built-in versioning, lineage, sharing, and governance.

Requirements

  • To create an evaluation dataset, you must have CREATE TABLE permissions on a Unity Catalog schema.
  • An evaluation dataset is attached to an MLflow experiment. If you do not already have an experiment, see Create an MLflow Experiment to create one.

Data sources for evaluation datasets

You can use any of the following to create an evaluation dataset:

  • Existing traces. If you have already captured traces from a GenAI application, you can use them to create an evaluation dataset based on real-world scenarios.
  • An existing dataset, or directly entered examples. This option is useful for quick prototyping or for targeted testing of specific features.
  • Synthetic data. Databricks can automatically generate a representative evaluation set from your documents, allowing you to quickly evaluate your agent with good coverage of test cases.

This page describes how to create an MLflow evaluation dataset. You can also use other types of datasets, such as Pandas DataFrames or a list of dictionaries. See MLflow evaluation examples for GenAI for examples.
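
For quick prototyping, an in-memory dataset can be as simple as a list of dictionaries with inputs and, optionally, expectations. The following is a minimal sketch that passes such a list directly to mlflow.genai.evaluate(); the questions, expected responses, and choice of the Correctness scorer are illustrative, and the placeholder predict_fn stands in for your application.

Python
import mlflow
from mlflow.genai.scorers import Correctness

# A small, hand-written dataset: each record pairs inputs with optional expectations
eval_records = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"expected_response": "Go to Settings > Security and click Reset password."},
    },
    {
        "inputs": {"question": "What is the refund window?"},
        "expectations": {"expected_response": "Refunds are available within 30 days of purchase."},
    },
]

# Replace the lambda with a call into your own application
results = mlflow.genai.evaluate(
    data=eval_records,
    predict_fn=lambda question: f"(placeholder answer for: {question})",
    scorers=[Correctness()],
)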

Create a dataset using the UI

Follow these steps to use the UI to create a dataset from existing traces.

  1. Click Experiments in the sidebar to display the Experiments page.

  2. In the table, click on the name of your experiment to open it.

  3. In the left sidebar, click Traces.

    Traces tab in sidebar

  4. Use the checkboxes to the left of the trace list to select traces to export to your dataset. To select all traces, click the box next to Trace ID.

    Selecting traces for eval dataset

  5. Click Actions. From the drop-down menu, select Add to evaluation dataset.

  6. The Add traces to evaluation dataset dialog appears.

    If evaluation datasets exist for this experiment, they appear in the dialog.

    1. Click Export next to the dataset you want to add these traces to.
    2. After the traces have been exported, the button changes to Exported. Click Done.

    If no evaluation dataset exists for the experiment:

    1. Click Create new dataset.
    2. In the Create Dataset dialog, click Select schema, and select the schema to hold the dataset.
    3. Click Confirm.
    4. In the Table name field of the Create evaluation dataset dialog, enter a name for the evaluation dataset and click Create Dataset.
    5. The Add traces to evaluation dataset dialog appears, showing the dataset you just created. Click Export next to the dataset.
    6. After the traces have been exported, the button changes to Exported. Click Done.

Create a dataset using the SDK

Follow these steps to use the SDK to create a dataset.

Step 1: Create the dataset

Python
import mlflow
import mlflow.genai.datasets
from databricks.connect import DatabricksSession

# 0. If you are using a local development environment, connect to Serverless Spark,
#    which powers MLflow's evaluation dataset service.
spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()

# 1. Create an evaluation dataset

# Replace with a Unity Catalog schema where you have CREATE TABLE permission
uc_schema = "workspace.default"
# This table will be created in the above UC schema
evaluation_dataset_table_name = "email_generation_eval"

eval_dataset = mlflow.genai.datasets.create_dataset(
    name=f"{uc_schema}.{evaluation_dataset_table_name}",
)
print(f"Created evaluation dataset: {uc_schema}.{evaluation_dataset_table_name}")

Step 2: Add records to your dataset

This section describes several options for adding records to the evaluation dataset.

One of the most effective ways to build a relevant evaluation dataset is by curating examples directly from your application's historical interactions captured by MLflow Tracing. You can create datasets from traces using either the MLflow Monitoring UI or the SDK.

Use search_traces() to programmatically find traces, then add them to the dataset with merge_records(). Use filters to select traces by success or failure status, production use, or other properties. See Search traces programmatically.

Python
import mlflow

# 2. Search for successful production traces
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK' AND tags.environment = 'production'",
    order_by=["attributes.timestamp_ms DESC"],
    max_results=10,
)

print(f"Found {len(traces)} successful traces")

# 3. Add the traces to the evaluation dataset
eval_dataset = eval_dataset.merge_records(traces)
print(f"Added {len(traces)} records to evaluation dataset")

# Preview the dataset
df = eval_dataset.to_df()
print("\nDataset preview:")
print(f"Total records: {len(df)}")
print("\nSample record:")
sample = df.iloc[0]
print(f"Inputs: {sample['inputs']}")

Select traces for evaluation datasets

Before adding traces to your dataset, identify which traces represent important test cases for your evaluation needs. You can use both quantitative and qualitative analysis to select representative traces.

Quantitative trace selection

Use the MLflow UI or SDK to filter and analyze traces based on measurable characteristics:

  • In the MLflow UI: Filter by tags (e.g., tag.quality_score < 0.7), search for specific inputs/outputs, sort by latency or token usage
  • Programmatically: Query traces to perform advanced analysis
Python
import mlflow
import pandas as pd

# Search for traces with potential quality issues
traces_df = mlflow.search_traces(
    filter_string="tag.quality_score < 0.7",
    max_results=100,
    extract_fields=[
        "span.end_time",
        "span.inputs.messages",
        "span.outputs.choices",
        "span.attributes.usage.total_tokens",
    ],
)

# Analyze patterns
# For example, check if quality issues correlate with token usage
correlation = traces_df["span.attributes.usage.total_tokens"].corr(traces_df["tag.quality_score"])
print(f"Correlation between token usage and quality: {correlation}")

For complete trace query syntax and examples, see Search traces programmatically.

Qualitative trace selection

Review individual traces to identify patterns requiring human judgment:

  • Examine inputs that led to low-quality outputs
  • Look for patterns in how your application handled edge cases
  • Identify missing context or faulty reasoning
  • Compare high-quality vs. low-quality traces to understand differentiating factors

Once you've identified representative traces, add them to your dataset using the search and merge methods described above.
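
For example, after a manual review you might note the IDs of a handful of representative traces and merge only those rows. The sketch below assumes the eval_dataset created earlier, uses made-up trace IDs, and assumes the trace ID column returned by search_traces() is named trace_id (it is request_id in older MLflow versions).

Python
import mlflow

# Trace IDs noted during qualitative review (hypothetical values)
selected_trace_ids = [
    "tr-1a2b3c4d5e6f7890",
    "tr-0f9e8d7c6b5a4321",
]

# Pull recent traces, then keep only the hand-picked ones
traces_df = mlflow.search_traces(max_results=500)
curated_df = traces_df[traces_df["trace_id"].isin(selected_trace_ids)]

# Merge the curated subset into the evaluation dataset
eval_dataset = eval_dataset.merge_records(curated_df)
print(f"Added {len(curated_df)} hand-picked traces to the dataset")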

tip

Enrich your traces with expected outputs or quality indicators to enable ground truth comparison. See collect domain expert feedback to add human labels.
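
If you already know the ground truth, you can also attach expectations programmatically when you merge records. A minimal sketch, assuming the eval_dataset created earlier; the question and expected facts are illustrative.

Python
# Hypothetical labeled record: expectations hold the ground truth that scorers compare against
labeled_records = [
    {
        "inputs": {"question": "What is our standard support SLA?"},
        "expectations": {"expected_facts": ["first response within 24 hours", "business days only"]},
    },
]

eval_dataset = eval_dataset.merge_records(labeled_records)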

Update existing datasets

You can use the UI or the SDK to update an evaluation dataset.

To add records to an existing evaluation dataset using the UI, follow these steps. To use the SDK instead, see the sketch after the steps.

  1. Open the dataset page in the Databricks workspace:

    1. In the Databricks workspace, navigate to your experiment.
    2. In the sidebar at left, click Datasets.
    3. Click on the name of the dataset in the list.

    Datasets tab in sidebar

  2. Click Add record. A new row appears with generic content.

  3. Edit the new row directly to enter the input and expectations for the new record. Optionally, set any tags for the new record.

  4. Click Save changes.
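
To add records with the SDK instead, load the dataset and merge new records. The sketch below assumes get_dataset() accepts the Unity Catalog table name the same way create_dataset() does above; the example record is illustrative.

Python
import mlflow.genai.datasets

# Load the existing dataset by its Unity Catalog table name
eval_dataset = mlflow.genai.datasets.get_dataset(name="workspace.default.email_generation_eval")

# Merge additional records alongside the existing rows
eval_dataset = eval_dataset.merge_records([
    {
        "inputs": {"question": "Does the product support SSO?"},
        "expectations": {"expected_response": "Yes, SAML and OIDC single sign-on are supported."},
    },
])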

Limitations

  • Customer Managed Keys (CMK) are not supported.
  • Maximum of 2000 rows per evaluation dataset.
  • Maximum of 20 expectations per dataset record.

If you need any of these limitations relaxed for your use case, contact your Databricks representative.

Next steps