Step 5 (generation). How to debug generation quality
This page describes how to identify the root cause of generation problems. Use this page when root cause analysis indicates the root cause Improve Generation.
Even with optimal retrieval, if the LLM component of a RAG chain cannot effectively utilize the retrieved context to generate accurate, coherent, and relevant responses, the final output quality suffers. Generation quality issues typically appear as hallucinations, inconsistencies, or failure to concisely address the user’s query.
Instructions
Follow these steps to address generation quality issues:
Open the B_quality_iteration/01_root_cause_quality_issues notebook.
Use the queries to load MLflow traces of the records that had generation quality issues.
For each record, manually examine the generated response and compare it to the retrieved context and the ground-truth response.
Look for patterns or common issues among the queries with low generation quality. For example:
Generating information not present in the retrieved context.
Generating information that is not consistent with the retrieved context (hallucinating).
Failure to directly address the user’s query given the provided retrieved context.
Generating responses that are overly verbose, difficult to understand, or lack logical coherence.
Based on the identified issue, hypothesize potential root causes and corresponding fixes. For guidance, see Common reasons for poor generation quality.
Follow the steps in implement and evaluate changes to implement and evaluate a potential fix. This might involve modifying the RAG chain (for example, adjusting the prompt template or trying a different LLM) or the data pipeline (for example, adjusting the chunking strategy to provide more context).
If the generation quality is still not satisfactory, repeat steps 4 and 5 for the next most promising fix until the desired performance is achieved.
Re-run the root cause analysis to determine if the overall chain has any additional root causes that should be addressed.
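The manual comparison of a generated response against its retrieved context can be triaged with a rough first pass before reading each trace in full. The following is a minimal sketch, not part of the notebook referenced above: the function names, the token-overlap heuristic, and the `min_overlap` threshold are all assumptions made for illustration. It flags response sentences whose words rarely appear in the retrieved chunks, so you know where to look first. It does not detect hallucinations reliably and does not replace manual review.

```python
import re


def tokenize(text: str) -> set[str]:
    """Return lowercase alphanumeric tokens; a deliberately crude normalization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(
    response: str,
    retrieved_chunks: list[str],
    min_overlap: float = 0.5,  # hypothetical threshold; tune on your own data
) -> list[str]:
    """Flag sentences whose token overlap with the retrieved context is low.

    This is a heuristic for prioritizing manual review, not a groundedness metric.
    """
    context_tokens: set[str] = set()
    for chunk in retrieved_chunks:
        context_tokens |= tokenize(chunk)

    flagged = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        tokens = tokenize(sentence)
        if not tokens:
            continue
        overlap = len(tokens & context_tokens) / len(tokens)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


if __name__ == "__main__":
    chunks = ["The cluster autoscaled from two to eight workers during the run."]
    answer = (
        "The cluster autoscaled to eight workers. "
        "The job was written in Fortran by pirates."
    )
    for sentence in unsupported_sentences(answer, chunks):
        print("Review:", sentence)
```

Sentences that paraphrase the context with different vocabulary will also be flagged, so treat the output as a reading order rather than a verdict.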
Common reasons for poor generation quality
The following table lists debugging steps and potential fixes for common generation issues. Fixes are categorized by component, and the component determines which steps you should follow in the implement and evaluate changes step.
Important
Databricks recommends that you use prompt engineering to iterate on the quality of your app’s outputs. Most of the following steps use prompt engineering.
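As one illustration of a prompt engineering iteration, the sketch below builds a prompt that tells the LLM to answer only from the retrieved context and to decline when the context is insufficient. The template wording, the `GROUNDED_PROMPT` and `build_prompt` names, and the numbered-chunk layout are assumptions for this example, not a prescribed template; adapt them to the issues you observed in your own traces.

```python
# Hypothetical prompt template; the exact wording is an assumption to iterate on.
GROUNDED_PROMPT = """You are an assistant that answers questions using only the context below.
If the context does not contain the answer, say "I don't know" instead of guessing.
Answer concisely and address the question directly.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Fill the template with numbered chunks so the response can be traced to them."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return GROUNDED_PROMPT.format(context=context, question=question.strip())


if __name__ == "__main__":
    print(build_prompt("How many workers were used?", ["The run used eight workers."]))
```

Keeping the template in one place makes it easier to change a single instruction at a time and compare evaluation results between versions.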
Generation issue | Debugging steps | Potential fix |
---|---|---|
Generated information is not present in the retrieved context (such as hallucinations). | | |
Failure to directly address the user’s query or providing overly generic responses | | |
Generated responses are difficult to understand or lack logical flow | | |
Generated responses are not in the desired format or style | | |
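Once each low-quality record has been reviewed by hand, counting how often each issue in the table occurs helps decide which fix to try first. The sketch below assumes a hypothetical record shape (a dict with an `issues` list) and hypothetical short label names mapped to the four issue types above; neither comes from the notebook.

```python
from collections import Counter

# Hypothetical short labels for the four generation issues listed in the table.
ISSUE_LABELS = {
    "not_grounded": "Generated information is not present in the retrieved context",
    "off_target": "Failure to directly address the query, or overly generic response",
    "incoherent": "Difficult to understand or lacks logical flow",
    "wrong_format": "Not in the desired format or style",
}


def rank_issues(reviewed_records: list[dict]) -> list[tuple[str, int]]:
    """Count manually assigned issue labels and return them most frequent first."""
    counts: Counter = Counter()
    for record in reviewed_records:
        for label in record["issues"]:
            if label not in ISSUE_LABELS:
                raise ValueError(f"unknown issue label: {label}")
            counts[label] += 1
    return counts.most_common()


if __name__ == "__main__":
    reviewed = [
        {"request_id": "r1", "issues": ["not_grounded"]},
        {"request_id": "r2", "issues": ["not_grounded", "wrong_format"]},
        {"request_id": "r3", "issues": []},
    ]
    for label, count in rank_issues(reviewed):
        print(f"{count:>3}  {ISSUE_LABELS[label]}")
```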
Next step
If you also identified issues with retrieval quality, continue with Step 5 (retrieval). How to debug retrieval quality.
If you think that you have resolved all of the identified issues, continue with Step 6. Iteratively implement & evaluate quality fixes.