Step 5 (generation). How to debug generation quality

This page describes how to identify and address the root cause of generation problems. Use this page when root cause analysis identifies Improve Generation as a root cause.

Even with optimal retrieval, if the LLM component of a RAG chain cannot effectively use the retrieved context to generate accurate, coherent, and relevant responses, the final output quality suffers. Generation quality issues can appear as hallucinations, inconsistencies, or failure to concisely address the user’s query.


Follow these steps to address generation quality issues:

  1. Open the B_quality_iteration/01_root_cause_quality_issues notebook.

  2. Use the queries to load MLflow traces of the records that had generation quality issues.

  3. For each record, manually examine the generated response and compare it to the retrieved context and the ground-truth response.

  4. Look for patterns or common issues among the queries with low generation quality. For example:

    • Generating information not present in the retrieved context.

    • Generating information that is not consistent with the retrieved context (hallucinating).

    • Failure to directly address the user’s query given the provided retrieved context.

    • Generating responses that are overly verbose, difficult to understand, or lack logical coherence.

  5. Based on the identified issue, hypothesize potential root causes and corresponding fixes. For guidance, see Common reasons for poor generation quality.

  6. Follow the steps in implement and evaluate changes to implement and evaluate a potential fix. This might involve modifying the RAG chain (for example, adjusting the prompt template or trying a different LLM) or the data pipeline (for example, adjusting the chunking strategy to provide more context).

  7. If the generation quality is still not satisfactory, repeat steps 5 and 6 for the next most promising fix until the desired performance is achieved.

  8. Re-run the root cause analysis to determine if the overall chain has any additional root causes that should be addressed.
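The manual comparison in steps 3 and 4 can be assisted by a lightweight heuristic that surfaces candidates for review, for example flagging response sentences with little lexical overlap with the retrieved context as possible hallucinations. The sketch below is illustrative only; the tokenization and the 0.3 threshold are assumptions, not part of the notebook, and a flagged sentence is a hint to inspect manually, not a verdict.

```python
import re

def flag_unsupported_sentences(response: str, context: str, threshold: float = 0.3) -> list[str]:
    """Return response sentences whose tokens barely overlap the retrieved context.

    Low overlap is only a hint of hallucination; confirm by manual inspection.
    The 0.3 threshold is an arbitrary starting point, not a tuned value.
    """
    context_tokens = set(re.findall(r"\w+", context.lower()))
    flagged = []
    # Split the response into rough sentences at sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if tokens and len(tokens & context_tokens) / len(tokens) < threshold:
            flagged.append(sentence)
    return flagged

# Example: the second sentence introduces a claim absent from the context.
context = "Spark 3.5 adds the testing API. It was released in 2023."
response = "Spark 3.5 adds the testing API. It also ships a brand new rust kernel."
print(flag_unsupported_sentences(response, context))
```

Run this over the traces you loaded in step 2 and look for patterns in which queries produce flagged sentences.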

Common reasons for poor generation quality

The following sections describe debugging steps and potential fixes for common generation issues. Each fix is tagged with the component it applies to:

  • data pipeline tag
  • chain config tag
  • chain code tag

The component tag indicates which steps to follow in the implement and evaluate changes step.


Databricks recommends that you use prompt engineering to iterate on the quality of your app’s outputs. Most of the following fixes involve prompt engineering.

Generation issue: Generated information is not present in the retrieved context (such as hallucinations)

Debugging steps:

  • Compare generated responses to the retrieved context to identify hallucinated information.

  • Assess if certain types of queries or retrieved context are more prone to hallucinations.

Potential fixes:

  • chain config tag Update the prompt template to emphasize reliance on the retrieved context.
  • chain config tag Use a more capable LLM.
  • chain code tag Implement a fact-checking or verification step post-generation.
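For the first fix, a prompt template that emphasizes reliance on the retrieved context might look like the following. The wording and the `build_prompt` helper are illustrative assumptions; tune both for your chain and model.

```python
# Hypothetical prompt template that instructs the LLM to rely only on the
# retrieved context; adjust the wording for your chain and model.
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know."
Do not add facts that are not in the context.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(context: str, question: str) -> str:
    # Substitute the retrieved chunks and the user query into the template.
    return GROUNDED_PROMPT.format(context=context, question=question)

print(build_prompt("Vector search was added in v2.1.", "When was vector search added?"))
```

After changing the template, re-run the evaluation to confirm the hallucination rate actually drops.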

Generation issue: Failure to directly address the user’s query, or overly generic responses

Debugging steps:

  • Compare generated responses to user queries to assess relevance and specificity.

  • Check if certain types of queries result in the correct context being retrieved, but the LLM producing low-quality output.

Potential fixes:

  • chain config tag Improve the prompt template to encourage direct, specific responses.
  • chain config tag Retrieve more targeted context by improving the retrieval process.
  • chain code tag Re-rank retrieval results to put the most relevant chunks first, and provide only those chunks to the LLM.
  • chain config tag Use a more capable LLM.
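The re-ranking fix can be sketched as follows. The word-overlap scorer is a stand-in assumption; in practice you would score (query, chunk) pairs with a trained re-ranker such as a cross-encoder, but the surrounding logic (score, sort descending, truncate to top-k) is the same.

```python
import re

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Order retrieved chunks by a crude relevance score and keep the top_k.

    The overlap score below is a placeholder for a real re-ranker model.
    """
    query_words = set(re.findall(r"\w+", query.lower()))

    def score(chunk: str) -> int:
        # Count how many query words appear in the chunk.
        return len(query_words & set(re.findall(r"\w+", chunk.lower())))

    return sorted(chunks, key=score, reverse=True)[:top_k]

chunks = [
    "Pricing is listed on the billing page.",
    "Unity Catalog governs access to tables.",
    "Unity Catalog also governs access to models.",
]
print(rerank("who governs access to models", chunks, top_k=2))
```

Passing only the top-ranked chunks to the LLM keeps the prompt short and puts the most relevant evidence first.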

Generation issue: Generated responses are difficult to understand or lack logical flow

Debugging steps:

  • Assess output for logical flow, grammatical correctness, and understandability.

  • Analyze if incoherence occurs more often with certain types of queries or when certain types of context are retrieved.

Potential fixes:

  • chain config tag Change the prompt template to encourage coherent, well-structured responses.
  • chain config tag Provide more context to the LLM by retrieving additional relevant chunks.
  • chain config tag Use a more capable LLM.
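Providing more context is typically a chain-configuration change, for example raising the retriever’s top-k. All key names and values below are hypothetical; the real names depend on how your chain reads its configuration.

```python
# Hypothetical chain configuration; real key names depend on your chain code.
chain_config = {
    "retriever": {
        "k": 5,                     # raised from 3 to retrieve more chunks
        "chunk_overlap_tokens": 64, # overlap helps preserve cross-chunk context
    },
    "llm": {
        "endpoint": "my-llm-endpoint",  # placeholder serving endpoint name
        "temperature": 0.1,
    },
}

print(chain_config["retriever"]["k"])
```

Because this is a config-only change, you can evaluate it without modifying chain code.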

Generation issue: Generated responses are not in the desired format or style

Debugging steps:

  • Compare output to the expected format and style guidelines.

  • Assess if certain types of queries or retrieved context are more likely to result in format or style deviations.

Potential fixes:

  • chain config tag Update the prompt template to specify the desired output format and style.
  • chain code tag Implement a post-processing step to convert the generated response into the desired format.
  • chain code tag Add a step to validate output structure and style, and output a fallback answer if needed.
  • chain config tag Use an LLM that is fine-tuned to provide outputs in a specific format or style.
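The validate-with-fallback fix can be sketched as follows, assuming the prompt asks the LLM for a JSON object with an "answer" key. That expected shape is a hypothetical contract for illustration; match it to whatever format your prompt actually requests.

```python
import json

FALLBACK = "Sorry, I could not produce a well-formed answer. Please rephrase."

def validate_or_fallback(raw_output: str) -> dict:
    """Parse the LLM output as JSON and check required keys.

    Returns a fallback answer when the output is not valid JSON or is
    missing the expected structure.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"answer": FALLBACK, "sources": []}
    if not isinstance(parsed, dict) or "answer" not in parsed:
        return {"answer": FALLBACK, "sources": []}
    parsed.setdefault("sources", [])
    return parsed

print(validate_or_fallback('{"answer": "42", "sources": ["doc1"]}'))
print(validate_or_fallback("The answer is 42."))  # not JSON, triggers fallback
```

Validation runs after generation, so it adds no latency to the happy path beyond a JSON parse.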

Next step

If you also identified issues with retrieval quality, continue with Step 5 (retrieval). How to debug retrieval quality.

If you think that you have resolved all of the identified issues, continue with Step 6. Iteratively implement & evaluate quality fixes.