Step 5. Identify the root cause of quality issues
See the GitHub repository for the sample code in this section.
Expected time: 60 minutes.
Requirements
Evaluation results for the POC are available in MLflow. If you followed Step 4. Evaluate the POC’s quality, the results are already there.
All requirements from previous steps.
Overview
The most likely root causes of quality issues are the retrieval and generation steps. To determine where to focus first, use the output of the Mosaic AI Agent Evaluation LLM judges that you ran in the previous step to identify the most frequent root cause that impacts your app’s quality.
Each row of your evaluation set is tagged as follows:
Overall assessment: ✅ or ❌.
Root cause: Improve Retrieval or Improve Generation.
Root cause rationale: A brief description of why the root cause was selected.
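
To see which root cause to address first, you can tally the root cause tag across the failing rows. The following is a minimal sketch, assuming each evaluation row has been exported as a dictionary. The keys `overall_assessment` and `root_cause`, the `pass`/`fail` values, and the sample rows are hypothetical placeholders, not the exact column names that Agent Evaluation produces.

```python
from collections import Counter

# Hypothetical per-row tags; the real column names and values may differ.
rows = [
    {"overall_assessment": "fail", "root_cause": "Improve Retrieval"},
    {"overall_assessment": "fail", "root_cause": "Improve Generation"},
    {"overall_assessment": "fail", "root_cause": "Improve Retrieval"},
    {"overall_assessment": "pass", "root_cause": None},
]


def most_frequent_root_causes(rows):
    """Count root causes over failing rows, most frequent first."""
    counts = Counter(
        row["root_cause"]
        for row in rows
        if row["overall_assessment"] == "fail" and row["root_cause"]
    )
    return counts.most_common()


ranked = most_frequent_root_causes(rows)
print(ranked)
```

The first entry of `ranked` is the root cause to focus on first.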
Instructions
The approach depends on whether your evaluation set contains the ground-truth responses to your questions. These responses are stored in expected_response. If you have expected_response available, use the table Root cause analysis if ground truth is available. Otherwise, use the table Root cause analysis if ground truth is not available.
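
The choice between the two tables can be sketched as a small check on the evaluation set. This sketch assumes the evaluation set is a list of dictionaries and treats ground truth as available only when every row has a non-empty `expected_response`. That all-rows rule and the helper name `pick_root_cause_table` are assumptions for illustration; the notebook may handle partially labeled sets differently.

```python
def pick_root_cause_table(rows):
    """Return the title of the root cause table that applies to an evaluation set."""
    # Assumption: every row must carry a non-empty expected_response.
    has_ground_truth = bool(rows) and all(
        (row.get("expected_response") or "").strip() for row in rows
    )
    if has_ground_truth:
        return "Root cause analysis if ground truth is available"
    return "Root cause analysis if ground truth is not available"


# Hypothetical evaluation rows.
labeled = [{"request": "What is RAG?", "expected_response": "Retrieval-augmented generation."}]
unlabeled = [{"request": "What is RAG?"}]

print(pick_root_cause_table(labeled))
print(pick_root_cause_table(unlabeled))
```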
Open the B_quality_iteration/01_root_cause_quality_issues notebook.
Run the cells that are relevant to your use case, depending on whether you have expected_response.
Review the output tables to determine the most frequent root cause in your application.
For each root cause, follow the steps below to further debug and identify potential fixes:
Root cause analysis if ground truth is available
Note
If you have human-labeled ground truth for which document should be retrieved for each question, you can optionally substitute retrieval/llm_judged/chunk_relevance/precision/average with the score for retrieval/ground_truth/document_recall/average.
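
The substitution described in this note could look like the following sketch. It assumes the aggregate metrics are available as a plain dictionary keyed by metric name, and it prefers document recall whenever that metric is present. The preference rule, the helper name `retrieval_score`, and the sample values are illustrative assumptions; only the two metric names come from this guide.

```python
LLM_JUDGED = "retrieval/llm_judged/chunk_relevance/precision/average"
GROUND_TRUTH = "retrieval/ground_truth/document_recall/average"


def retrieval_score(metrics):
    """Return (metric_name, value), preferring document recall when it exists."""
    if metrics.get(GROUND_TRUTH) is not None:
        return GROUND_TRUTH, metrics[GROUND_TRUTH]
    return LLM_JUDGED, metrics[LLM_JUDGED]


# Hypothetical metric values for illustration only.
without_labels = {LLM_JUDGED: 0.4}
with_labels = {LLM_JUDGED: 0.4, GROUND_TRUTH: 0.7}

print(retrieval_score(without_labels))
print(retrieval_score(with_labels))
```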
Chunk relevance precision | Groundedness | Correctness | Relevance to query | Issue summary | Root cause | Overall rating |
---|---|---|---|---|---|---|
<50% | ❌ | ❌ | ❌ | Retrieval is poor. | | |
<50% | ❌ | ❌ | ✅ | LLM generates relevant response, but retrieval is poor. For example, the LLM ignores retrieval and uses its training knowledge to answer. | | |
<50% | ❌ | ✅ | ✅ or ❌ | Retrieval quality is poor, but LLM gets the answer correct regardless. | | |
<50% | ✅ | ❌ | ❌ | Response is grounded in retrieval, but retrieval is poor. | | |
<50% | ✅ | ❌ | ✅ | Relevant response grounded in the retrieved context, but retrieval may not be related to the expected answer. | | |
<50% | ✅ | ✅ | ✅ or ❌ | Retrieval finds enough information for the LLM to correctly answer. | None | |
>50% | ❌ | ❌ | ✅ or ❌ | Hallucination. | | |
>50% | ❌ | ✅ | ✅ or ❌ | Hallucination, correct but generates details not in context. | | |
>50% | ✅ | ❌ | ❌ | Good retrieval, but the LLM does not provide a relevant response. | | |
>50% | ✅ | ❌ | ✅ | Good retrieval and relevant response, but not correct. | | |
>50% | ✅ | ✅ | ✅ | No issues. | None | |
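
The lookup in the table above can be sketched as a function that maps one row's judge results to its issue summary. This sketch assumes precision is a fraction between 0 and 1 and that the three judge results are booleans. The function name is hypothetical, and treating exactly 50% as the ">50%" case is an assumption because the table does not cover that boundary. Combinations the table does not list return None.

```python
def diagnose_with_ground_truth(precision, grounded, correct, relevant):
    """Return the issue summary for one evaluation row, or None if no table row matches."""
    # Assumption: exactly 0.5 is grouped with the ">50%" rows.
    if precision < 0.5:
        if not grounded and not correct:
            if relevant:
                return "LLM generates relevant response, but retrieval is poor."
            return "Retrieval is poor."
        if not grounded:
            return "Retrieval quality is poor, but LLM gets the answer correct regardless."
        if not correct:
            if relevant:
                return ("Relevant response grounded in the retrieved context, "
                        "but retrieval may not be related to the expected answer.")
            return "Response is grounded in retrieval, but retrieval is poor."
        return "Retrieval finds enough information for the LLM to correctly answer."
    if not grounded:
        if correct:
            return "Hallucination, correct but generates details not in context."
        return "Hallucination."
    if not correct:
        if relevant:
            return "Good retrieval and relevant response, but not correct."
        return "Good retrieval, but the LLM does not provide a relevant response."
    # Grounded and correct but not relevant is not listed in the table.
    return "No issues." if relevant else None


print(diagnose_with_ground_truth(0.2, False, False, False))
print(diagnose_with_ground_truth(0.9, True, True, True))
```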
Root cause analysis if ground truth is not available
Chunk relevance precision | Groundedness | Relevance to query | Issue summary | Root cause | Overall rating |
---|---|---|---|---|---|
<50% | ❌ | ❌ | Retrieval quality is poor. | | |
<50% | ❌ | ✅ | Retrieval quality is poor. | | |
<50% | ✅ | ❌ | Response is grounded in retrieval, but retrieval is poor. | | |
<50% | ✅ | ✅ | Response is relevant and grounded in the retrieved context, but retrieval is poor. | | |
>50% | ❌ | ❌ | Hallucination. | | |
>50% | ❌ | ✅ | Hallucination. | | |
>50% | ✅ | ❌ | Good retrieval and grounded, but LLM does not provide a relevant response. | | |
>50% | ✅ | ✅ | Good retrieval and relevant response. Collect ground-truth to know if the answer is correct. | None | |
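
Without ground truth there are only eight combinations, so the table above can be expressed as a direct lookup. This is a sketch under the same assumptions as before: precision is a fraction between 0 and 1, judge results are booleans, and exactly 50% is grouped with the ">50%" rows. The names `ISSUES` and `diagnose_without_ground_truth` are hypothetical.

```python
# Keys are (low_retrieval, grounded, relevant).
ISSUES = {
    (True, False, False): "Retrieval quality is poor.",
    (True, False, True): "Retrieval quality is poor.",
    (True, True, False): "Response is grounded in retrieval, but retrieval is poor.",
    (True, True, True): "Response is relevant and grounded in the retrieved context, "
                        "but retrieval is poor.",
    (False, False, False): "Hallucination.",
    (False, False, True): "Hallucination.",
    (False, True, False): "Good retrieval and grounded, but LLM does not provide "
                          "a relevant response.",
    (False, True, True): "Good retrieval and relevant response. Collect ground-truth "
                         "to know if the answer is correct.",
}


def diagnose_without_ground_truth(precision, grounded, relevant):
    """Return the issue summary for one evaluation row that has no expected_response."""
    low_retrieval = precision < 0.5  # assumption: exactly 0.5 counts as ">50%"
    return ISSUES[(low_retrieval, grounded, relevant)]


print(diagnose_without_ground_truth(0.3, True, False))
print(diagnose_without_ground_truth(0.8, False, True))
```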