Introduction to evaluation & monitoring RAG applications

Evaluation and monitoring are critical to understanding whether your RAG application meets the quality, cost, and latency requirements of your use case. Evaluation happens during development, and monitoring happens once the application is deployed to production, but the fundamental components are similar.

RAG over unstructured data is a complex system with many components that impact the application’s quality. Adjusting any single element can have cascading effects on the others. For instance, data formatting changes can influence the retrieved chunks and the LLM’s ability to generate relevant responses. Therefore, it’s crucial to evaluate each of the application’s components in addition to the application as a whole in order to iteratively refine it based on those assessments.

Evaluation & monitoring: Classical ML vs. generative AI

Evaluation and monitoring of Generative AI applications, including RAG, differ from classical machine learning in several ways:

Metrics

  • Classical ML: Metrics evaluate the inputs and outputs of the single component, for example, feature drift, precision, recall, latency, and so on. Since there is only one component, the overall metrics are the component metrics.

  • Generative AI: Component metrics evaluate the inputs and outputs of each component, for example, precision @ K, nDCG, latency, toxicity, and so on. Compound metrics evaluate how multiple components interact; for example, faithfulness measures the generator’s adherence to the knowledge from the retriever, which requires the chain input, the chain output, and the output of the internal retriever. Overall metrics evaluate the overall input and output of the system, for example, answer correctness and latency.

Evaluation

  • Classical ML: The answer is deterministically “right” or “wrong,” so deterministic metrics work.

  • Generative AI: The answer is still “right” or “wrong,” but there are many right answers (non-deterministic), and some right answers are more right than others. You need human feedback to be confident and LLM-judged metrics to scale evaluation.
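
To make these metric categories concrete, here is a minimal sketch of how a component metric (retrieval precision @ K) and an overall metric (end-to-end latency) might be computed for a single evaluation record. The record fields, the app interface, and the helper names are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass, field
import time

# Illustrative evaluation record; the field names are assumptions, not a specific schema.
@dataclass
class EvalRecord:
    request: str
    expected_doc_ids: set                       # ground-truth relevant documents
    retrieved_doc_ids: list = field(default_factory=list)
    response: str = ""
    latency_seconds: float = 0.0

def precision_at_k(record: EvalRecord, k: int = 5) -> float:
    """Component metric: fraction of the top-k retrieved documents that are relevant."""
    top_k = record.retrieved_doc_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in record.expected_doc_ids) / len(top_k)

def run_and_measure(app, record: EvalRecord) -> EvalRecord:
    """Overall metric: wall-clock latency of the full chain, captured alongside the output."""
    start = time.time()
    result = app(record.request)                # `app` is your RAG chain; its interface is assumed
    record.latency_seconds = time.time() - start
    record.retrieved_doc_ids = result["retrieved_doc_ids"]
    record.response = result["response"]
    return record
```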

Components of evaluation and monitoring

Effectively evaluating and monitoring RAG application quality, cost, and latency requires several components:

  • Evaluation set: To rigorously evaluate your RAG application, you need a curated set of evaluation queries (and ideally expected outputs) that are representative of the application’s intended use. These evaluation examples should be challenging, diverse, and updated to reflect changing usage and requirements. A minimal sketch of what such a set might look like appears after this list.

  • Metric definitions: You can’t manage what you don’t measure. To improve RAG quality, it is essential to define what quality means for your use case. Depending on the application, important metrics might include response accuracy, latency, cost, or ratings from key stakeholders. You’ll need metrics that measure each component, how the components interact with each other, and the overall system.

  • LLM judges: Given the open-ended nature of LLM responses, it is not feasible to read every response each time you evaluate in order to determine whether the output is correct. Using an additional, different LLM to review outputs helps you scale evaluation and compute additional metrics, such as the groundedness of a response to thousands of tokens of context, that would be infeasible for human raters to assess effectively at scale. A sketch of a simple judge prompt appears after this list.

  • Evaluation harness: During development, an evaluation harness helps you quickly execute your application for every record in your evaluation set and then run each output through your LLM judges and metric computations. Because this step “blocks” your inner dev loop, speed is of the utmost importance. A good evaluation harness parallelizes this work as much as possible, often spinning up additional infrastructure such as more LLM capacity to do so; a sketch of a parallelized harness appears after this list.

  • Stakeholder-facing UI: As a developer, you may not be a domain expert in the content of the application you are developing. To collect feedback from human experts who can assess your application’s quality, you need an interface that allows them to interact with the application and provide detailed feedback.

  • Production trace logging: Once in production, you need to evaluate a significantly higher volume of requests and responses, and to understand how each response was generated. For example, you need to know whether the root cause of a low-quality answer is the retrieval step or a hallucination. Your production logging must track the inputs, outputs, and intermediate steps such as document retrieval to enable ongoing monitoring and early detection and diagnosis of issues that arise in production; a sketch of this kind of trace capture appears after this list.
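
As a minimal illustration of the evaluation set described above, the sketch below defines a few records, each with a request, an expected response, and the documents that should support it. The field names and example contents are purely illustrative, not a required schema.

```python
# A small, curated evaluation set. Field names and contents are illustrative only.
evaluation_set = [
    {
        "request": "What is the company's parental leave policy?",
        "expected_response": "Employees receive 12 weeks of paid parental leave.",
        "expected_doc_ids": ["hr_policy_2024.pdf#section-3"],
    },
    {
        "request": "How do I rotate my API access token?",
        "expected_response": "Generate a new token in account settings, then revoke the old one.",
        "expected_doc_ids": ["developer_guide.md#tokens"],
    },
    # Add challenging and diverse examples, and revisit them as usage evolves.
]
```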
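
For the LLM judge, the sketch below shows one simple pattern: a second, different model is prompted to rate how well a response is grounded in the retrieved context. The call_judge_llm callable and the 1-to-5 rubric are assumptions; substitute whichever model client and rating scale you use.

```python
JUDGE_PROMPT = """You are evaluating a RAG application's answer.

Question: {question}
Retrieved context: {context}
Answer: {answer}

On a scale of 1 (not grounded) to 5 (fully grounded), how well is the answer
supported by the retrieved context? Reply with a single integer."""

def judge_groundedness(call_judge_llm, question: str, context: str, answer: str) -> int:
    """Ask a separate judge LLM to score groundedness; `call_judge_llm` is a placeholder
    for your model client (a function that takes a prompt string and returns text)."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw_score = call_judge_llm(prompt)
    return int(raw_score.strip())
```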
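
For the evaluation harness, here is a minimal sketch of parallel execution over the evaluation set using a thread pool. The run_and_score helper, the app and judge interfaces, and the degree of parallelism are assumptions that depend on how much LLM capacity you can provision.

```python
from concurrent.futures import ThreadPoolExecutor

def run_and_score(app, judge, record: dict) -> dict:
    """Run one evaluation record through the app, then through the judge."""
    result = app(record["request"])                     # your RAG chain; interface assumed
    score = judge(record["request"], result["context"], result["response"])
    return {**record, "response": result["response"], "groundedness": score}

def run_evaluation(app, judge, evaluation_set: list, max_workers: int = 8) -> list:
    """Fan the evaluation set out across worker threads to keep the inner dev loop fast."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_and_score, app, judge, rec) for rec in evaluation_set]
        return [f.result() for f in futures]
```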
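
Finally, for production trace logging, the sketch below captures the input, the intermediate retrieval step, and the final output for each request, so a low-quality answer can later be attributed to retrieval or to generation. The trace structure, the retriever and generator interfaces, and the log_trace sink are assumptions; in practice you would write to your logging or observability system.

```python
import json
import time
import uuid

def log_trace(trace: dict) -> None:
    """Placeholder sink; in practice, write to your logging/observability backend."""
    print(json.dumps(trace))

def answer_with_tracing(retriever, generator, request: str) -> str:
    """Serve a request while capturing the input, retrieved documents, and output."""
    trace = {"trace_id": str(uuid.uuid4()), "timestamp": time.time(), "request": request}

    documents = retriever(request)                      # intermediate step: retrieval
    trace["retrieved_doc_ids"] = [doc["id"] for doc in documents]

    response = generator(request, documents)            # intermediate step: generation
    trace["response"] = response

    log_trace(trace)                                    # enables monitoring and root-cause analysis
    return response
```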

Evaluation is covered in much more detail in Evaluate RAG quality.