Metrics

Preview

This feature is in Private Preview. To try it, reach out to your Databricks contact.


To evaluate your RAG Application, use 📈 Metrics. Databricks provides a set of metrics that enable you to measure the quality, cost and latency of your RAG Application. These metrics are curated by Databricks’ Research team as the most relevant (no pun intended) metrics for evaluating RAG applications.

📈 Metrics are computed using either:

  1. User traffic: 👍 Assessments and 🗂️ Request Log

  2. 📖 Evaluation Set: developer-curated 👍 Assessments and 🗂️ Request Log records that represent common requests

For most metrics, 👍 Assessments come from either the 🤖 LLM Judge, 🧠 Expert Users, or 👤 End Users. A small subset of the metrics, such as answer correctness, requires assessments annotated by 🧠 Expert Users or 👤 End Users.

Collecting 👍 Assessments

From a 🤖 LLM Judge

From 👤 End Users & 🧠 Expert Users

Compute metrics

Metrics are computed as 📈 Evaluation Results by RAG Studio and stored in the 👍 Assessment & Evaluation Results Log.

There are 2 ways to compute metrics:

  1. Automatic: metrics are computed automatically for all traffic that calls the 🔗 Chain’s REST API (hosted on Mosaic AI Model Serving); see the request sketch after this list. Note that this traffic includes traffic from the 💬 Review UI, since that UI calls the REST API.

  2. Manual: metric computation for a Version using a 📖 Evaluation Set can be triggered by following Run offline evaluation with a 📖 Evaluation Set.
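
To illustrate the Automatic path, the following is a minimal sketch of sending a single request to a chain's REST API via the Model Serving invocations endpoint. The endpoint name (`my-rag-chain`) and the `messages` input schema are assumptions; substitute the endpoint name and input format of your deployed chain.

```python
# Send one request to a deployed chain's REST API so it is captured in the
# 🗂️ Request Log and included in Automatic metric computation.
# The endpoint name and request schema below are assumptions for illustration.
import os
import requests

WORKSPACE_URL = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]          # personal access token

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/my-rag-chain/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"messages": [{"role": "user", "content": "How do I create a Delta table?"}]},
)
response.raise_for_status()
print(response.json())
```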

Unstructured docs retrieval & generation metrics

Retriever

RAG Studio supports the following metrics for evaluating the retriever.

| Question to answer | Metric | Per-trace value | Aggregated value | Requires human-annotated assessment | Where it can be measured |
| --- | --- | --- | --- | --- | --- |
| Are the retrieved chunks relevant to the user’s query? | Precision of “relevant chunk” @ K | 0 to 100% | 0 to 100% | ✔️ | Online, Offline Evaluation |
| Are ALL chunks that are relevant to the user’s query retrieved? | Recall of “relevant chunk” @ K | 0 to 100% | 0 to 100% | ✔️ | Online, Offline Evaluation |
| Are the retrieved chunks returned in the correct order of most to least relevant? | nDCG of “relevant chunk” @ K | 0 to 1 | 0 to 1 | ✔️ | Online, Offline Evaluation |
| What is the latency of retrieval? | Latency | milliseconds | average(milliseconds) | n/a | Online, Offline Evaluation |
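
For reference, the sketch below shows how precision@K, recall@K, and nDCG@K are conventionally computed from the retrieved chunk IDs and a human-annotated set of relevant chunks. This is not RAG Studio’s implementation; it only illustrates what the per-trace values in the table mean.

```python
# Reference computations for the retriever metrics above, for a single trace.
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-K retrieved chunks that are relevant."""
    return sum(chunk in relevant for chunk in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top K."""
    return sum(chunk in relevant for chunk in retrieved[:k]) / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: rewards placing relevant chunks near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-based, so position = rank + 1
        for rank, chunk in enumerate(retrieved[:k])
        if chunk in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

retrieved = ["chunk_7", "chunk_2", "chunk_9", "chunk_4"]   # order returned by the retriever
relevant = {"chunk_2", "chunk_4", "chunk_5"}               # ground-truth assessment

print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))     # ~0.67
print(ndcg_at_k(retrieved, relevant, k=4))       # ~0.50
```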

Tip

🚧 Roadmap 🚧

  1. Cost
  2. Do the retrieved chunks contain all the information required to answer the query?
  3. Average Precision (AP)
  4. Mean Average Precision (mAP)
  5. Enabling the 🤖 LLM Judge for retrieval metrics so they do not require a ground-truth assessment.

Generation model (for retrieval)

These metrics measure the generation model’s performance when the prompt is augmented with unstructured docs from a retrieval step.

| Question to answer | Metric | Per-trace value | Aggregated value | Requires human-annotated assessment | Where it can be measured |
| --- | --- | --- | --- | --- | --- |
| Is the LLM responding based ONLY on the context provided (i.e., not hallucinating and not using knowledge from the model’s pre-training)? | Faithfulness (to context) | true/false | 0 to 100% | ✖️ | Online, Offline Evaluation |
| Is the response on-topic given the query AND the retrieved contexts? | Answer relevance (to query given the context) | true/false | 0 to 100% | ✖️ | Online, Offline Evaluation |
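
Both metrics produce a boolean verdict per trace; the aggregated value is simply the share of traces judged true. A minimal sketch of that roll-up, with hard-coded verdicts standing in for judge output:

```python
# Aggregating per-trace true/false verdicts (e.g. faithfulness) into the
# 0 to 100% value reported above. The verdicts would come from the LLM judge
# or from human assessments; they are hard-coded here for illustration.
faithfulness_verdicts = [True, True, False, True, True]   # one verdict per trace

faithfulness_rate = 100 * sum(faithfulness_verdicts) / len(faithfulness_verdicts)
print(f"Faithfulness: {faithfulness_rate:.0f}%")           # Faithfulness: 80%
```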

Tip

🚧 Roadmap 🚧

  1. Did the LLM use the correct information from each provided context?
  2. Does the response answer the entirety of the query? For example, if the query is “who are Bob and Sam?”, is the response about both Bob and Sam?

Data corpus

Tip

🚧 Roadmap 🚧

  1. Does my corpus contain all the information needed to answer a query? In other words, is the index missing any documents required to answer a specific question?

Generation model (any task) metrics

These metrics measure the generation model’s performance. They work for any prompt, augmented or non-augmented.

| Question to answer | Metric | Per-trace value | Aggregated value | Requires human-annotated assessment | Where it can be measured |
| --- | --- | --- | --- | --- | --- |
| What is the cost of the generation? | Token Count | sum(tokens) | sum(tokens) | n/a | Online, Offline Evaluation |
| What is the latency of generation? | Latency | milliseconds | average(milliseconds) | n/a | Online, Offline Evaluation |
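
The sketch below shows how per-trace token counts and latencies roll up into the aggregated values. The column names (`total_tokens`, `latency_ms`) are hypothetical stand-ins; the actual 🗂️ Request Log schema may differ.

```python
# Rolling up per-trace cost and latency values into the aggregates above.
# Column names are hypothetical placeholders for Request Log fields.
import pandas as pd

request_log = pd.DataFrame(
    {
        "request_id": ["r1", "r2", "r3"],
        "total_tokens": [512, 1024, 384],   # per-trace sum(tokens)
        "latency_ms": [850, 1210, 640],     # per-trace generation latency
    }
)

print("Total tokens:", request_log["total_tokens"].sum())      # sum(tokens) = 1920
print("Avg latency (ms):", request_log["latency_ms"].mean())   # average(milliseconds) = 900.0
```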

RAG chain metrics

These metrics measure the chain’s final response back to the user.

| Question to answer | Metric | Per-trace value | Aggregated value | Requires human-annotated assessment | Where it can be measured |
| --- | --- | --- | --- | --- | --- |
| Is the response accurate (correct)? | Answer correctness (vs. ground truth) | true/false | 0 to 100% | ✔️ | Offline Evaluation |
| Does the response violate any of my company policies (racism, toxicity, etc.)? | Toxicity | true/false | 0 to 100% | ✖️ | Online, Offline Evaluation |
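
Because answer correctness compares the chain’s response against a ground-truth answer, it can only be computed in Offline Evaluation against a 📖 Evaluation Set. The sketch below illustrates the join-and-score step; the column names and the `judge_correctness()` helper are hypothetical placeholders, not RAG Studio APIs.

```python
# Sketch of offline answer-correctness scoring: join chain responses with the
# Evaluation Set's ground-truth answers, judge each pair, then aggregate.
import pandas as pd

responses = pd.DataFrame(
    {"request_id": ["r1", "r2"], "response": ["Use CREATE TABLE ...", "Delta Lake is ..."]}
)
evaluation_set = pd.DataFrame(
    {"request_id": ["r1", "r2"],
     "ground_truth": ["Use CREATE TABLE ...", "Delta Lake is an open table format ..."]}
)

def judge_correctness(response: str, ground_truth: str) -> bool:
    # Placeholder: in practice a human or LLM judge compares the response
    # to the ground-truth answer; exact string match is used here only for illustration.
    return response.strip().lower() == ground_truth.strip().lower()

joined = responses.merge(evaluation_set, on="request_id")
joined["correct"] = [
    judge_correctness(r, gt) for r, gt in zip(joined["response"], joined["ground_truth"])
]
print(f"Answer correctness: {100 * joined['correct'].mean():.0f}%")   # Answer correctness: 50%
```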

Tip

🚧 Roadmap 🚧

  1. Total cost
  2. Total latency
  3. Answer similarity (to ground truth) using Spearman correlation based on cosine distance
  4. Metrics based on assessor-selected reason codes (e.g., helpful, too wordy, etc.)
  5. User retention rate and other traditional app engagement metrics
  6. Is the response in line with my company standards (proper grammar, tone of voice, etc.)?
  7. Additional assessments for “Does the response violate any of my company policies (racism, toxicity, etc.)?” based on LLaMa-Guard
  8. % of conversations with no negative feedback signals