Create an 📖 Evaluation Set


This feature is in Private Preview. To try it, reach out to your Databricks contact.


This tutorial walks you through creating a 📖 Evaluation Set to evaluate a RAG Application’s quality, cost, and latency.

This evaluation set allows you to quickly and quantitatively check the quality of a new version of your application before distributing it to stakeholders for their feedback.

Step 1: Create an evaluation set with only questions

You can either collect 🗂️ Request Logs or manually curate questions.

To collect 🗂️ Request Logs:

  1. Use the 💬 Review UI to ask the RAG Application questions.

  2. Run the following SQL to create a Unity Catalog table called <eval_table_name>. This table can be stored in any Unity Catalog schema, but we suggest storing it in the Unity Catalog schema you configured for the RAG Application.


    You can modify the SQL code to only select a subset of logs. If you do this, make sure you keep the original schema of the request column.

    CREATE TABLE <eval_table_name> CLONE <catalog>.<schema>.rag_studio_<app_name>_<environment>_eval_dataset_template;
    INSERT INTO <eval_table_name> SELECT request FROM <catalog>.<schema>.<request_log> LIMIT 5;


    The schema of request is intentionally the same between the request logs and the evaluation set.

To manually curate questions:

  1. Clone the <catalog>.<schema>.rag_studio_<app_name>_<environment>_eval_dataset_template table to create a new table called <eval_table_name>.

    CREATE TABLE <eval_table_name> CLONE <catalog>.<schema>.rag_studio_<app_name>_<environment>_eval_dataset_template;

  2. Add questions to <eval_table_name>, ensuring that the request column has the same schema as shown below.

    {
      "request_id": "c20cb3a9-23d0-48ac-a2cb-98c47a0b84e2",
      "conversation_id": null,
      "timestamp": "2024-01-18T23:22:52.701Z",
      "messages": [
        {
          "role": "user",
          "content": "Hello how are you"
        }
      ],
      "last_input": "Hello how are you"
    }


    The messages array follows the OpenAI messages format. You can include any number of role/content pairs.
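As a minimal sketch of the request structure described above, the following Python builds one request value from a list of role/content pairs. The helper name `make_request` is hypothetical and not part of RAG Studio. The sketch assumes that `last_input` holds the content of the final message, as in the single-message example, and that a fresh UUID and the current UTC time are acceptable for `request_id` and `timestamp`.

```python
import json
import uuid
from datetime import datetime, timezone


def make_request(messages):
    """Build one `request` value from a list of (role, content) pairs.

    Hypothetical helper: the field names mirror the example schema.
    Assumption: last_input is the content of the final message.
    """
    if not messages:
        raise ValueError("at least one message is required")
    now = datetime.now(timezone.utc)
    return {
        "request_id": str(uuid.uuid4()),
        "conversation_id": None,
        # ISO 8601 with millisecond precision and a trailing Z for UTC
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "messages": [{"role": role, "content": content} for role, content in messages],
        "last_input": messages[-1][1],
    }


request = make_request([("user", "Hello how are you")])
print(json.dumps(request, indent=2))
```

How the resulting value is written into the request column depends on the template table's column type, which this sketch does not cover.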

Step 2 (optional): Add ground-truth data for each question

Databricks suggests adding ground-truth answers and retrieved contexts to the questions you just created, which lets you measure the quality of your application more accurately. This step is optional: without it you can still use the rest of RAG Studio’s functionality, but the answer correctness metric and the retrieval metrics are not computed.

  1. Open a Databricks Notebook.

  2. Create a Spark DataFrame with the following schema:

    from pyspark.sql.types import StructType, StringType, StructField
    schema = StructType([StructField('id', StringType(), True), StructField('answer', StringType(), True), StructField('doc_ids', StringType(), True)])
    labeled_df = spark.createDataFrame([], schema)
  3. For each request_id in your <eval_table_name> from above, add a ground-truth answer to the DataFrame.

  4. Append the ground-truth labels to your <eval_table_name>:

    %pip install ""
    from databricks.rag.utils import add_labels_to_eval_dataset
    add_labels_to_eval_dataset(labeled_df, "<eval_table_name>")
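Step 3 above could be prepared with a small helper like the one below, offered as a sketch rather than a RAG Studio API. The name `build_label_rows` is hypothetical. The sketch assumes the `id` column holds the `request_id`, and it passes `doc_ids` through as an opaque string because this tutorial does not specify how multiple document IDs are encoded. The answer and document ID shown are placeholder values.

```python
def build_label_rows(request_ids, labels):
    """Return (id, answer, doc_ids) tuples in the column order of the schema.

    `labels` maps request_id -> (answer, doc_ids). Hypothetical helper;
    doc_ids is treated as an opaque string (its encoding is an assumption).
    """
    missing = [rid for rid in request_ids if rid not in labels]
    if missing:
        raise KeyError(f"no ground-truth label for request_id(s): {missing}")
    return [(rid, labels[rid][0], labels[rid][1]) for rid in request_ids]


rows = build_label_rows(
    ["c20cb3a9-23d0-48ac-a2cb-98c47a0b84e2"],
    # Placeholder answer and document ID, for illustration only
    {"c20cb3a9-23d0-48ac-a2cb-98c47a0b84e2": ("I am fine, thank you.", "doc_1")},
)

# In a Databricks notebook, where `spark` and `schema` are already defined:
# labeled_df = spark.createDataFrame(rows, schema)
```

Raising on a missing label surfaces unlabeled questions before the append in step 4. If you only intend to label some questions, filter `request_ids` first.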

Follow the next tutorial!

Deploy a RAG application to production