Run offline evaluation with a 📖 Evaluation Set

Preview

This feature is in Private Preview. To try it, reach out to your Databricks contact.

Looking for a different RAG Studio doc? Go to the RAG documentation index

This tutorial walks you through using a 📖 Evaluation Set to evaluate a RAG Application's quality, cost, and latency. You perform this step after you create a new version of your RAG Application, to verify that the new version improved quality and did not introduce regressions.

In this tutorial, you use a sample evaluation set provided by Databricks. In the next tutorials, we walk you through collecting user feedback to create your own evaluation set.

Data flow

(Diagram: offline evaluation data flow, with legend.)

Step 1: Load the sample evaluation set

  1. Open a Databricks Notebook and run the following code to save the sample 📖 Evaluation Set to a Unity Catalog schema.

    Note

    RAG Studio supports 📖 Evaluation Sets stored in any Unity Catalog schema, but for organizational purposes, Databricks suggests keeping your evaluation sets in the same Unity Catalog schema as your RAG Application.

    import pandas as pd
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType, ArrayType

    # Each row of the evaluation set contains a `request` (the user's input) and a
    # `ground_truth` label (the expected response and supporting retrieved chunks).
    schema = StructType([
        StructField('request', StructType([
            StructField('request_id', StringType(), True),
            StructField('conversation_id', StringType(), True),
            StructField('timestamp', TimestampType(), True),
            StructField('messages', ArrayType(StructType([
                StructField('role', StringType(), True),
                StructField('content', StringType(), True),
            ]), True), True),
            StructField('last_input', StringType(), True),
        ]), True),
        StructField('ground_truth', StructType([
            StructField('text_output', StructType([
                StructField('content', StringType(), True),
            ]), True),
            StructField('retrieval_output', StructType([
                StructField('name', StringType(), True),
                StructField('chunks', ArrayType(StructType([
                    StructField('doc_uri', StringType(), True),
                    StructField('content', StringType(), True),
                ]), True), True),
            ]), True),
        ]), True),
    ])

    # Replace `catalog.schema` with the Unity Catalog catalog and schema you want to use.
    df = spark.createDataFrame(
        pd.read_json("http://docs.databricks.com/_static/notebooks/rag-studio/example_evaluation_set.json"),
        schema,
    )
    df.write.format("delta").mode("overwrite").saveAsTable(
        'catalog.schema.example_evaluation_set'
    )
    
  2. Inspect the loaded 📖 Evaluation Set to understand the schema; a quick inspection sketch follows after the list below.

    Note

    You will notice that the request schema is identical to the request schema in the 🗂️ Request Log. This is intentional, so that you can easily convert request logs into evaluation sets.

    • request: the user’s input to the RAG Application

    • ground_truth: the ground truth label for the response and retrieval steps

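    To get a feel for the data, you can load the table from step 1 and look at its schema and a couple of rows from a Databricks notebook. The following is a minimal inspection sketch; it assumes you kept the table name catalog.schema.example_evaluation_set from step 1.

    # Minimal inspection sketch -- assumes the table name used in step 1.
    eval_set = spark.table("catalog.schema.example_evaluation_set")

    # Confirm the nested `request` / `ground_truth` structure described above.
    eval_set.printSchema()

    # Peek at what a labeled example looks like.
    eval_set.select("request.last_input", "ground_truth.text_output.content").show(2, truncate=80)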

Step 2: Run offline evaluation to compute metrics

  1. Evaluate version 1 of the application that you created in the first tutorial against the evaluation set by running the following command in your console. This job takes about 10 minutes to complete.

    ./rag run-offline-eval --eval-table-name catalog.schema.example_evaluation_set -v 1 -e dev
    

    Note

    What happens behind the scenes?

    In the background, version 1 of the Chain is run over each row of catalog.schema.example_evaluation_set, using a compute environment identical to the one that serves your Chain. For each row of catalog.schema.example_evaluation_set:

    • A row is written to a 🗂️ Request Log called catalog.schema.example_evaluation_set_request_log inside the same Unity Catalog schema as the evaluation set

    • A row is written to the 👍 Assessment & Evaluation Results Log called catalog.schema.example_evaluation_set_assessment_log with LLM judge assessments and metric computations

    The names of these tables are derived from the name of the input evaluation set table.

    Note that the schemas of the 🗂️ Request Log and 👍 Assessment & Evaluation Results Log are intentionally identical to the logs you viewed in the view logs tutorial. A short sketch for sanity-checking these output tables follows below.

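    Once the job finishes, you can confirm that both output tables exist and that the request log has one row per evaluation set row. The following is a minimal sanity-check sketch, run from a Databricks notebook, using the table names described above.

    # Minimal sanity-check sketch -- table names are derived from the evaluation set name.
    eval_set = spark.table("catalog.schema.example_evaluation_set")
    request_log = spark.table("catalog.schema.example_evaluation_set_request_log")
    assessment_log = spark.table("catalog.schema.example_evaluation_set_assessment_log")

    # The request log should contain one row per evaluation set row.
    print("evaluation set rows:", eval_set.count())
    print("request log rows:", request_log.count())
    print("assessment log rows:", assessment_log.count())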

Step 3: Open the metrics UI

  1. Run the following command to open the metrics Notebook. This job takes about 10 minutes to complete.

    Note

    If you have multiple versions of your application, you can run step 2 for each version, and then pass --versions 2,3,4 or --versions * to compare the different versions within the notebook.

    ./rag explore-eval --eval-table-name catalog.schema.example_evaluation_set -e dev --versions 1
    
  2. Click the URL provided in the console output.

  3. Click to open the Notebook associated with the Databricks Job.

  4. Run the first two cells to populate the widgets, then fill in the names of the tables from step 2:

    • assessment_log_table_name: catalog.schema.example_evaluation_set_assessment_log

    • request_log_table_name: catalog.schema.example_evaluation_set_request_log

  5. Run all cells in the notebook to display the metrics computed from the evaluation set. An illustrative sketch of how the notebook's widgets are typically read follows after these steps.

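    For illustration only, the sketch below shows how a Databricks notebook typically declares and reads text widgets such as the two in step 4. The actual metrics notebook is generated for you by RAG Studio, so treat this as an assumption about its mechanics rather than its exact code.

    # Illustration only: declaring and reading notebook widgets like those in step 4.
    dbutils.widgets.text("assessment_log_table_name", "catalog.schema.example_evaluation_set_assessment_log")
    dbutils.widgets.text("request_log_table_name", "catalog.schema.example_evaluation_set_request_log")

    # Load the tables that the metrics are computed from.
    assessment_log = spark.table(dbutils.widgets.get("assessment_log_table_name"))
    request_log = spark.table(dbutils.widgets.get("request_log_table_name"))
    print(f"Loaded {request_log.count()} request rows and {assessment_log.count()} assessment rows.")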

Follow the next tutorial!

Collect feedback from 🧠 Expert Users