Use benchmarks in a Genie space

This page explains how to use benchmarks to evaluate the accuracy of your Genie space.

Overview

Benchmarks allow you to create a set of test questions that you can run to assess Genie's overall response accuracy. A well-designed set of benchmarks covering the most frequently asked user questions helps evaluate the accuracy of your Genie space as you refine it.

Benchmark questions run as new conversations. They do not carry the same context as a threaded Genie conversation. Each question is processed as a new query, using the instructions defined in the space, including any provided example SQL and SQL functions.
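
For example, a space might provide SQL functions that benchmark questions end up exercising. The function below is purely illustrative, with a hypothetical catalog, schema, table, and columns; it is not required by the benchmark feature:

    -- Hypothetical SQL function that a Genie space might expose so that
    -- revenue questions resolve to vetted logic.
    CREATE OR REPLACE FUNCTION main.sales.revenue_by_region(start_date DATE, end_date DATE)
    RETURNS TABLE (region STRING, total_revenue DOUBLE)
    RETURN
      SELECT region, SUM(amount) AS total_revenue
      FROM main.sales.orders
      WHERE order_date >= start_date
        AND order_date < end_date
      GROUP BY region;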

Example benchmarks with accuracy reported on nine questions.

Add benchmark questions

Benchmark questions should reflect different ways of phrasing the common questions your users ask. You can use them to check Genie's response to variations in question phrasing or different question formats.

When creating a benchmark question, you can optionally include a SQL query whose result set is the correct answer. During benchmark runs, accuracy is assessed by comparing the result set from your SQL query to the one generated by Genie.
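
For example, a benchmark question such as "What was total revenue by region last quarter?" could be paired with a SQL answer along the following lines. The catalog, table, and column names are hypothetical placeholders; substitute the data your space actually uses:

    -- Hypothetical SQL answer for the benchmark question:
    -- "What was total revenue by region last quarter?"
    SELECT
      region,
      SUM(amount) AS total_revenue
    FROM main.sales.orders
    WHERE order_date >= date_trunc('QUARTER', add_months(current_date(), -3))
      AND order_date <  date_trunc('QUARTER', current_date())
    GROUP BY region
    ORDER BY total_revenue DESC;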

To add a benchmark question:

  1. Click the kebab menu icon in the upper-right corner of the Genie space. Then, click Benchmarks.

  2. Click Add benchmark.

  3. In the Question field, enter a benchmark question to test.

  4. (Optional) Enter the SQL statement that accurately answers the question you entered.

    note

    This step is recommended. Only questions that include this example SQL statement can be automatically assessed for accuracy. Any questions that do not include a SQL Answer require manual review to be scored.

  5. (Optional) Click Run to run your query and view the results.

  6. When you're finished editing, click Add benchmark.

  7. To update a question after saving, click the pencil icon to open the Update question dialog.

Use benchmarks to test alternate question phrasings

When evaluating the accuracy of your Genie space, it's important to structure tests to reflect realistic scenarios. Users may ask the same question in different ways. Databricks recommends adding multiple phrasings of the same question, paired with the same example SQL, to fully assess accuracy. Most Genie spaces should include two to four phrasings of each common question.
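
For example, all of the following phrasings could be added as separate benchmark questions that share one SQL answer. The questions and the table below are hypothetical; adapt them to your own data:

    -- Benchmark question 1: "Which five products sold the most units last month?"
    -- Benchmark question 2: "Top 5 products by units sold in the previous month"
    -- Benchmark question 3: "Show me last month's best sellers, limited to five products"
    -- Shared SQL answer for all three phrasings:
    SELECT
      product_name,
      SUM(quantity) AS units_sold
    FROM main.sales.order_items
    WHERE order_date >= date_trunc('MONTH', add_months(current_date(), -1))
      AND order_date <  date_trunc('MONTH', current_date())
    GROUP BY product_name
    ORDER BY units_sold DESC
    LIMIT 5;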

Run benchmark questions

Users with at least CAN EDIT permissions in a Genie space can run a benchmark evaluation at any time. Running a benchmark evaluation automatically runs all benchmark questions.

For each question, Genie interprets the input, generates SQL, and returns results. The generated SQL and results are then compared against the SQL Answer defined in the benchmark question.
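
If you want to spot-check a generated query against a SQL Answer yourself, one simple approach is to compare the two result sets with EXCEPT. The queries below are hypothetical, and this is only a sketch of the comparison idea, not how Genie scores benchmarks internally:

    -- Rows that the SQL Answer returns but the generated query does not.
    -- Run the comparison in both directions; empty results both ways mean
    -- the two result sets match.
    (SELECT region, SUM(amount) AS total_revenue
     FROM main.sales.orders
     GROUP BY region)
    EXCEPT
    (SELECT region, SUM(amount) AS total_revenue
     FROM main.sales.orders
     WHERE order_status = 'completed'
     GROUP BY region);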

To run all benchmark questions:

  1. Click the kebab menu icon in the upper-right corner of the Genie space. Then, click Benchmarks.
  2. Click Run benchmarks to start the test run.
note

If you close this page, the benchmark run automatically pauses. You can resume the test when you reopen the page.

Interpret ratings

The following criteria determine how Genie responses are rated:

Rated Good (any of the following conditions is met):

  - Genie generates SQL that exactly matches the provided SQL Answer.
  - Genie generates a result set that exactly matches the result set produced by the SQL Answer.
  - Genie generates a result set that includes extra columns compared to the result set produced by the SQL Answer.
  - Genie generates a result set with the same data as the SQL Answer but sorted differently.
  - Genie generates a result set with numeric values that round to the same 4 significant digits as the SQL Answer.

Rated Bad (either of the following conditions is met):

  - Genie generates SQL that produces an empty result set or returns an error.
  - Genie generates a single-cell result that differs from the single-cell result produced by the SQL Answer.

Manual review needed: Responses are marked with this label when Genie cannot automatically assess correctness, or when the Genie-generated query results do not exactly match the results from the provided SQL Answer. Any benchmark question that does not include a SQL Answer must be reviewed manually.
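
As a hypothetical illustration of these criteria, suppose a benchmark question has the first query below as its SQL Answer and Genie generates the second query. The generated result set contains an extra column and a different sort order, but because it still includes the expected data it would be rated Good:

    -- SQL Answer defined on the benchmark question (hypothetical tables):
    SELECT region, SUM(amount) AS total_revenue
    FROM main.sales.orders
    GROUP BY region;

    -- Genie-generated SQL with an extra column and a different sort order.
    -- The expected data is still present, so the response is rated Good.
    SELECT region, SUM(amount) AS total_revenue, COUNT(*) AS order_count
    FROM main.sales.orders
    GROUP BY region
    ORDER BY region;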

Access benchmark evaluations

You can access all of your benchmark evaluations to track accuracy in your Genie space over time. When you open a space's Benchmarks, a timestamped list of evaluation runs appears in the Evaluations tab. If no evaluation runs are found, see Add benchmark questions or Run benchmark questions.

Evaluations screen as described in the text that follows.

The Evaluations tab shows an overview of evaluations and their performance reported in the following categories:

  - Evaluation name: A timestamp that indicates when an evaluation run occurred. Click the timestamp to see details for that evaluation.
  - Execution status: Indicates whether the evaluation is completed, paused, or unsuccessful. If an evaluation run includes benchmark questions that do not have predefined SQL answers, it is marked for review in this column.
  - Accuracy: A numeric assessment of accuracy across all benchmark questions. For evaluation runs that require manual review, an accuracy measure appears only after those questions have been reviewed.
  - Created by: The name of the user who ran the evaluation.

Review individual evaluations

You can review individual evaluations to get a detailed look at each response. You can edit the assessment for any question and update any items that need manual review.

To review individual evaluations:

  1. Click the kebab menu icon in the upper-right corner of the Genie space. Then, click Benchmarks.

  2. Click the timestamp for any evaluation in the Evaluation name column to open a detailed view of that test run.

    A screen that shows the results of a single evaluation run. All questions are listed on the left. If applicable, individual questions are shown on the right with the model output and the ground truth output.

  3. Click a question on the left side of the screen to see the associated details. Use the evaluation detail screen to perform the next steps.

  4. Review and compare the Model output response with the Ground truth response.

    note

    The results of these responses appear in the evaluation details for one week. After one week, the results are no longer visible. The generated SQL statement and the example SQL statement remain.

  5. Click the Edit icon on the label to edit the assessment.

    Mark each result as Good or Bad to get an accurate score for this evaluation.