Genie data sampling

Preview

This feature is in Public Preview.

This article explains how to improve accuracy in a Genie space by letting Genie sample values from selected columns, which helps it match the terms in user prompts to the values in your data. To enable data sampling, contact your Databricks account team.

Overview

When a user asks a question in Genie, they often use imprecise or conversational phrasing that does not directly match the structure or values in the data. This can lead Genie to misinterpret key terms and generate incorrect SQL queries.

For example, a user might ask:

"Show me car sales in Florida for Q1."

Genie might attempt to match the term “Florida” to a value in the state column using a filter like ILIKE '%Florida%'. If the data uses abbreviations (for example, FL), this filter returns no results.

With data sampling enabled, Genie can access representative values from the state column. This added context helps Genie recognize that FL is the correct match for Florida and generate a more accurate query.

The following table shows how SQL generation differs based on whether sampling is enabled in this example:

Without sampled values           With sampled values
-------------------------------  -------------------
WHERE state ILIKE '%Florida%'    WHERE state = 'FL'

Data sampling improves Genie's ability to generate correct SQL and return the expected results from the data.
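The difference between the two filters can be reproduced locally. The following is a minimal sketch, not Genie's behavior: the car_sales table, its columns, and its rows are hypothetical, and it runs on SQLite from the Python standard library rather than on Databricks SQL. SQLite has no ILIKE, so LIKE (case-insensitive for ASCII text in SQLite) stands in for it.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_sales (state TEXT, quarter TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO car_sales VALUES (?, ?, ?)",
    [("FL", "Q1", 120), ("FL", "Q2", 95), ("GA", "Q1", 80)],
)

# Filter as it might be generated without sampled values.
# SQLite's LIKE stands in for ILIKE here.
without_sampling = conn.execute(
    "SELECT state, units FROM car_sales "
    "WHERE state LIKE '%Florida%' AND quarter = 'Q1'"
).fetchall()

# Filter as it might be generated once 'FL' is known to be a stored value.
with_sampling = conn.execute(
    "SELECT state, units FROM car_sales "
    "WHERE state = 'FL' AND quarter = 'Q1'"
).fetchall()

print(without_sampling)  # [] because no stored value contains "Florida"
print(with_sampling)     # [('FL', 120)]
```

The first query is valid SQL and runs without error, which is why this kind of mismatch is easy to miss: it returns an empty result instead of failing.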

Requirements

  • Genie spaces must be enabled. See Set up Genie.
  • Data sampling Public Preview access must be granted. Contact your Databricks account team.
  • A workspace admin must enable the preview from the Previews page using the Genie Data Sampling tile.
Note

Enabling data sampling can introduce additional latency in response generation.

Choose columns

If you have CAN EDIT privileges in a Genie space, you can select which columns to sample. Choose string columns that provide meaningful context for user prompts, especially columns with categorical or consistently formatted values.

Do not select:

  • Non-string columns: Sampling is supported only for string types.
  • Free-text or unstructured string columns: These often include user IDs, customer reviews, names, or other low-signal content.
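One rough way to screen columns before selecting them is to check that the values are strings and that relatively few of them are distinct. The sketch below is an illustrative heuristic only, not part of Genie: the function name and the 0.5 threshold are assumptions, and the values are expected to come from a sample of the column that you have already loaded.

```python
from typing import Iterable


def looks_like_sampling_candidate(
    values: Iterable[object], max_distinct_ratio: float = 0.5
) -> bool:
    """Return True if a column looks categorical.

    A column qualifies when every non-null value is a string and the share
    of distinct values is at most max_distinct_ratio (a hypothetical cutoff).
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    if not all(isinstance(v, str) for v in non_null):
        return False  # non-string columns are not supported for sampling
    distinct_ratio = len(set(non_null)) / len(non_null)
    return distinct_ratio <= max_distinct_ratio


# Hypothetical column contents.
print(looks_like_sampling_candidate(["FL", "GA", "FL", "NY", "FL", "GA"]))  # True
print(looks_like_sampling_candidate(["u-001", "u-002", "u-003", "u-004"]))  # False
print(looks_like_sampling_candidate([1, 2, 1, 2]))                          # False
```

A column of unique identifiers has a distinct ratio of 1.0 and is rejected, while a repeated set of state codes passes. Tune the threshold to your own data; a small input can make a categorical column look high-cardinality.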
Note

Genie samples up to 255 distinct values per column, and each sampled value is truncated at 127 characters. If a column contains more than 255 distinct values, only a subset of them is sampled.
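To see what these limits mean for a given column, the following sketch applies the same two caps to a list of values. It is an approximation under stated assumptions, not Genie's implementation: the source does not say which values are chosen when a column exceeds the cap or whether truncation happens before deduplication, so this version keeps the first values it encounters and truncates first. The function and constant names are hypothetical.

```python
from typing import Iterable, List, Optional

MAX_DISTINCT_VALUES = 255  # documented cap on distinct values per column
MAX_VALUE_LENGTH = 127     # documented cap on characters per sampled value


def sample_column(values: Iterable[Optional[str]]) -> List[str]:
    """Return up to 255 distinct values, each cut to 127 characters.

    Assumption: first-seen values win and truncation precedes deduplication.
    """
    sampled: List[str] = []
    seen = set()
    for value in values:
        if value is None:
            continue
        truncated = value[:MAX_VALUE_LENGTH]
        if truncated in seen:
            continue
        seen.add(truncated)
        sampled.append(truncated)
        if len(sampled) == MAX_DISTINCT_VALUES:
            break
    return sampled


print(sample_column(["FL", "GA", "FL", None]))             # ['FL', 'GA']
print(len(sample_column([f"v{i}" for i in range(300)])))   # 255
print(len(sample_column(["x" * 200])[0]))                  # 127
```

Running a check like this on a candidate column shows whether its distinct values fit within the cap. A column with thousands of distinct values is only partially represented, which is one reason low-cardinality columns are the better choice.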

Select columns to sample

  1. Click Configure > Data in your Genie space.
  2. Click a table name to view its columns.
  3. Click Add values next to the columns you want Genie to sample.

[Image: A string column with the Add values button on the right.]

Sampled values are stored in your workspace's storage bucket.

To cancel an in-progress sampling operation, click the kebab menu next to the Adding values message.

If the operation fails, click Retry adding values.

After the operation is completed, use the kebab menu to:

  • Refresh added values to take a new sample after the data is updated.
  • Remove added values to delete the current sample.

[Image: Refresh added values and Remove added values options in the UI.]

Refresh added values

Refreshing updates the sample for a column. This is useful when:

  • New values have been added to a column.
  • The format of values in a column has changed.
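Both situations show up as values that exist in the column but not in the sample. The sketch below is a hypothetical helper, not a Genie API: it assumes you have recorded the distinct values that were present when the sample was taken and can query the column's current distinct values yourself.

```python
from typing import Iterable, List, Optional


def values_missing_from_sample(
    sampled_values: Iterable[str],
    current_values: Iterable[Optional[str]],
) -> List[str]:
    """Return current column values that are absent from the recorded sample.

    A non-empty result suggests that a refresh may be worthwhile.
    """
    sampled = set(sampled_values)
    current = {v for v in current_values if v is not None}
    return sorted(current - sampled)


# New value added to the column since the sample was taken.
print(values_missing_from_sample(["FL", "GA"], ["FL", "GA", "TX", None]))  # ['TX']

# Format change: abbreviations replaced by full names.
print(values_missing_from_sample(["FL", "GA"], ["Florida", "Georgia"]))
# ['Florida', 'Georgia']
```

For columns with more than 255 distinct values, some values are always missing from the sample, so a non-empty result is only a hint there.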

How Genie uses sampled data

When responding to a prompt, Genie uses available metadata, comments, instructions, and, when designated, sampled row-level values. It selects relevant columns based on this context. If a selected column includes sampled values, Genie can use them to improve interpretation and query accuracy.

While sampled data is helpful for increasing Genie's accuracy, instructions and example queries are also crucial to creating an effective space. For more information, see Curate an effective Genie space.