Genie data sampling
This feature is in Public Preview.
This article explains how to improve accuracy in a Genie space by allowing Genie to use sample data, helping it better match user-submitted prompts. To enable data sampling, contact your Databricks account team.
Overview
When a user asks a question in Genie, they often use imprecise or conversational phrasing that does not directly match the structure or values in the data. This can lead Genie to misinterpret key terms and generate incorrect SQL queries.
For example, a user might ask:
"Show me car sales in Florida for Q1."
Genie might attempt to match the term “Florida” to a value in the `state` column using a filter like `ILIKE '%Florida%'`. If the data uses abbreviations (for example, `FL`), this filter returns no results.
With data sampling enabled, Genie can access representative values from the `state` column. This added context helps Genie recognize that `FL` is the correct match for `Florida` and generate a more accurate query.
The following table shows how SQL generation differs based on whether sampling is enabled in this example:
Without sampled values | With sampled values |
---|---|
`WHERE state ILIKE '%Florida%'` | `WHERE state = 'FL'` |
Data sampling improves Genie's ability to generate correct SQL and return the expected results from the data.
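The Florida example can be reproduced locally as a minimal sketch. The `car_sales` table, its columns, and its rows are hypothetical, and SQLite stands in for the SQL warehouse: its `LIKE` operator is case-insensitive for ASCII text, which approximates `ILIKE` here. The queries illustrate the two filter styles and are not the exact SQL that Genie generates.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_sales (state TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO car_sales VALUES (?, ?, ?)",
    [("FL", "Q1", 120000.0), ("FL", "Q2", 95000.0), ("GA", "Q1", 80000.0)],
)

# Without sampled values: the filter is built from the user's wording.
# SQLite's LIKE is case-insensitive for ASCII, standing in for ILIKE.
without_sampling = conn.execute(
    "SELECT SUM(amount) FROM car_sales "
    "WHERE state LIKE '%Florida%' AND quarter = 'Q1'"
).fetchone()[0]

# With sampled values: the filter uses the value stored in the column.
with_sampling = conn.execute(
    "SELECT SUM(amount) FROM car_sales "
    "WHERE state = 'FL' AND quarter = 'Q1'"
).fetchone()[0]

print(without_sampling)  # None, because no rows matched the filter
print(with_sampling)     # 120000.0
```

The first query runs without error but matches nothing, which is why this kind of mismatch is easy to miss without knowing the values in the column.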
Requirements
- Genie spaces must be enabled. See Set up Genie.
- Data sampling Public Preview access must be granted. Contact your Databricks account team.
- A workspace admin must enable the preview from the Previews page using the Genie Data Sampling tile.
Enabling data sampling can introduce additional latency in response generation.
Choose columns
If you have CAN EDIT privileges in a Genie space, you can select which columns to sample. Choose string columns that provide meaningful context for user prompts, especially columns with categorical or consistently formatted values.
Do not select:
- Non-string columns: Sampling is supported only for string types.
- Free-text or unstructured string columns: These often include user IDs, customer reviews, names, or other low-signal content.
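Genie samples up to 255 distinct values per column. Each sampled value is truncated at 127 characters. When a column contains more than 255 distinct values, only a subset of them is sampled. When a value is longer than 127 characters, only its first 127 characters are kept.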
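To get a feel for what these limits mean for a given column, the following sketch applies them to a list of values. Only the two limits (255 distinct values, 127 characters) come from this article. The function name, the choice to keep the first distinct values encountered, the handling of nulls, and truncating before de-duplicating are all assumptions made for illustration, not a description of how Genie selects its subset.

```python
from typing import Iterable, List, Optional

MAX_DISTINCT_VALUES = 255  # per-column limit stated in this article
MAX_VALUE_LENGTH = 127     # per-value character limit stated in this article


def sample_column_values(values: Iterable[Optional[str]]) -> List[str]:
    """Return up to 255 distinct values, each cut to 127 characters.

    Assumption: the first distinct values encountered are kept and
    nulls are skipped. Genie's actual selection logic may differ.
    """
    seen = set()
    sample: List[str] = []
    for value in values:
        if value is None:
            continue
        truncated = value[:MAX_VALUE_LENGTH]
        if truncated in seen:
            continue
        seen.add(truncated)
        sample.append(truncated)
        if len(sample) == MAX_DISTINCT_VALUES:
            break
    return sample


states = ["FL", "GA", "FL", None, "TX"]
print(sample_column_values(states))  # ['FL', 'GA', 'TX']
```

A column such as `state` fits comfortably within the limits, while a free-text column would lose most of its values and have the rest cut short, which is one reason such columns make poor sampling candidates.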
Select columns to sample
- Click Configure > Data in your Genie space.
- Click a table name to view its columns.
- Click Add values next to the columns you want Genie to sample.
Sampled values are stored in your workspace's storage bucket.
To cancel an in-progress sampling operation, click the kebab menu next to the Adding values message.
If the operation fails, click Retry adding values.
After the operation completes, use the kebab menu to:
- Refresh added values to take a new sample after the data is updated.
- Remove added values to delete the current sample.
Refresh added values
Refreshing updates the sample for a column. This is useful when:
- New values have been added to a column.
- The format of values in a column has changed.
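One way to decide whether a refresh is worthwhile is to compare the values currently in the column with the values that were present when the sample was taken. The sketch below assumes you keep your own record of those earlier values, because this article does not describe a way to read the stored sample back. The function name and data are hypothetical.

```python
from typing import Iterable, Set


def unsampled_values(
    current_values: Iterable[str], sampled_values: Iterable[str]
) -> Set[str]:
    """Return values present in the column now but absent from the sample."""
    return set(current_values) - set(sampled_values)


# Hypothetical data: the sample was taken before "TX" rows were loaded.
sampled = ["FL", "GA"]
current = ["FL", "GA", "TX"]

missing = unsampled_values(current, sampled)
if missing:
    print(f"Consider refreshing; values not in the sample: {sorted(missing)}")
```

For columns with more than 255 distinct values, the sample is always a subset, so a non-empty difference is expected and does not by itself indicate stale data.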
How Genie uses sampled data
When responding to a prompt, Genie uses available metadata, comments, instructions, and, when designated, sampled row-level values. It selects relevant columns based on this context. If a selected column includes sampled values, Genie can use them to improve interpretation and query accuracy.
While sampled data is helpful for increasing Genie's accuracy, instructions and example queries are also crucial to creating an effective space. For more information, see Curate an effective Genie space.