Full-text search indexes on Unity Catalog managed tables
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
A full-text search index accelerates lookups on one or more text columns of a managed Delta Lake or Iceberg table. The index supports substring matching and word matching. When you query the table with the search or isearch functions, Databricks uses the index to skip files that are guaranteed not to contain matching rows. This significantly reduces the amount of data scanned, especially for selective lookups.
Indexes created during the Beta release aren't guaranteed to be compatible with later releases. When the feature reaches Public Preview, you must drop existing indexes and create new ones.
Requirements
Full-text search indexes have requirements for compute, base table and schema permissions, and base table configuration.
Compute
Full-text search indexes are available only in Databricks Runtime 18.2 and above, and you must enable this Beta feature in your workspace settings. See Manage Databricks previews.
Permissions
To create a search index:
- You must have the `MODIFY` permission on the table referenced in the search index.
- You must have the `CREATE TABLE` permission on the parent schema. A schema owner or a user with the `MANAGE` privilege can grant you the `CREATE TABLE` privilege on the schema.
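The required grants can be issued with standard Unity Catalog `GRANT` statements. The following is a minimal sketch, assuming a table `logs` in a hypothetical schema `main.observability` and a hypothetical principal `analyst@example.com`:

```sql
-- Allow the principal to modify the base table referenced by the index.
GRANT MODIFY ON TABLE main.observability.logs TO `analyst@example.com`;

-- Allow the principal to create the index in the parent schema.
GRANT CREATE TABLE ON SCHEMA main.observability TO `analyst@example.com`;
```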
Table configuration
Before you create a full-text search index, the base table must satisfy all of the following:
- You must create the index in the same catalog and schema as the base table.
- The table is a managed Delta Lake table or a managed Iceberg table.
- Row tracking is enabled (`delta.enableRowTracking = true`). See Row tracking in Databricks.
- Indexed columns are of type `STRING`, `VARIANT`, `STRUCT`, or `ARRAY`. `STRING` columns use the `UTF8_BINARY` collation.
- A `STRUCT` column contains at least one `STRING`, `VARIANT`, or `ARRAY` leaf field at any nesting depth; other leaf fields are ignored.
- The table doesn't use any features from the list of limitations, including Delta Sharing, shallow cloning, attribute-based access controls, row-level security policies, and column masks. See Limitations.
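If row tracking isn't already enabled on the base table, you can turn it on with a table property before creating the index. A minimal sketch, assuming a managed Delta table named `logs`:

```sql
-- Enable row tracking, which full-text search indexes require.
ALTER TABLE logs SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
```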
For information about table protocol requirements, which apply to both Delta Lake and Iceberg tables, see Delta Lake feature compatibility and protocols.
Create a full-text search index
Use `CREATE SEARCH INDEX` to create an index over one or more text columns. The following example indexes two text columns of an existing log table:

```sql
CREATE SEARCH INDEX log_idx
ON logs (message, error_detail);
```
The full syntax is:
```sql
CREATE SEARCH INDEX [IF NOT EXISTS] index_name
ON table_name ( column_name [, column_name ...] )
[OPTIONS ( option_key = option_value [, ... ] )]
```
`index_name` must be unique within the schema and can't match the name of an existing table.
To control how the text is tokenized, see Options.
If `CREATE SEARCH INDEX` or `REFRESH INDEX` fails mid-execution, run `REFRESH INDEX` to recover from the partial failure.
Options
The OPTIONS clause accepts the following keys:
| Key | Values | Default | Description |
|---|---|---|---|
| `tokenizer` | `'ngram'`, `'split'` | | How the text is tokenized for indexing. See Select a tokenizer for your use case. |
| `ngram_size` | integer | | Length of the n-grams produced. Only valid when `tokenizer = 'ngram'`. |
| `min_token_length` | integer | | Minimum length of tokens to keep. Tokens shorter than this are dropped during indexing. Only valid when `tokenizer = 'split'`. |
For detailed information about invalid option errors, see SEARCH_INDEX_INVALID_PARAMETERS error condition.
Select a tokenizer for your use case
Search indexes have two tokenizer options, depending on your use case:
| Tokenizer | Use case | Description |
|---|---|---|
| `ngram` | Substring matching. | Splits text into overlapping n-grams of length `ngram_size`. |
| `split` | Whole-word containment checks. | Splits text into word tokens. A token is a run of Unicode letters. |
To create an n-gram index with an n-gram size of 4:
```sql
CREATE SEARCH INDEX log_ngram_idx
ON logs (message)
OPTIONS (tokenizer = 'ngram', ngram_size = 4);
```
To create a split index with a minimum token length of 2:
```sql
CREATE SEARCH INDEX log_word_idx
ON logs (message)
OPTIONS (tokenizer = 'split', min_token_length = 2);
```
Query data using search and isearch
Databricks has two SQL functions to test whether a search pattern is present in one or more text targets:
- `search`: Case-sensitive.
- `isearch`: Case-insensitive.
Select `search` or `isearch` based on your case-sensitivity requirement. When the queried columns are covered by a full-text search index, Databricks uses the index to skip files guaranteed not to contain matching rows. Search indexes affect only performance, never query results.
Indexes accelerate queries the most when the search pattern appears in a small fraction of the table's files.
```sql
search( target [, target ... ] , 'pattern' [, mode => 'substring' | 'word' ] )
isearch( target [, target ... ] , 'pattern' [, mode => 'substring' | 'word' ] )
```
Arguments
`search` and `isearch` accept the following arguments:

- `target` must be of type `STRING`, `VARIANT`, `STRUCT`, or `ARRAY`, the same types that indexing allows. Targets are deduplicated.
- `pattern` must be a non-null string literal.
- `mode` specifies how `pattern` matches each `target`:
  - `substring` (default): `pattern` is matched as a substring within each `target`.
  - `word`: `pattern` is split into word tokens using the same rule as the `split` tokenizer. The function returns `true` if every word in `pattern` appears in at least one target, regardless of order. See Select a tokenizer for your use case.
Returns
`search` and `isearch` return a `BOOLEAN` value with three-valued logic:

- `true` if at least one non-null target matches.
- `null` if no non-null target matches but at least one target is `null`.
- `false` if all targets are non-null and none match.
Examples
The following examples show common `search` and `isearch` queries:

```sql
-- Case-insensitive substring search across one column.
SELECT * FROM logs
WHERE isearch(message, 'connection refused');

-- Case-sensitive substring search across multiple columns.
SELECT * FROM logs
WHERE search(message, error_detail, '550e8400-e29b-41d4-a716-446655440000');

-- Word search: matches rows containing all three words, in any order.
SELECT * FROM audit_logs
WHERE search(message, 'user admin login', mode => 'word');
```
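The three-valued return logic can also be seen with literal targets. The expected results in the comments below are a sketch derived from the rules in Returns:

```sql
SELECT search('connection timeout', 'timeout');   -- true: the target matches
SELECT search('connection refused', 'timeout');   -- false: non-null target, no match
SELECT search(CAST(NULL AS STRING), 'timeout');   -- null: the only target is null
```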
Manage indexes
Full-text search indexes don't update automatically when the base table changes. See Refresh an index.
Databricks maintains query correctness, regardless of index freshness. When a table contains non-indexed data, the query uses the existing index to accelerate access to the indexed records and uses a table scan for the non-indexed records.
Use the following operations to manage full-text search indexes:
Describe or view an index
To view information about an index:
```sql
DESCRIBE INDEX log_idx;
```
Refresh an index
Full-text search indexes don't update automatically when the base table changes.
To update the index by adding entries for new rows:

```sql
REFRESH INDEX log_idx;
```
`REFRESH INDEX` is an incremental, append-only operation. It indexes new data but doesn't remove entries for deleted rows.
To both add entries for new rows and remove entries for deleted rows, use `REFRESH INDEX ... FULL`:

```sql
REFRESH INDEX log_idx FULL;
```
A full refresh requires more compute resources than an incremental refresh. However, incremental refreshes accumulate stale entries for deleted rows over time, which increase the size of the index and degrade performance; a full refresh removes these stale entries.
Drop an index
To drop an index, run the following:
```sql
DROP INDEX log_idx;
```
To avoid an error when the index doesn't exist, use:

```sql
DROP INDEX IF EXISTS log_idx;
```
Dropping the base table also drops any full-text search indexes defined on it.
Limitations
Full-text search indexes have the following limitations:
- Renaming an indexed column on the base table, or changing its data type, is not supported.
- Tables with Delta Sharing are not supported. If you add the base table as a Delta Sharing source or target after creating the index, Databricks ignores the search index.
- Tables with shallow clones are not supported. If you add the base table as a shallow clone source after creating the index, Databricks ignores the search index.
- Tables with attribute-based access controls, column masks, or row-level security policies are not supported. If you add any of these controls to a table with a search index, Databricks ignores the search index. See Core concepts for attribute-based access control (ABAC).