Skip to main content

Secondary indexes on Unity Catalog managed tables

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

A secondary index accelerates selective lookups on a column of a managed Delta Lake or Iceberg table. When the Databricks query engine determines that a secondary index can help, it automatically uses the index to skip files and rows that can't contain matching data. Indexes might significantly reduce the amount of data scanned for large tables.

Secondary indexes complement Liquid clustering, which organizes the physical layout of your data. A secondary index allows for additional query optimizations without changing the table's physical layout. This is useful when your workload requires selective lookups on a column that isn't part of the clustering key.

important

Indexes created during Beta aren't compatible with indexes created in later releases. When the feature reaches Public Preview, you must drop existing indexes and create new ones.

Requirements

Secondary indexes have requirements for compute, base table and schema permissions, and base table configuration.

Compute

Secondary indexes are available only in Databricks Runtime 18.2 and above, and you must enable this Beta feature in your workspace settings. See Manage Databricks previews.

Permissions

To create a secondary index:

  • You must have the MODIFY permission for the table referenced in the index.
  • You must have the CREATE TABLE permission on the parent schema. A schema owner or user with the MANAGE privilege can grant you CREATE TABLE privileges on the schema.

Table configuration

Before you create a secondary index, the base table must satisfy all of the following:

  • You must create the index in the same catalog and schema as the base table.
  • The table is a managed Delta Lake table or a managed Iceberg table.
  • Row tracking is enabled (delta.enableRowTracking = true). See Row tracking in Databricks.
  • The column to be indexed must be of a supported data type: numeric types, STRING, CHAR(n), VARCHAR(n), or TIMESTAMP.
  • The table doesn't use any features from the list of limitations, including: Delta Sharing, shallow cloning, attribute-based access controls, row-level security policies, and column masks. See Limitations.

For information about table protocol requirements, which apply to both Delta Lake and Iceberg tables, see Delta Lake feature compatibility and protocols.

Create a secondary index

You can create up to four indexes on a single table, each on a different column. The four-index limit is shared by both secondary indexes and full-text search indexes. See Full-text search indexes on Unity Catalog managed tables.

Use CREATE INDEX to create a secondary index on a single column, for example, on the user_id column of an existing table, events:

SQL
CREATE INDEX user_id_idx
ON events (user_id);

The full syntax is:

SQL
CREATE INDEX [IF NOT EXISTS] index_name
ON table_name ( column_name )

index_name must be unique within the parent schema and can't match an existing table name.

warning

If CREATE INDEX or REFRESH INDEX fail mid-execution, run REFRESH INDEX to recover from a partial failure.

Query data using a secondary index

When a secondary index exists on a table, the Databricks query engine automatically uses it to accelerate queries with selective predicates on the indexed column.

Supported predicate types include:

  • Equality: =
  • Set membership: IN

For example, if you have a user_id_idx secondary index on column user_id, the index accelerates queries that perform lookups for a specific user:

SQL
SELECT * FROM events
WHERE user_id = '019982fe-002a-7621-8c20-b332eeb71f44';

Manage indexes

important

If the base table changes, secondary indexes don't update automatically. See Refresh an index.

Databricks maintains query correctness, regardless of index freshness. When a table contains non-indexed data, the query uses the existing index to accelerate access to the indexed records and uses a table scan for the non-indexed records.

Use the following operations to manage secondary indexes:

Describe or view an index

To view information about an index:

SQL
DESCRIBE INDEX user_id_idx;

Refresh an index

When the base table changes, secondary indexes don't update automatically.

To update the index, adding entries for new rows:

SQL
REFRESH INDEX user_id_idx;

REFRESH INDEX is an incremental, append-only operation. It indexes new data but doesn't remove entries for deleted rows.

To update the index, both adding entries for new rows and removing entries for deleted rows, use REFRESH INDEX ... FULL:

SQL
REFRESH INDEX user_id_idx FULL;

A full refresh requires more compute resources than an incremental refresh. Over time, incremental refreshes accumulate stale entries, which increase the size of the index and negatively affect performance.

Drop an index

To drop an index, run the following:

SQL
DROP INDEX user_id_idx;

To avoid an error for missing indexes, use:

SQL
DROP INDEX IF EXISTS user_id_idx;
note

If you drop the base table, the command also drops the secondary indexes.

Limitations

Secondary indexes have the following limitations:

  • You can't rename an indexed column on the base table, or change its data type. Instead, you must drop the index, alter the column, and then recreate the index.
  • Tables with Delta Sharing are not supported. If you add the base table as a Delta Sharing source or target after creating the index, Databricks ignores the secondary index.
  • Tables with shallow clones are not supported. If you add the base table as a shallow clone source after creating the index, Databricks ignores the secondary index.
  • Tables with attribute-based access controls, column masks, or row-level security policies are not supported. If you add any of these controls to a table with a secondary index, Databricks ignores the secondary index. See Core concepts for attribute-based access control (ABAC).