Data Skipping Index¶
New in version runtime-3.0.
See Cluster Image Release Notes for more information.
In addition to Partition Pruning, Databricks Runtime includes another feature that is meant to avoid scanning irrelevant data, namely the Data Skipping Index. The idea behind it is that we can use file-level statistics in order to perform additional skipping at file granularity. This works with, but does not depend on, Hive-style partitioning.
The effectiveness of Data Skipping depends on the characteristics of your data and its physical layout. As skipping is done at file granularity, it is important that your data is horizontally partitioned across multiple files. This will typically happen as a consequence of having multiple append jobs, (shuffle) partitioning, bucketing, and/or the use of
spark.sql.files.maxRecordsPerFile. It works best on tables with sorted buckets (
CREATE TABLE ... CLUSTERED BY ... SORTED BY ...), or with columns that are correlated with partition keys (e.g. brandName - modelName, companyID - stockPrice), but also when your data just happens to exhibit some sortedness / clusteredness (e.g. orderID, bitcoinValue).
This feature is currently in beta and has a number of limitations:
- It’s Opt-In: needs to be enabled manually, on a per-table basis.
- It’s SQL only: there is no DataFrame API for it.
- Beware: Once a table is indexed, the effects of subsequent
ADD PARTITIONoperations will not be visible until the index is REFRESHed.
CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
Enables Data Skipping on the given table for all supported columns.
Refresh Full Index¶
REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
Rebuilds the whole index. I.e. all the table’s partitions will be re-indexed.
REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name PARTITION (part_col_name1[=val1], part_col_name2[=val2], ...)
Re-indexes the specified partitions only. This operation should generally be faster than full index refresh.
DROP DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
Disables Data Skipping on the given table and deletes all index data.