Data skipping for Delta Lake
Note
In Databricks Runtime 13.3 and above, Databricks recommends using liquid clustering for Delta table layout. Clustering is not compatible with Z-ordering. See Use liquid clustering for Delta tables.
Data skipping information is collected automatically when you write data into a Delta table. Delta Lake on Databricks takes advantage of this information (minimum and maximum values, null counts, and total records per file) at query time to provide faster queries.
You must have statistics collected for columns that are used in ZORDER
statements. See What is Z-ordering?.
Specify Delta statistics columns
By default, Delta Lake collects statistics on the first 32 columns defined in your table schema. For this collection, each field in a nested column is considered an individual column. You can modify this behavior by setting one of the following table properties:
Table property |
Databricks Runtime supported |
Description |
---|---|---|
|
All supported Databricks Runtime versions |
Increase or decrease the number of columns on which Delta collects statistics. Depends on column order. |
|
Databricks Runtime 13.3 LTS and above |
Specify a list of column names for which Delta Lake collects statistics. Supersedes |
Table properties can be set at table creation or with ALTER TABLE
statements. See Delta table properties reference.
Updating this property does not automatically recompute statistics for existing data. Rather, it impacts the behavior of future statistics collection when adding or updating data in the table. Delta Lake does not leverage statistics for columns not included in the current list of statistics columns.
In Databricks Runtime 14.3 LTS and above, you can manually trigger the recomputation of statistics for a Delta table using the following command:
ANALYZE TABLE table_name COMPUTE DELTA STATISTICS
Note
Long strings are truncated during statistics collection. You might choose to exclude long string columns from statistics collection, especially if the columns aren’t used frequently for filtering queries.
What is Z-ordering?
Note
Databricks recommends using liquid clustering for all new Delta tables. You cannot use ZORDER
in combination with liquid clustering.
Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the ZORDER BY
clause:
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use ZORDER BY
.
You can specify multiple columns for ZORDER BY
as a comma-separated list. However, the effectiveness of the locality drops with each extra column. Z-ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on.
Note
Z-ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-ordering is not guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-ordered, another Z-ordering of that partition will not have any effect.
Z-ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times.
For example, if you
ZORDER BY
date and your most recent records are all much wider (for example longer arrays or string values) than the ones in the past, it is expected that theOPTIMIZE
job’s task durations will be skewed, as well as the resulting file sizes. This is, however, only a problem for theOPTIMIZE
command itself; it should not have any negative impact on subsequent queries.