OPTIMIZE (Delta Lake on Databricks)
Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.
Syntax
OPTIMIZE table_identifier [WHERE predicate]
[ZORDER BY (col_name1, col_name2, ...)]
- Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect. It aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. However, the two measures are most often correlated.
- Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-Ordered, another Z-Ordering of that partition will not have any effect. It aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times. * To control the output file size, set the Spark configuration
spark.databricks.delta.optimize.maxFileSize
. The default value is1073741824
. Specifying the value134217728
sets the max output file size to 100MB.
table_identifier
[database_name.] table_name
: A table name, optionally qualified with a database name.delta.`<path-to-table>`
: The location of an existing Delta table.
WHERE
Optimize the subset of rows matching the given partition predicate. Only filters involving partition key attributes are supported.
ZORDER BY
Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read. You can specify multiple columns for
ZORDER BY
as a comma-separated list. However, the effectiveness of the locality drops with each additional column.
Examples
OPTIMIZE events
OPTIMIZE events WHERE date >= '2017-01-01'
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
For more information about the OPTIMIZE
command, see Optimize performance with file management.