OPTIMIZE

Applies to: Databricks SQL Databricks Runtime

OPTIMIZE optimizes the layout of Delta Lake data. You can optimize a subset of data or collocate data by column. If you don't specify collocation and the table doesn't use liquid clustering, Delta Lake performs bin-packing optimization.

Syntax

OPTIMIZE table_name [FULL] [WHERE predicate]
  [ZORDER BY (col_name1 [, ...] ) ]

note

Bin-packing optimization is idempotent: if you run it twice on the same data set, the second run has no effect. It produces evenly-balanced data files with respect to their size on disk, but not necessarily the number of tuples per file. The two measures are most often correlated.

Z-Ordering is not idempotent, but operates incrementally. The time Z-Ordering takes is not guaranteed to decrease over multiple runs. However, if no new data was added to a partition that was just Z-Ordered, running Z-Ordering again on that partition has no effect. Z-Ordering produces evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but skew in optimize task times can occur when they diverge.

note

When using Databricks Runtime, to control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize. The default value is 1073741824 (1 GB). Specifying 104857600 sets the file size to 100 MB.

Parameters

table_name

Identifies an existing Delta table. The name must not include a temporal specification or options specification.
FULL

Applies to: Databricks Runtime 16.0 and above

Rewrites all data files in the table. Use OPTIMIZE table_name FULL to:
- Optimize the whole table, including data that was previously clustered (for tables using liquid clustering).
- Recompress existing data files when you change the table's compression codec using the delta.parquet.compression.codec property.
To recompress existing data after changing the compression codec, run OPTIMIZE table_name FULL:
SQL
```
-- Change compression codec
ALTER TABLE table_name SET TBLPROPERTIES ('delta.parquet.compression.codec' = 'ZSTD');

-- Recompress all existing data files
OPTIMIZE table_name FULL;
```
To optimize a subset of files in a table with liquid clustering enabled, combine FULL with a WHERE predicate (Databricks Runtime 18.1 and above). Only simple range predicates on a single clustering column are supported. A file is included if any part of its range overlaps with the predicate. For example, OPTIMIZE events FULL WHERE date > 15 includes a file with range (col_min = 10, col_max = 20) because the file's range overlaps with the predicate.
WHERE

Optimizes the subset of rows matching a partition / clustering predicate. Only filters on partition / clustering key attributes are supported.

For tables that use liquid clustering, use OPTIMIZE table_name FULL WHERE predicate (Databricks Runtime 18.1 and above) instead.
ZORDER BY

note
Databricks recommends liquid clustering for all new tables instead of Z-ordering. See Use liquid clustering for tables.

Collocates column information in the same set of files. Delta Lake data-skipping algorithms use co-locality to reduce the amount of data that needs to be read. You can specify multiple columns as a comma-separated list, but the effectiveness of co-locality decreases with each additional column.

You can't use this clause on tables that use liquid clustering.

Examples

SQL
> OPTIMIZE events;

> OPTIMIZE events FULL;

> -- Partitioned table
> OPTIMIZE events WHERE date >= '2017-01-01';

> -- Table with liquid clustering enabled (DBR 18.1 and above)
> OPTIMIZE events FULL WHERE date >= '2025-01-01';

> OPTIMIZE events
    WHERE date >= current_timestamp() - INTERVAL 1 day
    ZORDER BY (eventType);

For more information, see Optimize data file layout.

Syntax​

Parameters​

Examples​

Syntax

Parameters

Examples