
OPTIMIZE

Applies to: Databricks SQL and Databricks Runtime

Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or collocate data by column. If you do not specify collocation and the table is not defined with liquid clustering, bin-packing optimization is performed.

Syntax

OPTIMIZE table_name [FULL] [WHERE predicate]
[ZORDER BY (col_name1 [, ...] ) ]
Note
  • Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect. It aims to produce evenly balanced data files with respect to their size on disk, but not necessarily the number of tuples per file. However, the two measures are most often correlated.
  • Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not guaranteed to decrease over multiple runs. However, if no new data was added to a partition that was just Z-Ordered, another Z-Ordering of that partition will have no effect. Z-Ordering aims to produce evenly balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times.
Note

When using Databricks Runtime, you can control the output file size by setting the Spark configuration spark.databricks.delta.optimize.maxFileSize. The default value is 1073741824, which sets the maximum file size to 1 GB. Specifying the value 104857600 sets it to 100 MB.
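For example, the configuration can be set at the session level in SQL before running OPTIMIZE. This is a minimal sketch: the events table name is illustrative, and the session-scoped SET statement assumes you only want to change the setting for the current session rather than the cluster's Spark configuration.

SQL
-- Target roughly 100 MB output files for this session (illustrative)
SET spark.databricks.delta.optimize.maxFileSize = 104857600;

-- Subsequent OPTIMIZE runs in this session use the new target file size
OPTIMIZE events;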

Important

When you change a table's compression codec using the delta.parquet.compression.codec property, existing data files are not automatically recompressed. To recompress existing data with your chosen compression format, run OPTIMIZE table_name FULL (Databricks Runtime 16.0 and above):

SQL
-- Change compression codec
ALTER TABLE table_name SET TBLPROPERTIES ('delta.parquet.compression.codec' = 'ZSTD');

-- Recompress all existing data files
OPTIMIZE table_name FULL;

Parameters

  • table_name

    Identifies an existing Delta table. The name must not include a temporal specification or options specification.

  • FULL

    Applies to: Databricks Runtime 16.0 and above

    Rewrites all data files in the table. Use OPTIMIZE table_name FULL to:

    • Optimize the whole table including data that might have previously been clustered (for tables using liquid clustering).
    • Recompress existing data files when you change the table's compression codec using the delta.parquet.compression.codec property.
  • WHERE

    Optimize the subset of rows matching the given partition predicate. Only filters involving partition key attributes are supported.

    You cannot use this clause on tables that use liquid clustering.

  • ZORDER BY

    Collocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read. You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality decreases with each additional column.

    You cannot use this clause on tables that use liquid clustering.

Examples

SQL
> OPTIMIZE events;

> OPTIMIZE events FULL;

> OPTIMIZE events WHERE date >= '2017-01-01';

> OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType);
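
Because ZORDER BY accepts a comma-separated list of columns, several columns can be combined in one run, though effectiveness decreases with each additional column. The following is a minimal sketch, assuming the events table also has a deviceId column (a hypothetical column used only for illustration):

SQL
> -- Z-Order recent data by two columns (deviceId is hypothetical)
> OPTIMIZE events
  WHERE date >= '2017-01-01'
  ZORDER BY (eventType, deviceId);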

For more information about the OPTIMIZE command, see Optimize data file layout.