Optimizing Performance and Cost

To improve query speed, Databricks Delta can optimize the layout of data stored in DBFS. It supports two layout algorithms: bin-packing and ZOrdering.

This topic describes how to run the optimization commands and how the two layout algorithms work. It includes an example notebook that demonstrates the benefits of optimization.

This topic also describes how to clean up stale table snapshots.

The FAQ explains why optimization is not automatic and includes recommendations for how often to run the OPTIMIZE command.

Compaction (bin-packing)

Databricks Delta can improve the speed of read queries from a table by coalescing small files into larger ones. You trigger compaction by running the OPTIMIZE command:

OPTIMIZE delta.`/data/events`

or

OPTIMIZE events

If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using WHERE:

OPTIMIZE events WHERE date >= '2017-01-01'

Note

Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second instance has no effect.

Readers of Databricks Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE removes unnecessary files from the transaction log. OPTIMIZE makes no data-related changes to the table, so a read before and after an OPTIMIZE returns the same results. Performing OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this table as a source.
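For example, an aggregate over the table returns the same result immediately before and after compaction (using the events table from the examples above):

SELECT count(*) FROM events;   -- result before compaction

OPTIMIZE events;

SELECT count(*) FROM events;   -- same result after compaction; only the file layout changed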

ZOrdering (multi-dimensional clustering)

ZOrdering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Databricks Delta data-skipping algorithms to dramatically reduce the amount of data that needs to be read. To ZOrder data, you specify the columns to order on in the ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.
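For example, to cluster on two columns at once (a sketch that assumes the events table also has a deviceId column):

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType, deviceId)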

Note

ZOrdering is not an idempotent operation and rearranges all of the data that matches the given filter each time it runs. Therefore, we suggest that you limit it to new data, using partition filters when possible.

For an example of the benefits of optimization, see the following notebook:

Data skipping

Data skipping information is collected automatically when you write data into a Databricks Delta table. Databricks Delta takes advantage of this information (min and max values) at query time to provide faster queries. You do not need to configure data skipping.
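For example, a query with a selective filter reads only the files whose collected min and max values can possibly match the predicate (a sketch using the events table from above; the eventType value is illustrative):

-- Files whose min/max statistics for date and eventType cannot match these values are skipped at read time
SELECT count(*) FROM events
WHERE date = '2017-01-01' AND eventType = 'login'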

Garbage collection

To ensure that concurrent readers can continue reading a stale snapshot of a table, Databricks Delta leaves deleted files on DBFS for a period of time. To save on storage costs, you should occasionally clean up these invalid files using the VACUUM command:

VACUUM delta.`/data/events`

or

VACUUM events

You can also specify DRY RUN to test the vacuum and return a list of files to be deleted:

VACUUM events DRY RUN

The VACUUM command removes any files that are no longer in the transaction log for the table and are older than a retention threshold. The default threshold is 7 days, but you can specify an alternate retention interval. For example, to delete all stale files older than 8 days, you can execute the following SQL command:

VACUUM events RETAIN 192 HOURS

Warning

Databricks does not recommend that you set a retention interval shorter than 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed.

Databricks Delta has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
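For example, to disable the check for the current session and vacuum with a shorter retention interval (a sketch for illustration only; 24 hours is well below the recommended 7-day minimum):

-- Turn off the retention duration safety check for this session (use with care)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Remove files older than 24 hours that are no longer referenced by the transaction log
VACUUM events RETAIN 24 HOURS;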

Frequently asked questions (FAQ)

Why isn’t OPTIMIZE automatic?

The OPTIMIZE operation starts up many Spark jobs in order to optimize file sizes via compaction (and optionally perform ZOrdering). Since much of what OPTIMIZE does is compact small files, you must first accumulate many small files before this operation has an effect. Therefore, the OPTIMIZE operation is not run automatically.

Moreover, running OPTIMIZE, especially with ZORDER, is an expensive operation in time and resources. If Databricks ran OPTIMIZE automatically or waited to write out data in batches, it would remove the ability to run low-latency Databricks Delta streams (where a Databricks Delta table is the source). Many customers have Databricks Delta tables that are never optimized because they only stream data from these tables, obviating the query benefits that OPTIMIZE would provide.

Lastly, Databricks Delta automatically collects statistics about the files that are written to the table (whether through an OPTIMIZE operation or not). This means that reads from Databricks Delta tables leverage this information whether or not the table or a partition has had the OPTIMIZE operation run on it.

How often should I run OPTIMIZE?

When you choose how often to run OPTIMIZE, there is a trade-off between performance and cost. You should run OPTIMIZE more often if you want better end-user query performance (necessarily at a higher cost because of resource usage). You should run it less often if you want to optimize cost.

We recommend that you start by running OPTIMIZE on a daily basis (preferably at night when spot prices are low), and adjust the frequency of your job from there.
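For example, a nightly job could limit the work to the most recent partition (a sketch, assuming the events table is partitioned by date as in the earlier examples):

-- Compact and cluster only yesterday's and today's data
OPTIMIZE events WHERE date >= date_sub(current_date(), 1)
ZORDER BY (eventType)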