Optimizing Performance and Cost¶
Databricks Delta can optimize the layout of data stored in DBFS to improve query speed. Databricks Delta supports two layout algorithms: bin-packing and ZOrdering.
Databricks Delta can optimize a table to improve the speed of read queries, for example by coalescing small files into larger ones. This operation can be triggered by running the
If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using
OPTIMIZE events WHERE date >= '2017-01-01'
Readers of Databricks Delta tables use snapshot isolation, which means that they are not interrupted when
OPTIMIZE removes unnecessary files from the transaction log.
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second instance has no effect. Moreover,
OPTIMIZE makes no data related changes to the table, so a read before and after an
OPTIMIZE have the same results.
OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this table as a source.
ZOrdering (multi-dimensional clustering)¶
ZOrdering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Databricks Delta data-skipping algorithms to dramatically reduce the amount of data that needs to be read. To ZOrder data, you must specify the columns to order on:
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
You can specify multiple columns for
ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.
ZOrdering is not an idempotent operation, and rearranges all of the data that matches the given filter. Therefore we suggest that you limit it to new data, using partition filters when possible.
How often should you run
We are frequently asked how often you should run
OPTIMIZE on a Databricks Delta table. This is largely a business decision. If you want to make sure that queries are always fast, it is well worth running
OPTIMIZE does start up a number of Spark jobs, making it resource-intensive. For this reason, we recommend that you start by running
OPTIMIZE once a day or week to see if that meets your needs. One tip is to run
OPTIMIZE in the middle of the night and leverage spot pricing to make it very cheap to organize your data.
Data skipping information is collected automatically for you when you write data into a Databricks Delta table. Databricks Delta takes advantage of this information (min and max values) on your behalf at query time in order to attempt to provide faster queries. You do not need to configure data skipping.
Databricks Delta leaves deleted files on DBFS for a period of time to ensure that concurrent readers can continue reading a stale snapshot of the table. However, this functionality means that occasionally you must clear out these invalid files using the
VACUUM command to save on storage costs.
Delta leaves deleted files on DBFS for a period of time to ensure that concurrent readers can continue reading a stale snapshot of the table. However, this functionality means that occasionally you must clear out these invalid files using the VACUUM command to save on storage costs.
VACUUM events VACUUM '/path/to/data'
You can also specify
DRY RUN to test the vacuum and return a list of files to be deleted:
VACUUM events DRY RUN
VACUUM command removes any files that are no longer in the transaction log for the table and are older than a retention threshold. The default threshold is 7 days, but you can specify an alternate retention interval. For example, to delete all stale files older than 8 days, you can execute the following SQL command:
VACUUM boxes RETAIN 192 HOURS
Databricks does not recommend that you set a retention interval shorter than 7 days because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If
VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when
VACUUM deletes files that have not yet been committed. Databricks Delta has a safety check to prevent you from running a dangerous
VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the SQL conf
false. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.