Optimizing Performance and Cost
===============================
Delta can optimize the layout of data stored in DBFS to improve query speed. Delta currently supports two layout algorithms: bin-packing and ZOrdering.
Compaction (or Bin-Packing)
---------------------------
Delta can optimize a table to improve the speed of read queries, for example by coalescing small files into larger ones. This operation can be triggered by running the following command:
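For example, to compact the entire events table (the table used in the examples below):

.. code:: sql

    OPTIMIZE events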
If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using WHERE:

.. code:: sql

    OPTIMIZE events WHERE date >= '2017-01-01'
Readers of Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE deletes unnecessary small files from the transaction log.
Bin-packing optimization is idempotent: running it twice on the same dataset has no effect the second time. Moreover, OPTIMIZE makes no data-related changes to the table, so reads before and after an OPTIMIZE return the same results.
Performing this operation on a table that is a streaming source will not affect any current or future streams that treat this table as a source.
ZOrdering (Multi-dimensional clustering)
----------------------------------------
ZOrdering is a technique that allows Delta to colocate related information in the same set of files. This co-locality is automatically used by Delta’s data-skipping algorithms to dramatically reduce the amount of data that needs to be read. In order to ZOrder data, you must specify the columns to perform the ordering on:
.. code:: sql

    OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.
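For example, to cluster on two columns (userId here is a hypothetical second column, not one defined earlier in this guide):

.. code:: sql

    OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType, userId)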
ZOrdering is not an idempotent operation, and thus will rearrange all of the data that matches the given filter. Therefore we suggest that you limit it to new data, using partition filters when possible.
How Often Should You Run OPTIMIZE?
----------------------------------
We are frequently asked how often you should run OPTIMIZE on a Delta table. This is largely a business decision: if you want end-user queries to always be fast, it is well worth running OPTIMIZE frequently.

However, OPTIMIZE starts up a number of Spark jobs, making it resource-intensive. For this reason, we often recommend starting with a daily or weekly run to see if that meets your needs. One tip is to run OPTIMIZE in the middle of the night and leverage spot pricing to make reorganizing your data very cheap.
Data Skipping
-------------

Data skipping information is collected automatically when you write data into a Delta table. Delta takes advantage of this information (currently minimum and maximum values for each column) at query time to provide faster queries. End users do not need to configure data skipping.
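For example, given a query such as the following, Delta consults the collected statistics and skips any file whose minimum-maximum range for eventType cannot contain the requested value ('login' is an illustrative value, not one defined in this guide):

.. code:: sql

    SELECT count(*) FROM events WHERE eventType = 'login'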
Garbage Collecting Old Files
----------------------------
Delta leaves deleted files on DBFS for a period of time to ensure that concurrent readers can continue reading a stale snapshot of the table. However, this functionality means that occasionally you must clear out these invalid files using the VACUUM command to save on storage costs:

.. code:: sql

    VACUUM events            -- vacuum a table
    VACUUM '/path/to/data'   -- vacuum files at a path
You can also specify DRY RUN to test the vacuum and return a list of the files that would be deleted:

.. code:: sql

    VACUUM events DRY RUN
The VACUUM command removes any files that are no longer in the transaction log for the table and are older than a retention threshold. The default threshold is 7 days, but you can specify an alternate retention interval. For example, to delete all invalid files older than 1 day, you can execute the following SQL command:
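Following the example above, a one-day retention interval is expressed as RETAIN 24 HOURS:

.. code:: sql

    VACUUM events RETAIN 24 HOURS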
Running VACUUM with a very small retention interval is unsafe when there are concurrent readers or writers on the table. Concurrent read jobs can fail if you delete a stale snapshot that is currently being read. Files written by concurrent writers but not yet committed to the log can also be deleted if the retention period is shorter than the duration of the job.