Low shuffle merge on Databricks

note

Low shuffle merge is generally available (GA) in Databricks Runtime 10.4 LTS and above and in Public Preview in Databricks Runtime 9.1 LTS. Databricks recommends that Preview customers migrate to Databricks Runtime 10.4 LTS or above.

The MERGE command is used to perform simultaneous updates, insertions, and deletions from a Delta Lake table. Databricks has an optimized implementation of MERGE that improves performance substantially for common workloads by reducing the number of shuffle operations.

Databricks low shuffle merge provides better performance by processing unmodified rows in a separate, more streamlined processing mode, instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. Low shuffle merge also reduces the need for users to re-run OPTIMIZE after performing a MERGE operation.

Optimized performance

Many MERGE workloads only update a relatively small number of rows in a table. However, Delta tables can only be updated on a per-file basis. When the MERGE command needs to update or delete a small number of rows that are stored in a particular file, then it must also process and rewrite all remaining rows that are stored in the same file, even though these rows are unmodified. Low shuffle merge optimizes the processing of unmodified rows. Previously, they were processed in the same way as modified rows, passing them through multiple shuffle stages and expensive calculations. In low shuffle merge, the unmodified rows are instead processed without any shuffles, expensive processing, or other added overhead.

Optimized data layout

Low shuffle merge is faster to run and benefits subsequent operations. The earlier MERGE implementation changed the data layout of unmodified data entirely, degrading performance on subsequent operations. Low shuffle merge preserves the existing data layout of the unmodified records, including liquid clustering layout, on a best-effort basis, performance degrades more slowly after running one or more MERGE commands.

note

Low shuffle merge tries to preserve the data layout on existing data that is not modified. The data layout of updated or newly inserted data may not be optimal, so it may still be necessary to run OPTIMIZE on tables with liquid clustering enabled.

Availability

Low shuffle merge is enabled by default in Databricks Runtime 10.4 and above. In earlier supported Databricks Runtime versions it can be enabled by setting the configuration spark.databricks.delta.merge.enableLowShuffle to true. This flag has no effect in Databricks Runtime 10.4 and above.

Legacy Z-ordering

For tables using Z-ordering, low shuffle merge also tries to preserve the existing Z-order layout on unmodified data on a best-effort basis. The data layout of updated or newly inserted data may not be optimal, so it may still be necessary to run OPTIMIZE ZORDER BY after a MERGE operation. Databricks recommends using liquid clustering for all new tables.

Optimized performance​

Optimized data layout​

Availability​

Legacy Z-ordering​

Optimized performance

Optimized data layout

Availability

Legacy Z-ordering