Low shuffle merge on Databricks
Low shuffle merge is supported in Databricks Runtime 9.0 and above. It is generally available (GA) in Databricks Runtime 10.3 and above and in Public Preview in Databricks Runtime 9.0, 9.1, 10.0, 10.1, and 10.2. We recommend that Preview customers migrate to Databricks Runtime 10.3 or above.
The MERGE command is used to perform simultaneous updates, insertions, and deletions from a Delta Lake table. Databricks has an optimized implementation of
MERGE that improves performance substantially for common workloads by reducing the number of shuffle operations.
Databricks low shuffle merge provides better performance by processing unmodified rows in a separate, more streamlined processing mode, instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. Low shuffle merge also reduces the need for users to re-run the OPTIMIZE ZORDER BY command after performing a
MERGE workloads only update a relatively small number of rows in a table. However, Delta tables can only be updated on a per-file basis. When the
MERGE command needs to update or delete a small number of rows that are stored in a particular file, then it must also process and rewrite all remaining rows that are stored in the same file, even though these rows are unmodified. Low shuffle merge optimizes the processing of unmodified rows. Previously, they were processed in the same way as modified rows, passing them through multiple shuffle stages and expensive calculations. In low shuffle merge, the unmodified rows are instead processed without any shuffles, expensive processing, or other added overhead.
Optimized data layout
In addition to being faster to run, low shuffle merge benefits subsequent operations as well. The earlier
MERGE implementation caused the data layout of unmodified data to be changed entirely, resulting in lower performance on subsequent operations. Low shuffle merge tries to preserve the existing data layout of the unmodified records, including Z-order optimization on a best-effort basis. Hence, with low shuffle merge, the performance of operations on a Delta table will degrade more slowly after running one or more
Low shuffle merge tries to preserve the data layout on existing data that is not modified. The data layout of updated or newly inserted data may not be optimal, so it may still be necessary to run the
OPTIMIZE or OPTIMIZE ZORDER BY commands.
Low shuffle merge is enabled by default in Databricks Runtime 10.4 and above. In earlier supported Databricks Runtime versions it can be enabled by setting the configuration
true. This flag has no effect in Databricks Runtime 10.4 and above.