Auto Optimize
Auto Optimize is an optional set of features that automatically compact small files during individual writes to a Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively. Auto Optimize is particularly useful in the following scenarios:
- Streaming use cases where latency in the order of minutes is acceptable
- MERGE INTO is the preferred method of writing into Delta Lake
- CREATE TABLE AS SELECT or INSERT INTO are commonly used operations
How Auto Optimize works
Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.
Optimized Writes
Databricks dynamically optimizes Apache Spark partition sizes based on the actual data, and attempts to write out 128 MB files for each table partition. This is an approximate size and can vary depending on dataset characteristics.
Auto Compaction
After an individual write, Databricks checks if files can be compacted further, and runs a quick OPTIMIZE job (with 128 MB file sizes instead of 1 GB) to further compact files for the partitions with the largest number of small files.

Usage
Auto Optimize is designed to be configured for specific Delta tables. You enable Optimized Writes for a table by setting the table property delta.autoOptimize.optimizeWrite = true. Similarly, you set delta.autoOptimize.autoCompact = true to enable Auto Compaction.
For existing tables, run:
ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
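For a new table, you can set the same properties at creation time instead. The following is a minimal PySpark sketch; the table name and schema are hypothetical:
# Hypothetical example: create a new Delta table with both Auto Optimize
# properties enabled from the start.
spark.sql("""
  CREATE TABLE events (id BIGINT, eventDate DATE, eventType STRING)
  USING DELTA
  TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact = true
  )
""")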
To ensure all new Delta tables have these features enabled, set the SQL configuration:
set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;
In addition, you can enable and disable both of these features for Spark sessions with the configurations:
spark.databricks.delta.optimizeWrite.enabled
spark.databricks.delta.autoCompact.enabled
The session configurations take precedence over the table properties, allowing you to better control when to opt in to or out of these features.
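As an illustrative sketch, the session-level overrides can be set from PySpark; the values shown are examples only:
# Override the table properties for the current Spark session only:
# enable Optimized Writes and disable Auto Compaction.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "false")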
When to opt in and opt out
This section provides guidance on when to opt in and opt out of Auto Optimize features.
Optimized Writes
Optimized Writes aim to maximize the throughput of data being written to a storage service. This can be achieved by reducing the number of files being written, without sacrificing too much parallelism.
Optimized Writes require the shuffling of data according to the partitioning structure of the target table. This shuffle naturally incurs additional cost. However, the throughput gains during the write may pay off the cost of the shuffle. If not, the throughput gains when querying the data should still make this feature worthwhile.
The key part of Optimized Writes is that it is an adaptive shuffle. If you have a streaming ingest use case and input data rates change over time, the adaptive shuffle will adjust itself accordingly to the incoming data rates across micro-batches. If you have code snippets where you coalesce(n) or repartition(n) just before you write out your stream, you can remove those lines.
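For instance, a streaming write that previously used a manual repartition(n) before the sink can write directly once Optimized Writes are enabled on the target table. A minimal sketch, assuming a streaming DataFrame events_df and hypothetical paths:
# With delta.autoOptimize.optimizeWrite = true on the target table, no
# .coalesce(n) or .repartition(n) is needed before the streaming write.
(events_df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/events")  # hypothetical path
  .start("/delta/events"))                                   # hypothetical table path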
Auto Compaction
Auto Compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. This means that if you have code patterns where you make a write to Delta Lake and then immediately call OPTIMIZE, you can remove the OPTIMIZE call if you enable Auto Compaction.
Auto Compaction uses different heuristics than OPTIMIZE. Since it runs synchronously after a write, we have tuned Auto Compaction to run with the following properties:
- Databricks does not support Z-Ordering with Auto Compaction as Z-Ordering is significantly more expensive than just compaction.
- Auto Compaction generates smaller files (128 MB) than OPTIMIZE (1 GB).
- Auto Compaction greedily chooses a limited set of partitions that would best leverage compaction. The number of partitions selected varies depending on the size of the cluster it is launched on. If your cluster has more CPUs, more partitions can be optimized.
Example workflow: Streaming ingest with concurrent deletes or updates
This workflow assumes that you have one cluster running a 24/7 streaming job ingesting data, and one cluster that runs on an hourly, daily, or ad-hoc basis to delete or update a batch of records. For this use case, Databricks recommends that you:
Enable Optimized Writes on the table level using
ALTER TABLE <table_name|delta.`table_path`> SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
This ensures that the files written by the stream and by the delete and update jobs are of optimal size.
Enable Auto Compaction on the session level using the following setting on the job that performs the delete or update.
spark.sql("set spark.databricks.delta.autoCompact.enabled = true")
This allows files to be compacted across your table. Since it happens after the delete or update, you mitigate the risks of a transaction conflict.
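Putting both settings together, the batch job that performs the deletes might look like the following sketch; the table name and predicate are hypothetical:
# Hypothetical hourly job that deletes old records from a table that a
# separate cluster streams into. Auto Compaction is enabled only for this
# session, so the compaction runs on this cluster after the delete succeeds.
spark.sql("set spark.databricks.delta.autoCompact.enabled = true")
spark.sql("DELETE FROM events WHERE eventDate < date_sub(current_date(), 30)")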
Frequently asked questions (FAQ)
Does Auto Optimize Z-Order files?
Auto Optimize performs compaction only on small files. It does not Z-Order files.
Does Auto Optimize corrupt Z-Ordered files?
Auto Optimize ignores files that are Z-Ordered. It only compacts new files.
If I have Auto Optimize enabled on a table that I’m streaming into, and a concurrent transaction conflicts with the optimize, will my job fail?
No. Transaction conflicts that cause Auto Optimize to fail are ignored, and the stream will continue to operate normally.
Do I need to schedule OPTIMIZE jobs if Auto Optimize is enabled on my table?
For tables with size greater than 10 TB, we recommend that you keep OPTIMIZE running on a schedule to further consolidate files and reduce the metadata of your Delta table. Since Auto Optimize does not support Z-Ordering, you should still schedule OPTIMIZE ... ZORDER BY jobs to run periodically.
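As an illustrative sketch, such a scheduled job could run the following; the table and column names are hypothetical:
# Hypothetical scheduled job: Z-Order the table periodically, since Auto
# Optimize only compacts small files and does not Z-Order them.
spark.sql("OPTIMIZE events ZORDER BY (eventType)")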