Table size on Databricks
The table size reported for tables backed by Delta Lake on Databricks differs from the total size of the corresponding file directories in cloud object storage. This article explains why this difference exists and provides recommendations for controlling costs.
Why does my Delta table size not match the directory size?
Table sizes reported in Databricks through UIs and DESCRIBE commands refer to the total size on disk of the data files referenced in the current version of the Delta table. Most operations that write to tables rewrite underlying data files, but the old data files are retained for a period of time to support time travel queries.
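For example, you can compare the size reported for the current table version against the contents of the storage directory. The following sketch uses DESCRIBE DETAIL, which reports the numFiles and sizeInBytes fields for the current version; the table name my_catalog.my_schema.my_table is a placeholder.

    -- Report the number and total size of data files referenced
    -- by the current version of the table (placeholder table name).
    DESCRIBE DETAIL my_catalog.my_schema.my_table;

    -- sizeInBytes reflects only files in the current version;
    -- files retained for time travel are not counted.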
Note
If you regularly delete or update records in tables, deletion vectors can accelerate queries and reduce the total size of data files. See What are deletion vectors?.
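As a minimal sketch, deletion vectors are controlled through a Delta table property; the table name below is a placeholder:

    -- Enable deletion vectors on an existing Delta table
    -- (placeholder table name).
    ALTER TABLE my_catalog.my_schema.my_table
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);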
Use predictive optimization to control data size
Databricks recommends using Unity Catalog managed tables with predictive optimization enabled. With managed tables and predictive optimization, Databricks automatically runs OPTIMIZE and VACUUM commands to prevent the buildup of unused data files. Expect there to always be a difference in size between the current version of a table and the total size of data files in cloud object storage, because data files that are no longer referenced in the current version are retained to support time travel queries. See Predictive optimization for Unity Catalog managed tables.
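Predictive optimization can be enabled at the schema or table level; the following is a minimal sketch with placeholder names:

    -- Enable predictive optimization for all managed tables in a schema.
    ALTER SCHEMA my_catalog.my_schema ENABLE PREDICTIVE OPTIMIZATION;

    -- Or enable it for a single Unity Catalog managed table.
    ALTER TABLE my_catalog.my_schema.my_table ENABLE PREDICTIVE OPTIMIZATION;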
What file metrics does VACUUM report?
When you clean up unused data files with VACUUM or use DRY RUN to preview the files set for removal, metrics report the number of files removed and the total size of the data removed. The size and number of files removed by VACUUM vary drastically, but it is not uncommon for the size of removed files to exceed the total size of the current version of the table.
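A minimal sketch of previewing and then removing unused files; the table name is a placeholder, and the retention window shown is the 7-day default:

    -- Preview which files would be removed, without deleting anything.
    VACUUM my_catalog.my_schema.my_table DRY RUN;

    -- Remove unreferenced files older than the retention threshold
    -- (168 hours = 7 days, the default).
    VACUUM my_catalog.my_schema.my_table RETAIN 168 HOURS;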
What file metrics does OPTIMIZE report?
When OPTIMIZE runs on a target table, new data files combine records from existing data files. Changes committed during OPTIMIZE only affect data organization; the underlying data contents do not change. The total size of the data files associated with the table increases after OPTIMIZE runs, because the newly compacted files coexist in the containing directory with the no-longer-referenced data files.
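A minimal sketch of compaction with a placeholder table name; the optional ZORDER BY clause co-locates records by the listed columns while compacting:

    -- Compact small files into larger ones; records are rewritten,
    -- not changed.
    OPTIMIZE my_catalog.my_schema.my_table;

    -- Optionally cluster related records during compaction
    -- (event_date is a placeholder column).
    OPTIMIZE my_catalog.my_schema.my_table ZORDER BY (event_date);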
The table size reported after OPTIMIZE is generally smaller than the size before OPTIMIZE runs, because data compaction decreases the total size of data files referenced by the current table version. VACUUM must run after the retention threshold passes in order to remove the underlying unreferenced data files.
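You can inspect the metrics reported for each operation in the table history. As a sketch, DESCRIBE HISTORY surfaces per-operation metrics, such as counts of files added and removed (placeholder table name):

    -- Review past operations and their reported file metrics,
    -- including OPTIMIZE and VACUUM entries.
    DESCRIBE HISTORY my_catalog.my_schema.my_table;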
Note
You might see similar metrics for operations such as REORG TABLE or DROP FEATURE. All operations that require rewriting data files increase the total size of data in the containing directory until VACUUM removes data files no longer referenced in the current table version.
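As a sketch of the operations mentioned above, with placeholder names:

    -- Rewrite data files to apply soft deletes recorded in
    -- deletion vectors.
    REORG TABLE my_catalog.my_schema.my_table APPLY (PURGE);

    -- Remove a table feature, which also rewrites data files.
    -- Dropping some features can require additional steps once the
    -- retention window passes; see the DROP FEATURE documentation.
    ALTER TABLE my_catalog.my_schema.my_table DROP FEATURE deletionVectors;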