Table size on Databricks

The table size reported for tables backed by Delta Lake on Databricks differs from the total size of the corresponding file directories in cloud object storage. This article explains why this difference exists and offers recommendations for controlling costs.

Why does my Delta table size not match the directory size?

Table sizes reported in Databricks through UIs and DESCRIBE commands reflect the total size on disk of the data files referenced in the current version of the Delta table. Most operations that write to tables rewrite underlying data files, but the old data files are retained for a period of time to support time travel queries.
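As an illustration, you can compare the reported size of the current table version against the operations that produced now-unreferenced files. A sketch in Databricks SQL, assuming a hypothetical table named `main.default.events`:

```sql
-- sizeInBytes reflects only data files referenced by the current table version,
-- not the full contents of the table's directory in cloud object storage
DESCRIBE DETAIL main.default.events;

-- The table history shows the write operations whose superseded files
-- remain on disk to support time travel
DESCRIBE HISTORY main.default.events;
```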


If you regularly delete or update records in tables, deletion vectors can accelerate queries and reduce the total size of data files. See What are deletion vectors?.
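For an existing table, deletion vectors can be enabled with a table property so that DELETE, UPDATE, and MERGE operations can mark rows as removed without immediately rewriting entire data files. A sketch, using the same hypothetical table name:

```sql
-- Enable deletion vectors on an existing Delta table
ALTER TABLE main.default.events
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
```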

Use predictive optimization to control data size

Databricks recommends using Unity Catalog managed tables with predictive optimization enabled. With managed tables and predictive optimization, Databricks automatically runs OPTIMIZE and VACUUM commands to prevent the buildup of unused data files. Expect there always to be some difference in size between the current version of a table and the total size of data files in cloud object storage, because data files not referenced in the current version are still required to support time travel queries. See Predictive optimization for Delta Lake.
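Predictive optimization is typically enabled at the account or catalog level, and can also be controlled per object. A sketch of the SQL syntax, with hypothetical object names:

```sql
-- Enable predictive optimization for all managed tables in a catalog
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

-- Or for a single managed table
ALTER TABLE main.default.events ENABLE PREDICTIVE OPTIMIZATION;
```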

What file metrics does VACUUM report?

When you clean up unused data files with VACUUM or use DRY RUN to preview the files set for removal, metrics report the number of files and the size of data removed. The size and number of files removed by VACUUM vary widely, but it is not uncommon for the size of removed files to exceed the total size of the current version of the table.
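To preview what VACUUM would delete before removing anything, run it with DRY RUN first. A sketch, assuming a hypothetical table name:

```sql
-- List files eligible for removal without deleting anything
VACUUM main.default.events DRY RUN;

-- Remove unreferenced files older than the retention threshold (7 days by default)
VACUUM main.default.events;
```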

What file metrics does OPTIMIZE report?

When OPTIMIZE runs on a target table, new data files combine records from existing data files. Changes committed during OPTIMIZE only impact data organization, and no changes to the underlying data contents occur. The total size of the data files associated with the table increases after OPTIMIZE runs, because the new compacted files coexist in the containing directory with the no-longer-referenced data files.

The size of the table reported after OPTIMIZE is generally smaller than the size before OPTIMIZE runs, because the total size of data files referenced by the current table version decreases with data compaction. VACUUM must run after the retention threshold passes in order to remove the underlying data files.
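A sketch of running compaction and then reclaiming the superseded files, with a hypothetical table name:

```sql
-- Compact small files into larger ones; the command reports metrics
-- such as the number of files added and removed from the table version
OPTIMIZE main.default.events;

-- Later, after the retention threshold passes, remove the superseded files
VACUUM main.default.events;
```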


You might see similar metrics for operations such as REORG TABLE or DROP FEATURE. All operations that require rewriting data files increase the total size of data in the containing directory until VACUUM removes data files no longer referenced in the current table version.
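For example, REORG TABLE with the PURGE option rewrites data files to hard-delete rows that are soft-deleted by deletion vectors; the rewritten files coexist with the originals until VACUUM removes them. A sketch, with a hypothetical table name:

```sql
-- Rewrite files to purge rows marked as deleted by deletion vectors
REORG TABLE main.default.events APPLY (PURGE);
```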