VACUUM (Databricks SQL)

Remove unused files from a table directory.

Note

This command works differently depending on whether you’re working on a Delta or Apache Spark table.

Vacuum a Delta table (Delta Lake on Databricks)

Recursively vacuum directories associated with the Delta table. VACUUM removes all files from the table directory that are not managed by Delta, as well as data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold.

VACUUM skips all directories whose names begin with an underscore (_), which includes _delta_log. Partitioning your table on a column whose name begins with an underscore is an exception to this rule: VACUUM scans all valid partitions included in the target Delta table.

Delta table data files become eligible for deletion based on the time they were logically removed from Delta’s transaction log plus the retention hours, not on their modification timestamps on the storage system. The default threshold is 7 days.
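As a rough illustration of the timing rule above, the arithmetic can be sketched in SQL. The removal timestamp is a made-up value, and 168 hours is simply the 7-day default expressed in hours; this is not a query against any Delta metadata.

```sql
-- Hypothetical: a data file was logically removed from the transaction log
-- at 2024-03-01 12:00:00. With the default retention of 7 days (168 hours),
-- VACUUM would only consider it for deletion after the computed timestamp.
SELECT TIMESTAMP '2024-03-01 12:00:00' + INTERVAL 168 HOURS AS eligible_for_vacuum_after;
```

The file's modification time on the storage system plays no part in this calculation.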

On Delta tables, Databricks does not automatically trigger VACUUM operations. See Remove files no longer referenced by a Delta table.

If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.

Warning

Set the retention interval to at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. Choose an interval that is longer than both the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.

Syntax

VACUUM table_name [RETAIN num HOURS] [DRY RUN]

Parameters

  • table_name

    Identifies an existing Delta table. The name must not include a temporal specification.

  • RETAIN num HOURS

    The retention threshold, in hours.

  • DRY RUN

    Return a list of up to 1000 files that would be deleted, without deleting them.
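Examples

The statements below are a minimal sketch of how the syntax and parameters above combine. The table name events is hypothetical, and 240 hours is an arbitrary value chosen to stay above the recommended 7-day minimum.

```sql
-- Preview the files that would be removed (lists up to 1000 files, deletes nothing).
VACUUM events DRY RUN;

-- Remove eligible files using the default retention threshold (7 days).
VACUUM events;

-- Remove eligible files using an explicit retention threshold of 240 hours (10 days).
VACUUM events RETAIN 240 HOURS;
```

Running the DRY RUN form first is a cautious way to check what a given retention threshold would delete before committing to it.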

Vacuum a Spark table (Apache Spark)

Recursively vacuum directories associated with the Spark table and remove uncommitted files older than a retention threshold. The default threshold is 7 days.

On Spark tables, Databricks automatically triggers VACUUM operations as data is written. See Clean up uncommitted files.

Syntax

VACUUM table_name [RETAIN num HOURS]

Parameters

  • table_name

    Identifies an existing table by name or path.

  • RETAIN num HOURS

    The retention threshold, in hours.
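Examples

A short sketch of the Spark table form, under stated assumptions: the table name logs and the path /data/logs are hypothetical, and writing the path as a quoted string literal is an assumption about how a path is supplied.

```sql
-- Remove uncommitted files older than the default threshold (7 days).
VACUUM logs;

-- Identify the table by path (hypothetical location) and use an explicit
-- retention threshold of 240 hours (10 days).
VACUUM '/data/logs' RETAIN 240 HOURS;
```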