VACUUM

Clean up files associated with a table. There are different versions of this command for Delta and Apache Spark tables.

Vacuum a Delta table (Delta Lake on Databricks)

Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. A file becomes eligible for deletion based on the time at which it was logically removed from Delta’s transaction log plus the retention hours, not on its modification timestamp on the storage system. The default threshold is 7 days. Databricks does not automatically trigger VACUUM operations on Delta tables. See Remove files no longer referenced by a Delta table.

If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.
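As a rough illustration of this effect, the sketch below assumes a hypothetical Delta table named events and a hypothetical version number 5 that was superseded longer ago than the retention period. Both names are placeholders, and whether the second query fails depends on which data files VACUUM actually removed.

```sql
-- Hypothetical table `events`; version 5 is an illustrative placeholder.
-- Before VACUUM, the older version can still be read via time travel.
SELECT * FROM events VERSION AS OF 5;

-- Remove unreferenced data files older than the default retention threshold.
VACUUM events;

-- If version 5 relied on files that VACUUM deleted, this query can now fail.
SELECT * FROM events VERSION AS OF 5;
```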

Warning

Databricks recommends that you set a retention interval to be at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.

Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
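A minimal sketch of turning off the safety check for a single session follows. The table name events and the 24-hour interval are hypothetical, and the example assumes that no reader, writer, or stream on the table needs files older than that interval.

```sql
-- Assumption: no concurrent transaction or lagging stream needs files
-- older than 24 hours. The table name `events` is a placeholder.

-- Turn off the retention duration safety check for this session.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Vacuum with a retention interval shorter than the 7-day default.
VACUUM events RETAIN 24 HOURS;

-- Turn the safety check back on afterwards.
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
```

Re-enabling the check at the end keeps later VACUUM commands in the same session protected.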

Syntax

VACUUM table_identifier [RETAIN num HOURS] [DRY RUN]
  • table_identifier

    • [database_name.] table_name: A table name, optionally qualified with a database name.
    • delta.`<path-to-table>` : The location of an existing Delta table.
  • RETAIN num HOURS

    The retention threshold, in hours.

  • DRY RUN

    Return a list of files to be deleted, without deleting them.
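The parameters above can be combined as in the following sketch. The table name events and the path /data/events/ are hypothetical placeholders, not names taken from this page.

```sql
-- Vacuum a table by name, using the default 7-day retention threshold.
VACUUM events;

-- Vacuum a Delta table identified by its storage location.
VACUUM delta.`/data/events/`;

-- Keep files removed within the last 240 hours (10 days).
VACUUM events RETAIN 240 HOURS;

-- Preview the files that the previous command would delete.
VACUUM events RETAIN 240 HOURS DRY RUN;
```

Running the DRY RUN form first is a low-risk way to check what a given retention interval would remove.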

Vacuum a Spark table

Recursively vacuum directories associated with the Spark table and remove uncommitted files older than a retention threshold. The default threshold is 7 days. Databricks automatically triggers VACUUM operations as data is written. See Clean up uncommitted files.

Syntax

VACUUM [table_identifier | path] [RETAIN num HOURS]
  • table_identifier

    [database_name.] table_name: A table name, optionally qualified with a database name.

  • path

    Path to the table files.

  • RETAIN num HOURS

    The retention threshold, in hours.
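A short sketch of the Spark table form follows. The table name events_parquet and the path /data/events are hypothetical, and the example assumes that the path is written as a quoted string literal.

```sql
-- Vacuum a Spark table by name, using the default retention threshold.
VACUUM events_parquet;

-- Vacuum a path-based table, keeping uncommitted files newer than 240 hours.
-- Assumption: the path is given as a quoted string literal.
VACUUM '/data/events' RETAIN 240 HOURS;
```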