VACUUM
Applies to: Databricks SQL, Databricks Runtime
Remove unused files from a table directory.
Note
This command works differently depending on whether you’re working on a Delta or Apache Spark table.
Vacuum a Delta table
Recursively vacuum directories associated with the Delta table. For full details and limitations, see Remove unused data files with vacuum.
VACUUM removes all files from the table directory that are not managed by Delta, as well as data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. VACUUM skips all directories that begin with an underscore (_), which includes the _delta_log directory. Partitioning your table on a column that begins with an underscore is an exception to this rule; VACUUM scans all valid partitions included in the target Delta table. Delta table data files are deleted according to the time they have been logically removed from Delta's transaction log plus the retention hours, not their modification timestamps on the storage system. The default threshold is 7 days.
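For example, to vacuum a Delta table with the default 7-day threshold (the table name events is a placeholder):
VACUUM events;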
On Delta tables, Databricks does not automatically trigger VACUUM operations.
If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.
Warning
It is recommended that you set a retention interval of at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. In Databricks Runtime, if you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
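For example, one way to disable the check for the current session is a SET statement (shown as a sketch; you can also set the property in the cluster's Spark configuration):
SET spark.databricks.delta.retentionDurationCheck.enabled = false;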
VACUUM table_name { { FULL | LITE } | RETAIN num HOURS | DRY RUN } [...]
In Databricks Runtime 16.0 and earlier, RETAIN num HOURS must precede DRY RUN in the syntax.
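For example, on Databricks Runtime 16.0 and earlier the clauses must appear in this order (the table name events is a placeholder):
VACUUM events RETAIN 168 HOURS DRY RUN;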
Important
The retention window for the VACUUM command is determined by the delta.deletedFileRetentionDuration table property, which defaults to 7 days. This means VACUUM removes data files that are no longer referenced by a Delta table version in the last 7 days. To retain data for a longer period, such as to support time travel for longer durations, set this table property to a higher value.
The following example shows setting this threshold to 30 days:
ALTER TABLE table_name SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');
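To confirm the value currently in effect, you can inspect the table property (a sketch using standard Spark SQL):
SHOW TBLPROPERTIES table_name ('delta.deletedFileRetentionDuration');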
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification or options specification.
RETAIN num HOURS
The retention threshold.
DRY RUN
Return a list of up to 1000 files to be deleted.
FULL or LITE
Applies to: Databricks Runtime 16.1 and later
FULL runs the vacuum operation in Full mode. This deletes data files outside of the retention duration and all files in the table directory not referenced by the table.
LITE runs the vacuum operation in Lite mode. Instead of finding all files in the table directory, VACUUM uses the Delta transaction log to identify and remove files no longer referenced by any table versions within the retention duration. If VACUUM LITE cannot be completed because the Delta log has been pruned, a DELTA_CANNOT_VACUUM_LITE exception is raised.
The default is FULL.
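The following sketches illustrate these clauses; the table name events is a placeholder:
-- Preview up to 1000 files that would be deleted, without removing anything
VACUUM events DRY RUN;
-- Remove unreferenced files older than 30 days (720 hours)
VACUUM events RETAIN 720 HOURS;
-- Vacuum using the transaction log rather than listing the table directory
VACUUM events LITE;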
Vacuum a non-Delta table
Recursively vacuums directories associated with the non-Delta table and removes uncommitted files older than a retention threshold. The default threshold is 7 days.
On non-Delta tables, Databricks automatically triggers VACUUM operations as data is written.