What are deletion vectors?
Support for reading and writing Delta tables with deletion vectors is in Public Preview in Databricks Runtime 12.1 and above.
Deletion vectors are a storage optimization feature that can be enabled on Delta Lake tables. By default, when a single row in a data file is deleted, the entire Parquet file containing the record must be rewritten. With deletion vectors enabled for the table, DELETE operations use deletion vectors to mark existing rows as removed without rewriting the Parquet file. Subsequent reads on the table resolve current table state by applying the deletions noted by deletion vectors to the most recent table version.
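As a rough illustration of the behavior described above, the following sketch assumes a hypothetical table named events with a hypothetical event_date column, and assumes deletion vectors are already enabled on it. The table and column names are placeholders only.

```sql
-- Hypothetical table and column names; assumes deletion vectors are already enabled.
-- With deletion vectors enabled, this DELETE can mark the matching rows as removed
-- instead of rewriting every Parquet file that contains them.
DELETE FROM events WHERE event_date < '2022-01-01';

-- Readers that support deletion vectors apply the recorded deletions at read time,
-- so the deleted rows no longer appear in query results.
SELECT count(*) FROM events WHERE event_date < '2022-01-01';
```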
Photon leverages deletion vectors for predictive I/O updates, accelerating UPDATE operations. All clients that support reading deletion vectors can read updates that produced deletion vectors, regardless of whether these updates were produced by predictive I/O. See Use predictive I/O to accelerate updates.
Enable deletion vectors
You enable support for deletion vectors on a Delta Lake table by setting a Delta Lake table property:
ALTER TABLE <table_name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
When you enable deletion vectors, the table protocol version is upgraded. Table protocol version upgrades are not reversible. After upgrading, the table will not be readable by Delta Lake clients that do not support deletion vectors. See How does Databricks manage Delta Lake feature compatibility?.
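One way to confirm the result of enabling the property is to inspect the table afterwards. This is a minimal sketch that assumes a hypothetical table named events; substitute your own table name.

```sql
-- Hypothetical table name; substitute your own.
-- Show the current value of the deletion vectors table property.
SHOW TBLPROPERTIES events ('delta.enableDeletionVectors');

-- Inspect table details, including the minReaderVersion and minWriterVersion
-- columns, to see the protocol versions the table now requires.
DESCRIBE DETAIL events;
```

Because the protocol upgrade cannot be reversed, it may be worth checking these values on a test table before changing a production table.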
Apply changes to Parquet data files
Deletion vectors indicate changes to rows as soft-deletes that logically modify existing Parquet data files in the Delta Lake table. These changes are applied physically when data files are rewritten, as triggered by one of the following events:
An OPTIMIZE command is run on the table.
Auto-compaction triggers a rewrite of a data file with a deletion vector.
REORG TABLE ... APPLY (PURGE) is run against the table.
Events related to file compaction do not have strict guarantees for resolving changes recorded in deletion vectors, and some changes recorded in deletion vectors might not be applied if target data files would not otherwise be candidates for file compaction.
REORG TABLE ... APPLY (PURGE) rewrites all data files containing records with modifications recorded using deletion vectors. See REORG TABLE.
Modified data might still exist in the old files. You can run VACUUM to physically delete the old files.
REORG TABLE ... APPLY (PURGE) creates a new version of the table at the time it completes. To fully remove the deleted files, measure the retention threshold for your VACUUM operation from that completion timestamp. See Remove unused data files with vacuum.
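The purge-then-vacuum sequence described above could look like the following sketch. It assumes a hypothetical table named events and assumes no other writes land between the REORG and the history lookup, so that the most recent history entry is the REORG commit.

```sql
-- Hypothetical table name; substitute your own.
-- Rewrite all data files that have changes recorded in deletion vectors.
REORG TABLE events APPLY (PURGE);

-- Look up the most recent commit to find the timestamp at which the REORG completed.
-- The retention threshold for VACUUM is measured from this point.
DESCRIBE HISTORY events LIMIT 1;

-- Preview which files VACUUM would delete, without deleting anything.
VACUUM events DRY RUN;

-- After the retention threshold has elapsed since the REORG completed,
-- physically delete the old files.
VACUUM events;
```

Running VACUUM before the retention threshold has passed since the REORG leaves the old files in place, so the final statement typically needs to be scheduled for later rather than run immediately.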
In Databricks Runtime 12.1 and greater, the following limitations exist:
Delta Sharing is not supported on tables with deletion vectors enabled.
You cannot generate a manifest file for a table with deletion vectors present. To generate a manifest, run REORG TABLE ... APPLY (PURGE) and ensure that no concurrent write operations are running.
You cannot incrementally generate manifest files for a table with deletion vectors enabled.
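The manifest workaround noted above could be sketched as follows, assuming a hypothetical table named events and assuming no concurrent writers add new deletion vectors between the two statements.

```sql
-- Hypothetical table name; substitute your own.
-- Purge existing deletion vectors by rewriting the affected data files.
REORG TABLE events APPLY (PURGE);

-- With no deletion vectors present, generate the manifest for the table.
GENERATE symlink_format_manifest FOR TABLE events;
```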