Automatic row deletion with auto time-to-live
Auto time-to-live (Auto-TTL) automatically deletes rows from Unity Catalog managed tables after a configurable time period based on the value in a timestamp column. You define an expiration period in days and specify a timestamp column for comparison. Databricks runs DELETE, PURGE, and VACUUM operations in the background to remove expired rows and clear them from storage.
The following are two examples for how to use auto time-to-live:
- You might want to remove data older than 1 year to keep storage costs low. Expire rows 1 year after creation by specifying a 365 day expiration period on a
created_attimestamp column. - You might want to remove data that has been marked for deletion by another business process. Expire rows 20 days after a delete request is processed by specifying a 20 day expiration period on a custom
del_request_approvedtimestamp column.
Exact deletion timing is not guaranteed and may vary based on system load. To verify deletion, query the predictive optimization system table or run DESCRIBE HISTORY on the table. See System tables.
The buffer time between row expiration and permanent deletion can be up to 6 days plus the value of the data retention table property, which defaults to 7 days. For information on how to configure auto time-to-live to delete within a specific timeframe, see Calculate configuration values for a target expiration period and Configure data retention for time travel queries.
Auto time-to-live is available for Unity Catalog managed Delta Lake tables, Apache Iceberg tables, and Streaming tables with Lakeflow Spark Declarative Pipelines.
Requirements
- You must turn on predictive optimization. See Predictive optimization for Unity Catalog managed tables.
- Turning off predictive optimization on a table with auto time-to-live enabled prevents auto time-to-live from running.
- You must have
MODIFYpermissions on a table to set or delete an auto time-to-live policy. See Basic table permissions. - Databricks Runtime 17.3 and above.
- Databricks Runtime 17.2 and below can read and write to tables with auto time-to-live.
Turn on auto time-to-live
Turn on auto time-to-live differently depending on the source table:
Delta Lake and Apache Iceberg managed tables
To set an auto time-to-live policy on a new table, specify a non-negative integer for <expiration_days> and a column with a type of DATE, TIMESTAMP, or TIMESTAMP_NTZ for <time_column_name>:
CREATE TABLE table_name DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>;
To set an auto time-to-live policy on an existing table:
ALTER TABLE table_name DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>;
For example, to delete rows 30 days after their created_at timestamp:
ALTER TABLE my_catalog.my_schema.my_table DELETE ROWS 30 DAYS AFTER created_at;
Streaming tables with Lakeflow Spark Declarative Pipelines
To set an auto time-to-live policy on a new streaming table in a Lakeflow Spark Declarative Pipelines pipeline, specify a non-negative integer for <expiration_days> and a column with a type of DATE, TIMESTAMP, or TIMESTAMP_NTZ for <time_column_name>:
- SQL
- Python
CREATE STREAMING TABLE table_name
DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>
AS SELECT * FROM STREAM(source);
from pyspark import pipelines as dp
@dp.table(
auto_ttl={"timestamp_column": <time_column_name>, "expire_in_days": <expiration_days>}
)
def function_name():
return (query)
Altering a streaming table to use auto time-to-live using SQL is not supported. To modify auto time-to-live on an existing streaming table, update the pipeline code and republish.
Streaming reads from tables with auto time-to-live
If you use Structured Streaming, Lakeflow Spark Declarative Pipelines, or streaming tables to read from a table with auto time-to-live enabled, set skipChangeCommits on the streaming read. Auto time-to-live deletion operations appear as data changes. Without this setting, the streaming read fails when auto time-to-live deletes rows.
See the following examples:
- Structured Streaming
- Lakeflow Spark Declarative Pipelines
- Streaming tables
# Source table with auto time-to-live
spark.sql("ALTER TABLE source_table DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>")
# Structured Streaming read
spark.readStream.format("delta").option("skipChangeCommits", "true").table("source_table")
from pyspark import pipelines as dp
# Source table with auto time-to-live
spark.sql("ALTER TABLE source_table DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>")
# LDP streaming read
@dp.table
def my_table():
return spark.readStream.format("delta").option("skipChangeCommits", "true").table("source_table")
-- Source table with auto time-to-live
ALTER TABLE source_table DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>;
-- LDP streaming read
CREATE OR REFRESH STREAMING TABLE my_table AS
SELECT * FROM STREAM(source_table) OPTIONS (skipChangeCommits);
Verify auto time-to-live is enabled
Use DESCRIBE TABLE EXTENDED to confirm that auto time-to-live is configured. If the autottl.expireInDays and autottl.timestampColumn properties are set, auto time-to-live is enabled.
The auto time-to-live settings appear in the Table Properties row:
DESCRIBE TABLE EXTENDED table_name;
Alternatively, use SHOW TBLPROPERTIES to see the auto time-to-live properties:
SHOW TBLPROPERTIES table_name;
Turn off auto time-to-live
To delete an auto time-to-live policy from a managed Delta Lake or Apache Iceberg table:
ALTER TABLE table_name DROP ROW DELETION;
To delete an auto time-to-live policy on a streaming table, set auto_ttl to None in the pipeline code and republish:
from pyspark import pipelines as dp
@dp.table(
auto_ttl=None
)
def function_name():
return (query)
Data lifecycle
Auto time-to-live can help automate data lifecycle management for tables with time-based retention requirements.
Auto time-to-live has a multi-stage data lifecycle. After a row has expired, predictive optimization asynchronously runs DELETE and VACUUM commands. If deletion vectors are enabled on the table, predictive optimization also runs PURGE before VACUUM to rewrite data files and remove deleted rows. See Purge metadata-only deletes to force data rewrite.
Exact deletion timing is not guaranteed and may vary based on system load. For information on how to verify that data has been deleted, see System tables.
To configure auto time-to-live correctly for your data retention requirements, review the stages below:
Stage | Duration | Description |
|---|---|---|
Expiration period | User defines when turning on auto time-to-live. | The number of days after the time column value when a row becomes eligible for deletion. Set this when you turn on auto time-to-live. |
Buffer time | Up to 3 days per command ( | The delay between when rows become eligible for deletion and when predictive optimization deletes them. Delays can occur between row expiration and each asynchronous command, |
Data retention duration | User defines with a table property. | The length of time that deleted rows remain in storage and accessible by time travel. For Delta Lake tables, configure with |
After permanent deletion via VACUUM, deleted rows are no longer accessible through time travel. See Remove unused data files with vacuum.
Here is a visual timeline of the data lifecycle, where a row written at time t moves through four phases before its files are physically removed by VACUUM:
Calculate configuration values for a target expiration period
Auto time-to-live deletes data asynchronously. See Data lifecycle.
To set up predictive optimization to remove rows from storage within a target number of days, subtract the maximum buffer time (6 days) and the deleted file retention duration from your target:
target_expiration_days = target_days - 6 - deletedFileRetentionDuration
For example, to remove rows within 30 days with the default 7-day retention period, set expiration_days to 17 DAYS:
target_expiration_days = 30 - 6 - 7 = 17 days
To remove rows within 90 days with a 30-day retention period, set expiration_days to 54 DAYS:
target_expiration_days = 90 - 6 - 30 = 54 days
Monitor auto time-to-live
With system tables you can verify auto time-to-live events, monitor costs, and set alerts for failures.
System tables
Verify auto time-to-live events with the predictive optimization system table. Predictive optimization runs DELETE to remove expired rows, VACUUM to delete them from storage, and optionally PURGE for tables with deletion vectors enabled to create new files without deleted rows.
Run the following query to review auto time-to-live operations across all tables in the past 7 days:
WITH tables_with_deletes AS (
SELECT DISTINCT catalog_name, schema_name, table_name
FROM system.storage.predictive_optimization_operations_history
WHERE
operation_type = 'DELETE'
AND timestampdiff(day, start_time, now()) < 7
)
SELECT hist.*
FROM system.storage.predictive_optimization_operations_history AS hist
INNER JOIN tables_with_deletes AS t
ON hist.catalog_name = t.catalog_name
AND hist.schema_name = t.schema_name
AND hist.table_name = t.table_name
WHERE
hist.operation_type IN ('DELETE', 'PURGE', 'VACUUM')
AND timestampdiff(day, hist.start_time, now()) < 7
ORDER BY hist.start_time DESC;
Set an alert for auto time-to-live failures
To receive notifications when auto time-to-live operations fail, create a Databricks SQL alert with a query that checks for failed operations in the predictive optimization system table. See Databricks SQL alert for instructions on how to create alerts, and system tables documentation for query examples.
Estimate auto time-to-live costs
Use the following query to see how many DBUs auto time-to-live operations consumed in the past 30 days:
WITH tables_with_deletes AS (
SELECT DISTINCT table_name
FROM system.storage.predictive_optimization_operations_history
WHERE
operation_type = 'DELETE'
AND timestampdiff(day, start_time, now()) < 30
)
SELECT SUM(usage_quantity) AS total_estimated_dbu
FROM system.storage.predictive_optimization_operations_history AS hist
INNER JOIN tables_with_deletes AS t
ON hist.table_name = t.table_name
WHERE
hist.operation_type IN ('DELETE', 'PURGE', 'VACUUM')
AND hist.usage_unit = 'ESTIMATED_DBU'
AND timestampdiff(day, hist.start_time, now()) < 30;
Review operations on a specific table
Use DESCRIBE HISTORY to see recent operations run on a specific table:
DESCRIBE HISTORY table_name;
Limitations
The following limitations apply to automatic time-to-live:
Exact deletion timing is not guaranteed and may vary based on system load. For information on how to verify that data has been deleted, see System tables.
- Auto time-to-live is not supported on materialized views.
ALTER TABLEandALTER STREAMING TABLEsyntax are not supported for modifying auto time-to-live on streaming tables. To add or change an auto time-to-live policy on an existing streaming table, update theauto_ttlparameter in the pipeline code and republish the pipeline.- Column renaming is not supported for time columns that are defined in an auto time-to-live policy. If column mapping is turned on, this limitation still applies. See Rename and drop columns with Delta Lake column mapping.
- In rare cases, auto time-to-live operations may cause transaction conflicts. To reduce the risk of transaction conflicts, use liquid clustering, which reduces conflicts with row-level concurrency. See Use liquid clustering for tables.
- If serverless compute can't access GCS because of private service connect, auto time-to-live operations might fail. See Enable Private Service Connect for your workspace.