Skip to main content

Automatic row deletion with auto time-to-live

Auto time-to-live (Auto-TTL) automatically deletes rows from Unity Catalog managed tables after a configurable time period based on the value in a timestamp column. You define an expiration period in days and specify a timestamp column for comparison. Databricks runs DELETE, PURGE, and VACUUM operations in the background to remove expired rows and clear them from storage.

The following are two examples for how to use auto time-to-live:

  • You might want to remove data older than 1 year to keep storage costs low. Expire rows 1 year after creation by specifying a 365 day expiration period on a created_at timestamp column.
  • You might want to remove data that has been marked for deletion by another business process. Expire rows 20 days after a delete request is processed by specifying a 20 day expiration period on a custom del_request_approved timestamp column.
important

Exact deletion timing is not guaranteed and may vary based on system load. To verify deletion, query the predictive optimization system table or run DESCRIBE HISTORY on the table. See System tables.

The buffer time between row expiration and permanent deletion can be up to 6 days plus the value of the data retention table property, which defaults to 7 days. For information on how to configure auto time-to-live to delete within a specific timeframe, see Calculate configuration values for a target expiration period and Configure data retention for time travel queries.

Auto time-to-live is available for Unity Catalog managed Delta Lake tables, Apache Iceberg tables, and Streaming tables with Lakeflow Spark Declarative Pipelines.

Requirements

  • You must turn on predictive optimization. See Predictive optimization for Unity Catalog managed tables.
    • Turning off predictive optimization on a table with auto time-to-live enabled prevents auto time-to-live from running.
  • You must have MODIFY permissions on a table to set or delete an auto time-to-live policy. See Basic table permissions.
  • Databricks Runtime 17.3 and above.
    • Databricks Runtime 17.2 and below can read and write to tables with auto time-to-live.

Turn on auto time-to-live

Turn on auto time-to-live differently depending on the source table:

Delta Lake and Apache Iceberg managed tables

To set an auto time-to-live policy on a new table, specify a non-negative integer for <expiration_days> and a column with a type of DATE, TIMESTAMP, or TIMESTAMP_NTZ for <time_column_name>:

SQL
CREATE TABLE table_name DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>;

To set an auto time-to-live policy on an existing table:

SQL
ALTER TABLE table_name DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>;

For example, to delete rows 30 days after their created_at timestamp:

SQL
ALTER TABLE my_catalog.my_schema.my_table DELETE ROWS 30 DAYS AFTER created_at;

Streaming tables with Lakeflow Spark Declarative Pipelines

To set an auto time-to-live policy on a new streaming table in a Lakeflow Spark Declarative Pipelines pipeline, specify a non-negative integer for <expiration_days> and a column with a type of DATE, TIMESTAMP, or TIMESTAMP_NTZ for <time_column_name>:

SQL
CREATE STREAMING TABLE table_name
DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>
AS SELECT * FROM STREAM(source);

Altering a streaming table to use auto time-to-live using SQL is not supported. To modify auto time-to-live on an existing streaming table, update the pipeline code and republish.

Streaming reads from tables with auto time-to-live

If you use Structured Streaming, Lakeflow Spark Declarative Pipelines, or streaming tables to read from a table with auto time-to-live enabled, set skipChangeCommits on the streaming read. Auto time-to-live deletion operations appear as data changes. Without this setting, the streaming read fails when auto time-to-live deletes rows.

See the following examples:

Python
# Source table with auto time-to-live
spark.sql("ALTER TABLE source_table DELETE ROWS <expiration_days> DAYS AFTER <time_column_name>")

# Structured Streaming read
spark.readStream.format("delta").option("skipChangeCommits", "true").table("source_table")

Verify auto time-to-live is enabled

Use DESCRIBE TABLE EXTENDED to confirm that auto time-to-live is configured. If the autottl.expireInDays and autottl.timestampColumn properties are set, auto time-to-live is enabled.

The auto time-to-live settings appear in the Table Properties row:

SQL
DESCRIBE TABLE EXTENDED table_name;

Alternatively, use SHOW TBLPROPERTIES to see the auto time-to-live properties:

SQL
SHOW TBLPROPERTIES table_name;

Turn off auto time-to-live

To delete an auto time-to-live policy from a managed Delta Lake or Apache Iceberg table:

SQL
ALTER TABLE table_name DROP ROW DELETION;

To delete an auto time-to-live policy on a streaming table, set auto_ttl to None in the pipeline code and republish:

Python
from pyspark import pipelines as dp

@dp.table(
auto_ttl=None
)
def function_name():
return (query)

Data lifecycle

Auto time-to-live can help automate data lifecycle management for tables with time-based retention requirements.

Auto time-to-live has a multi-stage data lifecycle. After a row has expired, predictive optimization asynchronously runs DELETE and VACUUM commands. If deletion vectors are enabled on the table, predictive optimization also runs PURGE before VACUUM to rewrite data files and remove deleted rows. See Purge metadata-only deletes to force data rewrite.

Exact deletion timing is not guaranteed and may vary based on system load. For information on how to verify that data has been deleted, see System tables.

To configure auto time-to-live correctly for your data retention requirements, review the stages below:

Stage

Duration

Description

Expiration period

User defines when turning on auto time-to-live.

The number of days after the time column value when a row becomes eligible for deletion. Set this when you turn on auto time-to-live.

Buffer time

Up to 3 days per command (DELETE, VACUUM)

The delay between when rows become eligible for deletion and when predictive optimization deletes them. Delays can occur between row expiration and each asynchronous command, DELETE and VACUUM. Each delay is typically fewer than 3 days, up to a total of 6 days.

Data retention duration

User defines with a table property.

The length of time that deleted rows remain in storage and accessible by time travel. For Delta Lake tables, configure with delta.deletedFileRetentionDuration. For Apache Iceberg tables, configure with iceberg.deletedFileRetentionDuration. If the property is not set, the default is 7 days. See Configure data retention for time travel queries.

After permanent deletion via VACUUM, deleted rows are no longer accessible through time travel. See Remove unused data files with vacuum.

Here is a visual timeline of the data lifecycle, where a row written at time t moves through four phases before its files are physically removed by VACUUM:

Diagram of the auto time-to-live data lifecycle, showing expiration period, buffer time, data retention period, and permanent deletion phases along a day timeline.

Calculate configuration values for a target expiration period

important

Auto time-to-live deletes data asynchronously. See Data lifecycle.

To set up predictive optimization to remove rows from storage within a target number of days, subtract the maximum buffer time (6 days) and the deleted file retention duration from your target:

Text
target_expiration_days = target_days - 6 - deletedFileRetentionDuration

For example, to remove rows within 30 days with the default 7-day retention period, set expiration_days to 17 DAYS:

Text
target_expiration_days = 30 - 6 - 7 = 17 days

To remove rows within 90 days with a 30-day retention period, set expiration_days to 54 DAYS:

Text
target_expiration_days = 90 - 6 - 30 = 54 days

Monitor auto time-to-live

With system tables you can verify auto time-to-live events, monitor costs, and set alerts for failures.

System tables

Verify auto time-to-live events with the predictive optimization system table. Predictive optimization runs DELETE to remove expired rows, VACUUM to delete them from storage, and optionally PURGE for tables with deletion vectors enabled to create new files without deleted rows.

Run the following query to review auto time-to-live operations across all tables in the past 7 days:

SQL
WITH tables_with_deletes AS (
SELECT DISTINCT catalog_name, schema_name, table_name
FROM system.storage.predictive_optimization_operations_history
WHERE
operation_type = 'DELETE'
AND timestampdiff(day, start_time, now()) < 7
)
SELECT hist.*
FROM system.storage.predictive_optimization_operations_history AS hist
INNER JOIN tables_with_deletes AS t
ON hist.catalog_name = t.catalog_name
AND hist.schema_name = t.schema_name
AND hist.table_name = t.table_name
WHERE
hist.operation_type IN ('DELETE', 'PURGE', 'VACUUM')
AND timestampdiff(day, hist.start_time, now()) < 7
ORDER BY hist.start_time DESC;

Set an alert for auto time-to-live failures

To receive notifications when auto time-to-live operations fail, create a Databricks SQL alert with a query that checks for failed operations in the predictive optimization system table. See Databricks SQL alert for instructions on how to create alerts, and system tables documentation for query examples.

Estimate auto time-to-live costs

Use the following query to see how many DBUs auto time-to-live operations consumed in the past 30 days:

SQL
WITH tables_with_deletes AS (
SELECT DISTINCT table_name
FROM system.storage.predictive_optimization_operations_history
WHERE
operation_type = 'DELETE'
AND timestampdiff(day, start_time, now()) < 30
)
SELECT SUM(usage_quantity) AS total_estimated_dbu
FROM system.storage.predictive_optimization_operations_history AS hist
INNER JOIN tables_with_deletes AS t
ON hist.table_name = t.table_name
WHERE
hist.operation_type IN ('DELETE', 'PURGE', 'VACUUM')
AND hist.usage_unit = 'ESTIMATED_DBU'
AND timestampdiff(day, hist.start_time, now()) < 30;

Review operations on a specific table

Use DESCRIBE HISTORY to see recent operations run on a specific table:

SQL
DESCRIBE HISTORY table_name;

Limitations

The following limitations apply to automatic time-to-live:

important

Exact deletion timing is not guaranteed and may vary based on system load. For information on how to verify that data has been deleted, see System tables.

  • Auto time-to-live is not supported on materialized views.
  • ALTER TABLE and ALTER STREAMING TABLE syntax are not supported for modifying auto time-to-live on streaming tables. To add or change an auto time-to-live policy on an existing streaming table, update the auto_ttl parameter in the pipeline code and republish the pipeline.
  • Column renaming is not supported for time columns that are defined in an auto time-to-live policy. If column mapping is turned on, this limitation still applies. See Rename and drop columns with Delta Lake column mapping.
  • In rare cases, auto time-to-live operations may cause transaction conflicts. To reduce the risk of transaction conflicts, use liquid clustering, which reduces conflicts with row-level concurrency. See Use liquid clustering for tables.