This feature is in Public Preview for Databricks Runtime 13.2 and above.
Archival support in Databricks introduces a collection of capabilities that enable you to use cloud-based lifecycle policies on cloud object storage containing Delta tables.
Without archival support, operations against Delta tables can break because data files or transaction log files have moved to archived locations and are not available when queried. Archival support introduces optimizations to avoid querying archived data when possible and adds new syntax to identify files that must be restored from archive to complete queries.
Databricks supports archival only for S3 Glacier Deep Archive and S3 Glacier Flexible Retrieval. See the AWS documentation on working with archived objects.
Archival support in Databricks optimizes the following operations against Delta tables:

- Queries: Automatically ignore archived files and return results from data in a non-archived storage tier.
- Delta Lake maintenance commands: Automatically ignore archived files and run maintenance on the rest of the table.
- DDL and DML statements that overwrite data or delete data: Mark transaction log entries for target archived data files as deleted.
- FSCK REPAIR TABLE: Ignore archived files and only check for files that have not reached the lifecycle policy.
For queries that must scan archived files to generate correct results, configuring archival support for Delta Lake ensures the following:
Queries fail early if they attempt to access files in archive, reducing wasted compute and allowing users to quickly adapt and re-run queries.
Error messages inform users that a query has failed because the query attempted to access archived files.
Users can generate a report of files that need to be restored using the SHOW ARCHIVED FILES syntax. See Show archived files.
You enable archival support in Databricks for Delta tables by manually specifying the archival interval configured in the underlying cloud lifecycle management policy, as in the following example syntax:
ALTER TABLE <table_name> SET TBLPROPERTIES(delta.timeUntilArchived = 'X days');
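When setting this property across many tables from a notebook or job, it can help to build the statement programmatically. A minimal sketch in Python; the helper name and table name are placeholders, and executing the statement assumes a Spark session (for example, `spark.sql(...)` in a Databricks notebook):

```python
def set_archival_interval_sql(table_name: str, days: int) -> str:
    """Build the ALTER TABLE statement that records the archival interval
    from the cloud lifecycle policy on a Delta table (hypothetical helper)."""
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES(delta.timeUntilArchived = '{days} days')"
    )

# Example usage in a notebook: spark.sql(set_archival_interval_sql("my_table", 180))
```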
Delta Lake does not directly interact with the lifecycle management policies configured in your cloud account. If you update the policy in your cloud account, you must also update the property on your Delta table. See Change the lifecycle management transition rule.
Archival support relies entirely on compatible Databricks compute environments and only works for Delta tables. Configuring archival support does not change behavior, compatibility, or support in OSS Delta Lake clients or Databricks Runtime 13.1 and below.
To identify files that need to be restored to complete a given query, use SHOW ARCHIVED FILES, as in the following example:
SHOW ARCHIVED FILES FOR table_name [ WHERE predicate ];
This operation returns URIs for archived files as a Spark DataFrame.
Delta Lake only has access to the data statistics contained within the transaction log during this operation (minimum value, maximum value, null counts, and total number of records for the first 32 columns). The files returned include all archived files that need to be read to determine whether or not records fulfilling a predicate exist in the file. Databricks recommends providing predicates that include fields on which data is partitioned, z-ordered, or clustered, if possible, to reduce the number of files that need to be restored.
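The file-skipping logic described above can be sketched in Python: given per-file minimum/maximum statistics of the kind stored in the transaction log, a file needs to be restored only if its value range overlaps the queried range. All file names and statistics below are invented for illustration:

```python
def may_contain_matches(stats: dict, lower: int, upper: int) -> bool:
    """True if the file's [min, max] range for a column overlaps the queried
    range [lower, upper]; only such files must be restored from archive."""
    return stats["min"] <= upper and stats["max"] >= lower

# Hypothetical per-file statistics, as recorded in the Delta transaction log.
archived_files = {
    "part-000.parquet": {"min": 1, "max": 100},
    "part-001.parquet": {"min": 101, "max": 200},
    "part-002.parquet": {"min": 201, "max": 300},
}

# Files that must be restored for a predicate like `WHERE id BETWEEN 150 AND 250`.
to_restore = [name for name, stats in archived_files.items()
              if may_contain_matches(stats, 150, 250)]
print(to_restore)  # ['part-001.parquet', 'part-002.parquet']
```

A tighter predicate, especially on partition or clustering columns, shrinks the overlapping ranges and therefore the list of files to restore.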
The following limitations exist:
No support exists for lifecycle management policies that are not based on file creation time. This includes access-time-based policies and tag-based policies.
You cannot use DROP COLUMN on a table with archived files.
REORG TABLE APPLY PURGE makes a best effort, but only works on deletion vector files and referenced data files that are not archived.
PURGE cannot delete archived deletion vector files.
Extending the lifecycle management transition rule results in unexpected behavior. See Extend the lifecycle management transition rule.
If you change the time interval for your cloud lifecycle management transition rule, you must update the delta.timeUntilArchived property to match.
If the time interval before archival is shortened (less time since file creation), archival support for the Delta table continues functioning normally after the table property is updated.
If the time interval before archival is extended (more time since file creation), updating the property delta.timeUntilArchived to the new value can lead to errors. Cloud providers do not automatically restore files from archived storage when data retention policies are changed. This means that files that were previously eligible for archival, but are no longer considered eligible, are still archived.
To avoid errors, never set the property delta.timeUntilArchived to a value greater than the actual age of the most recently archived data.
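This rule can be expressed as a simple check (a sketch; the function and its inputs are hypothetical):

```python
def is_safe_setting(proposed_days: int, youngest_archived_age_days: int) -> bool:
    """A delta.timeUntilArchived value is safe only if it does not exceed the
    age of the most recently archived file; a larger value would make
    Databricks treat some archived files as available, and queries could fail."""
    return proposed_days <= youngest_archived_age_days

# With the youngest archived file 67 days old, 67 is safe but 90 is not.
```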
Consider a scenario in which the time interval for archival is changed from 60 days to 90 days:

- When the policy changes, all records between 60 and 90 days old are already archived.
- For 30 days, no new files are archived (the oldest non-archived files are 60 days old at the time the policy is extended).
- After 30 days have passed, the lifecycle policy correctly describes all archived data.
The delta.timeUntilArchived setting tracks the configured time interval against the file creation time recorded in the Delta transaction log. It has no explicit knowledge of the underlying policy. During the lag period between the old archival threshold and the new archival threshold, you can take one of the following approaches to avoid querying archived files:
You can leave the setting delta.timeUntilArchived at the old threshold until enough time has passed that all files are archived.
Following the example above, during the first 30 days each additional day's worth of data is considered archived by Databricks but not yet archived by the cloud provider. This does not result in an error, but it ignores some data files that could still be queried.
After 30 days, update the delta.timeUntilArchived setting to the new 90-day threshold.
Alternatively, you can update the setting delta.timeUntilArchived each day during the lag period to reflect the current interval.
While the cloud policy is set to 90 days, the actual age of archived data changes in real time. For example, after 7 days, setting delta.timeUntilArchived to '67 days' accurately reflects the age of all data files in archive.
This approach is only necessary if you need access to all data in hot tiers.
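The daily value in this approach follows from simple arithmetic on the example's 60- and 90-day thresholds; a sketch (the function name is invented for illustration):

```python
def current_setting_days(days_since_change: int, old_days: int = 60, new_days: int = 90) -> int:
    """Largest safe delta.timeUntilArchived value during the lag period: the
    age of the youngest archived file, capped at the new threshold."""
    return min(old_days + days_since_change, new_days)

print(current_setting_days(7))   # 67: matches the example above
print(current_setting_days(30))  # 90: the lag period has ended
```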
Updating the value for delta.timeUntilArchived does not actually change which data is archived; it only changes which data Databricks treats as archived.