Archival support in Databricks

Preview

This feature is in Public Preview for Databricks Runtime 13.3 LTS and above.

Archival support in Databricks introduces a collection of capabilities that enable you to use cloud-based lifecycle policies on cloud object storage containing Delta tables.

Without archival support, operations against Delta tables can break because data files or transaction log files have moved to archived locations and are not available when queried. Archival support introduces optimizations to avoid querying archived data when possible and adds new syntax to identify files that must be restored from archive to complete queries.

Important

Databricks only has archival support for S3 Glacier Deep Archive and Glacier Flexible Retrieval. See AWS docs on working with archived objects.

Queries optimized for archived data

Archival support in Databricks optimizes the following queries against Delta tables:

Query

New behavior

SELECT * FROM <table_name> [WHERE <partition_predicate>] LIMIT <limit>

Automatically ignore archived files and return results from data in a non-archived storage tier.

Delta Lake maintenance commands: OPTIMIZE, ZORDER, ANALYZE, PURGE

Automatically ignore archived files and run maintenance on the rest of the table.

DDL and DML statements that overwrite data or delete data, including the following: REPLACE TABLE, INSERT OVERWRITE, TRUNCATE TABLE, DROP TABLE

Mark transaction log entries for target archived data files as deleted.

FSCK REPAIR TABLE

Ignore archived files and only check files that haven't reached the archival threshold of the lifecycle policy.

See Limitations.

Early failure and error messages

For queries that must scan archived files to generate correct results, configuring archival support for Delta Lake ensures the following:

  • Queries fail early if they attempt to access files in archive, reducing wasted compute and allowing users to quickly adapt and re-run queries.

  • Error messages inform users that a query has failed because the query attempted to access archived files.

Users can generate a report of files that need to be restored using the SHOW ARCHIVED FILES syntax. See Show archived files.

Enable archival support

You enable archival support in Databricks for Delta tables by manually specifying the archival interval configured in the underlying cloud lifecycle management policy, as in the following example syntax:

ALTER TABLE <table_name> SET TBLPROPERTIES(delta.timeUntilArchived = 'X days');

Delta Lake does not directly interact with the lifecycle management policies configured in your cloud account. If you update the policy in your cloud account, you must also update the delta.timeUntilArchived property on your Delta table. See Change the lifecycle management transition rule.
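Because the table property must be kept in sync with the cloud policy by hand, it can help to generate the statement from the same number of days used in the policy. The following is a minimal sketch, not an official utility: the function name archival_property_statement and the table name main.sales.orders are hypothetical, and the name check is only a simple guard for illustration.

```python
import re


def archival_property_statement(table_name: str, days: int) -> str:
    """Build the ALTER TABLE statement that sets delta.timeUntilArchived."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    # Accept only simple dotted identifiers (up to catalog.schema.table)
    # so the name can be interpolated into the statement safely.
    pattern = r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}"
    if not re.fullmatch(pattern, table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES(delta.timeUntilArchived = '{days} days')"
    )


# Hypothetical table and interval; use the interval from your own cloud policy.
statement = archival_property_statement("main.sales.orders", 180)
print(statement)
```

In a Databricks notebook, the resulting string could be run with spark.sql(statement) or pasted into a SQL cell.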

Important

Archival support relies entirely on compatible Databricks compute environments and only works for Delta tables. Configuring archival support does not change behavior, compatibility, or support in OSS Delta Lake clients or Databricks Runtime 12.2 LTS and below.

Show archived files

To identify files that need to be restored to complete a given query, use SHOW ARCHIVED FILES, as in the following example:

SHOW ARCHIVED FILES FOR table_name [ WHERE predicate ];

This operation returns URIs for archived files as a Spark DataFrame.

Note

Delta Lake only has access to the data statistics contained within the transaction log during this operation (minimum value, maximum value, null counts, and total number of records for the first 32 columns). The files returned include all archived files that need to be read to determine whether or not records fulfilling a predicate exist in the file. Databricks recommends providing predicates that include fields on which data is partitioned, z-ordered, or clustered, if possible, to reduce the number of files that need to be restored.
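Once the URIs have been collected from the result, a restore request usually needs them split into a bucket and an object key. The following sketch shows one way to do that with the Python standard library. The URIs are made-up examples, the function name group_keys_by_bucket is hypothetical, and how the URIs are read out of the DataFrame depends on its column layout, which is not specified here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_keys_by_bucket(uris):
    """Map each bucket name to the list of object keys found in the URIs."""
    grouped = defaultdict(list)
    for uri in uris:
        parsed = urlparse(uri)
        if parsed.scheme not in ("s3", "s3a", "s3n"):
            raise ValueError(f"not an S3 URI: {uri!r}")
        # Assumption: keys contain no '?' or '#', which urlparse would
        # otherwise split off as a query string or fragment.
        grouped[parsed.netloc].append(parsed.path.lstrip("/"))
    return dict(grouped)


# Hypothetical URIs standing in for rows returned by SHOW ARCHIVED FILES.
example_uris = [
    "s3://example-bucket/tables/orders/part-0001.parquet",
    "s3://example-bucket/tables/orders/part-0002.parquet",
]
restore_plan = group_keys_by_bucket(example_uris)
print(restore_plan)
```

Each bucket and key pair could then be submitted to whatever restore mechanism you use in your cloud account; that step is not shown.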

Limitations

The following limitations exist:

  • No support exists for lifecycle management policies that are not based on file creation time. This includes access-time-based policies and tag-based policies.

  • You cannot use DROP COLUMN on a table with archived files.

  • REORG TABLE APPLY PURGE makes a best effort, but only works on deletion vector files and referenced data files that are not archived. PURGE cannot delete archived deletion vector files.

  • Extending the lifecycle management transition rule results in unexpected behavior. See Extend the lifecycle management transition rule.

Change the lifecycle management transition rule

If you change the time interval for your cloud lifecycle management transition rule, you must update the property delta.timeUntilArchived.

If the time interval before archival is shortened (less time since file creation), archival support for the Delta table continues functioning normally after the table property is updated.

Extend the lifecycle management transition rule

If the time interval before archival is extended (more time since file creation), updating the property delta.timeUntilArchived to the new value can lead to errors. Cloud providers do not restore files out of archived storage automatically when data retention policies are changed. This means that files that previously were eligible for archival but now are not considered eligible for archival are still archived.

Important

To avoid errors, never set the property delta.timeUntilArchived to a value greater than the actual age of the most recently archived data.

Consider a scenario in which the time interval for archival is changed from 60 days to 90 days:

  1. When the policy changes, all files between 60 and 90 days old are already archived.

  2. For 30 days, no new files are archived (the oldest non-archived files are 60 days old at the time the policy is extended).

  3. After 30 days have passed, the lifecycle policy correctly describes all archived data.

The delta.timeUntilArchived setting tracks the set time interval against the file creation time recorded by the Delta transaction log. It does not have explicit knowledge of the underlying policy. During the lag period between the old archival threshold and the new archival threshold, you can take one of the following approaches to avoid querying archived files:

  1. You can leave the setting delta.timeUntilArchived at the old threshold until enough time has passed that every archived file is older than the new threshold.

    • Continuing the example above, each day for the first 30 days another day’s worth of data is treated as archived by Databricks but has not yet been archived by the cloud provider. This does not result in an error, but it ignores some data files that could be queried.

    • After 30 days, update delta.timeUntilArchived to 90 days.

  2. You can update the setting delta.timeUntilArchived each day to reflect the current interval during the lag period.

    • While the cloud policy is set to 90 days, the actual age of archived data changes in real time. For example, after 7 days, setting delta.timeUntilArchived to 67 days accurately reflects the age of all data files in archive.

    • This approach is only necessary if you need access to all data in hot tiers.
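The arithmetic behind both approaches can be sketched as follows. During the lag period, the youngest archived data ages by one day per day until it reaches the new threshold, so the largest safe value for delta.timeUntilArchived is the smaller of the old interval plus the days elapsed and the new interval. The function names are hypothetical, and the sketch assumes the cloud policy archives files exactly when they reach the configured age.

```python
def effective_archive_age(old_days: int, new_days: int, days_since_change: int) -> int:
    """Age in days of the youngest archived data after the interval is extended."""
    if new_days < old_days:
        raise ValueError("this sketch only covers extending the interval")
    if days_since_change < 0:
        raise ValueError("days_since_change must not be negative")
    # Nothing new is archived during the lag period, so the youngest
    # archived data simply ages until it reaches the new threshold.
    return min(old_days + days_since_change, new_days)


def is_safe_setting(setting_days: int, old_days: int, new_days: int,
                    days_since_change: int) -> bool:
    """True if the setting does not exceed the age of the youngest archived data."""
    return setting_days <= effective_archive_age(old_days, new_days, days_since_change)


# The 60-to-90-day example, 7 days after the policy change.
print(effective_archive_age(60, 90, 7))   # 67
print(is_safe_setting(90, 60, 90, 7))     # False: 90 exceeds 67
print(is_safe_setting(60, 60, 90, 7))     # True: the old threshold stays safe
```

The first approach corresponds to keeping the setting at old_days until effective_archive_age returns new_days. The second corresponds to setting it to the returned value each day.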

Note

Updating the value for delta.timeUntilArchived does not actually change which data is archived. It only changes which data Databricks treats as if it were archived.
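To illustrate the note, the following sketch classifies a file purely from its creation time and the configured interval. It is a simplified model, not the Databricks implementation: the function name treated_as_archived is hypothetical, and the inclusive comparison at the boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone


def treated_as_archived(created_at: datetime,
                        time_until_archived: timedelta,
                        now: datetime) -> bool:
    """Model of the check: a file counts as archived once its age reaches the interval."""
    # Assumption: the boundary is inclusive (age == interval counts as archived).
    return (now - created_at) >= time_until_archived


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
created_at = now - timedelta(days=75)  # a file that is 75 days old

# The same file is classified differently depending only on the setting;
# its actual storage tier in the cloud is unaffected.
with_60_days = treated_as_archived(created_at, timedelta(days=60), now)
with_90_days = treated_as_archived(created_at, timedelta(days=90), now)
print(with_60_days, with_90_days)
```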