Predictive I/O is a collection of Databricks optimizations that improve performance for data interactions. Predictive I/O capabilities are grouped into the following categories:
Accelerated reads reduce the time it takes to scan and read data.
Accelerated updates reduce the amount of data that needs to be rewritten during updates, deletes, and merges.
Predictive I/O is exclusive to the Photon engine on Databricks.
Predictive I/O is used to accelerate data scanning and filtering performance for all operations on supported compute types.
Predictive I/O reads are supported by the serverless and pro types of SQL warehouses, and Photon-accelerated clusters running Databricks Runtime 11.2 and above.
Predictive I/O improves scanning performance by applying deep learning techniques to do the following:
Determine the most efficient access pattern to read the data and scan only the data that is actually needed.
Eliminate the decoding of columns and rows that are not required to generate query results.
Calculate the probabilities of the search criteria in selective queries matching a row. As queries run, we use these probabilities to anticipate where the next matching row would occur and only read that data from cloud storage.
Support for predictive I/O for updates is in Public Preview for the serverless and pro types of SQL warehouses, as well as Photon-accelerated clusters running Databricks Runtime 12.1 and above.
When you use compute with Photon enabled, predictive I/O updates are used automatically for all tables that have deletion vectors enabled. See What are deletion vectors?.
You enable support for deletion vectors on a Delta Lake table by setting a Delta Lake table property. You can enable deletion vectors when you create a table or by altering an existing table, as in the following examples:
CREATE TABLE <table-name> [options] TBLPROPERTIES ('delta.enableDeletionVectors' = true);
ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
When you enable deletion vectors, the table protocol version is upgraded. Table protocol version upgrades are not reversible. After upgrading, the table will not be readable by Delta Lake clients that do not support deletion vectors. See How does Databricks manage Delta Lake feature compatibility?.
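Because the protocol upgrade cannot be undone, it can be worth checking the table state after you change the property. The following is a minimal sketch, assuming a hypothetical table named events; the table name is illustrative and is not part of the original examples.

```sql
-- Hypothetical table name: events.
-- Show the current value of the deletion vectors table property.
SHOW TBLPROPERTIES events ('delta.enableDeletionVectors');

-- Show table details. The minReaderVersion and minWriterVersion columns
-- report the protocol versions that clients need to read and write the table.
DESCRIBE DETAIL events;
```

Comparing the DESCRIBE DETAIL output before and after you enable deletion vectors shows whether the protocol version changed.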
Predictive I/O leverages deletion vectors to accelerate updates by reducing the frequency of full file rewrites during data modification on Delta tables. Predictive I/O optimizes DELETE, MERGE, and UPDATE operations.
Rather than rewriting all records in a data file when any record is updated or deleted, predictive I/O uses deletion vectors to indicate records have been removed from the target data files. Supplemental data files are used to indicate updates.
Subsequent reads on the table resolve current table state by applying the noted changes to the most recent table version.
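The behavior described above can be exercised with a small example. This is a sketch under stated assumptions: the table name events and its columns are hypothetical, and whether a given statement actually uses deletion vectors depends on the compute type and runtime version listed earlier.

```sql
-- Hypothetical table with deletion vectors enabled at creation time.
CREATE TABLE events (id BIGINT, status STRING)
TBLPROPERTIES ('delta.enableDeletionVectors' = true);

INSERT INTO events VALUES (1, 'new'), (2, 'new'), (3, 'new');

-- On supported compute, these statements can mark rows as removed
-- rather than rewriting the whole data file.
DELETE FROM events WHERE id = 2;
UPDATE events SET status = 'done' WHERE id = 3;

-- Reads resolve the current table state by applying the recorded changes.
-- Returns (1, 'new') and (3, 'done').
SELECT * FROM events ORDER BY id;

-- The table history lists each operation; the operationMetrics column
-- describes what each commit changed.
DESCRIBE HISTORY events;
```

The query result is the same with or without deletion vectors; only the amount of data rewritten by the DELETE and UPDATE statements differs.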
Predictive I/O updates share all limitations with deletion vectors. In Databricks Runtime 12.1 and greater, the following limitations exist:
Delta Sharing is not supported on tables with deletion vectors enabled.
You cannot generate a manifest file for a table with deletion vectors present. Run REORG TABLE ... APPLY (PURGE) and ensure no concurrent write operations are running in order to generate a manifest.
You cannot incrementally generate manifest files for a table with deletion vectors enabled.
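The manifest workaround from the limitations above might look like the following. This is a sketch that assumes a hypothetical table named events and that no concurrent write operations are running while the statements execute.

```sql
-- Hypothetical table name: events.
-- Rewrite data files so that no deletion vectors remain present.
REORG TABLE events APPLY (PURGE);

-- Generate the manifest once the purge has completed.
GENERATE symlink_format_manifest FOR TABLE events;
```

Any write that adds new deletion vectors between the two statements would block manifest generation again, which is why concurrent writes need to be paused first.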