This is an experimental integration developed by the Delta Lake open-source community. Use with caution and do not use in production.
A Delta table can be read by Redshift Spectrum using a manifest file, which is a text file containing the list of data files to read for querying a Delta table. This article describes how to set up a Redshift Spectrum to Delta Lake integration using manifest files and query Delta tables.
You set up a Redshift Spectrum to Delta Lake integration using the following steps.
Using a cluster running Databricks Runtime 5.5 or above, run the following command on a Delta table at location
GENERATE symlink_format_manifest FOR TABLE delta.`pathToDeltaTable`
generate command generates manifest files at
pathToDeltaTable/_symlink_format_manifest/. In other words, the files in this directory contain the names of the data files (that is, Parquet files) that should be read for reading a snapshot of the Delta table.
We recommend that you define the Delta table in a location that Redshift Spectrum can read directly.
That is, define the Delta table either with a S3 path or with a DBFS path (mounts allowed) whose underlying S3 path is known.
Run the following commands in your Redshift Spectrum environment.
You define a table that reads the file names specified in the manifest files as follows:
CREATE EXTERNAL TABLE spectrum.my_delta_manifest_table(filepath VARCHAR) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '<pathToDeltaTable>/_symlink_format_manifest/';
<pathToDeltaTable> with the full path to the Delta table. The location points to the manifest subdirectory
You define a table that reads all the Parquet files in the Delta table.
CREATE EXTERNAL TABLE spectrum.my_parquet_data_table( id bigint, part bigint, ...) STORED AS PARQUET LOCATION '<pathToDeltaTable>'
Querying the Delta table as this Parquet table will produce incorrect results because the query will read all the Parquet files in this table rather than only those that define a consistent snapshot of the table. You can use the manifest table to get a consistent snapshot data.
To read only the rows belonging to the consistent snapshot defined in the generated manifests, you can apply a filter to keep only the rows in the Parquet table that came from the files defined in the manifest table.
SELECT id, part, "$path" as parquet_filename FROM spectrum.my_parquet_data_table WHERE parquet_filename IN (SELECT filepath FROM spectrum.my_delta_manifest_table);
The query provides you with a consistent view of the Delta table.
When data in a Delta table is updated, you must regenerate the manifests using either of the following approaches:
Update explicitly: Run the
Update automatically: You can configure a Delta table so that all write operations on the table automatically update the manifests. To enable this automatic mode, set the corresponding table property using the following SQL command.
ALTER TABLE delta.`pathToDeltaTable` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
The Redshift Spectrum integration has known limitations in its behavior.
Whenever Delta Lake generates updated manifests, it atomically overwrites existing manifest files. Therefore, Redshift Spectrum will always see a consistent view of the data files; it will see all of the old version files or all of the new version files. However, the granularity of the consistency guarantees depends on whether the table is partitioned or not.
- Unpartitioned tables: All the files names are written in one manifest file which is updated atomically. In this case Redshift Spectrum will see full table snapshot consistency.
- Partitioned tables: A manifest file is partitioned in the same Hive-partitioning-style directory structure as the original Delta table. This means that each partition is updated atomically, and Redshift Spectrum will see a consistent view of each partition but not a consistent view across partitions. Furthermore, since all manifests of all partitions cannot be updated together, concurrent attempts to generate manifests can lead to different partitions having manifests of different versions.
This is an experimental integration and its performance and scalability characteristics have not yet been tested.
Delta Lake supports schema evolution and queries on a Delta table automatically use the latest schema regardless of the schema defined in the table in the Hive metastore. However, Redshift Spectrum uses the schema defined in its table definition, and will not query with the updated schema until the table definition is updated to the new schema.