Parquet v2
Available in Databricks Runtime 18.1 and above, Parquet v2 improves query performance and reduces storage for Delta Lake and Apache Iceberg tables by using advanced encodings, v2 data page headers, and INT64 timestamps.
How Parquet v2 works
Parquet v2 introduces improvements to the data file format that reduce storage and improve read performance:
- Advanced encodings: Integer and string columns use new encodings with more efficient compression and decoding compared to the encodings used in Parquet v1.
- V2 data page headers: Page-level statistics and indexes improve predicate pushdown and data skipping, reducing the amount of data scanned at query time.
- INT64 timestamps: INT64 timestamps replace the legacy INT96 timestamp, improving column statistics and timestamp encoding and compression.
Enable Parquet v2
Databricks automatically upgrades compatible Unity Catalog managed tables to Parquet v2. See Automatic upgrades.
To enable Parquet v2 manually, set the parquet.format.version table property to 2.12.0 using the appropriate prefix for your table type. Before manually enabling, review the limitations.
To enable Parquet v2 on an existing table:
- Delta Lake:
ALTER TABLE <table_name> SET TBLPROPERTIES ('delta.parquet.format.version' = '2.12.0');
- Iceberg:
ALTER TABLE <table_name> SET TBLPROPERTIES ('iceberg.parquet.format.version' = '2.12.0');
To enable Parquet v2 on a new table:
- Delta Lake:
CREATE TABLE <table_name> (...)
TBLPROPERTIES ('delta.parquet.format.version' = '2.12.0');
- Iceberg:
CREATE TABLE <table_name> (...)
USING iceberg
TBLPROPERTIES ('iceberg.parquet.format.version' = '2.12.0');
After you set the property, all subsequent writes use v2 encodings. Existing data files aren't rewritten automatically, so Parquet v1 and v2 files can coexist in the same table. To rewrite the existing files of a Delta Lake table to v2, use REORG TABLE in Databricks Runtime 18.2 and above:
REORG TABLE <table_name> APPLY (SET PARQUET (FORMAT_VERSION = '2.12.0'));
See Table properties reference for the full list of table properties.
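One way to confirm that the property is set is to query it with SHOW TBLPROPERTIES. The following is a minimal sketch: the table name main.default.events is a hypothetical placeholder, and the example assumes a Delta Lake table. For an Iceberg table, query the iceberg.-prefixed property instead.

```sql
-- Hypothetical table name; replace it with your own table.
-- Returns the current value of the property if it is set on the table.
SHOW TBLPROPERTIES main.default.events ('delta.parquet.format.version');
```

This check only reports the table property. It doesn't show whether existing data files were rewritten, because v1 and v2 files can coexist in the same table.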
Roll back to Parquet v1
In Databricks Runtime 18.2 and above, to roll back an individual table to v1 encoding, run REORG TABLE with the SET PARQUET option:
REORG TABLE <table_name> APPLY (SET PARQUET (FORMAT_VERSION = '1.0.0'));
This command rewrites all data files using v1 encodings and resets the delta.parquet.format.version table property to 1.0.0.
See REORG TABLE for the full syntax.
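To check that a rollback ran, one option is to inspect the Delta table history afterward. This is a sketch under stated assumptions: main.default.events is a hypothetical table name, and this page doesn't say which operation name or metrics the history records for this command, so treat the output as something to review rather than a fixed format.

```sql
-- Hypothetical table name; replace it with your own table.
-- Lists the five most recent operations recorded for the Delta table, newest first.
DESCRIBE HISTORY main.default.events LIMIT 5;
```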
Limitations
Parquet v2 has the following limitations:
- For external engines, some Apache Iceberg readers might not support Parquet v2. Verify that your Iceberg readers are compatible before you enable Parquet v2 on Iceberg tables.
- For OpenSharing, before enabling Parquet v2, verify that the reader clients for OpenSharing recipients support Parquet v2 encodings.
- Materialized views and streaming tables aren't automatically upgraded to Parquet v2. You can manually enable Parquet v2 on these table types on Databricks Runtime 18.1 and above.