Partition discovery for external tables
This article describes the default partition discovery strategy for Unity Catalog external tables and an optional setting to enable a partition metadata log that makes partition discovery consistent with Hive metastore.
Databricks recommends enabling partition metadata logging for improved read speeds and query performance for Unity Catalog external tables with partitions.
What is the default partition discovery strategy for Unity Catalog?
By default, Unity Catalog recursively lists all directories in the table location to automatically discover partitions. For large tables with many partition directories, this can increase latency for many table operations.
Use partition metadata logging
This feature is in Public Preview.
In Databricks Runtime 13.3 LTS and above, you can optionally enable partition metadata logging, a partition discovery strategy for external tables registered to Unity Catalog. This behavior is consistent with the partition discovery strategy used in Hive metastore. This behavior only impacts Unity Catalog external tables that have partitions and use Parquet, ORC, CSV, or JSON. Databricks recommends enabling the new behavior for improved read speeds and query performance for these tables.
Tables with partition metadata logging enabled demonstrate a behavioral change for partition discovery. Instead of automatically scanning the table location for partitions, Unity Catalog only respects partitions registered in the partition metadata. See Manually add, drop, or repair partition metadata.
This behavior will become the default in a future Databricks Runtime version. Tables with this feature enabled can only be read or written using Databricks Runtime 13.3 LTS and above.
Enable partition metadata logging
To enable partition metadata logging on a table, set the table property when you create an external table, as shown in the following example:
CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
TBLPROPERTIES ('partitionMetadataEnabled' = 'true')
LOCATION 's3://<bucket-path>/<table-directory>';
After you create a table with partition metadata logging enabled, Databricks uses the partition metadata to read the table in all subsequent workloads.
You can also use a Spark conf to enable partition metadata for your current SparkSession. When the conf is enabled, external tables created in that SparkSession are created with the partition metadata table property enabled. The Spark conf is disabled by default.
The following syntax demonstrates using SQL to set a Spark conf in a notebook. You can also set Spark configurations when configuring compute.
SET spark.databricks.nonDelta.partitionLog.enabled = true;
You can override the Spark conf by explicitly enabling or disabling the table property when a table is created.
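For example, with the Spark conf enabled for the session, you can still create a table without partition metadata logging by setting the table property to false explicitly. A sketch using the same placeholder names as the examples above:

```sql
-- Explicitly disable partition metadata logging for this table,
-- overriding the session-level Spark conf
CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
TBLPROPERTIES ('partitionMetadataEnabled' = 'false')
LOCATION 's3://<bucket-path>/<table-directory>';
```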
You can only read and write tables with partition metadata logging enabled in Databricks Runtime 13.3 LTS and above. To read these tables using Databricks Runtime 12.2 LTS or below, you must drop and recreate them with the partition metadata table property disabled.
External tables do not delete underlying data files when you drop them. Databricks recommends using CREATE OR REPLACE syntax to upgrade tables to use partition metadata logging, as in the following example:
CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 's3://<bucket-path>/<table-directory>';
You can check whether a table was created using partition metadata by:
- Checking the table properties section returned by DESCRIBE EXTENDED table_name.
- Checking the table metadata with Catalog Explorer.
The table properties contain partitionMetadataEnabled=true.
Unity Catalog enforces rules on path overlaps for tables and volumes. You cannot register a new Unity Catalog table on a collection of data files if a table already exists in that location.
Work with tables with partition metadata
Databricks recommends using table names in all reads and writes against all tables registered to Unity Catalog. For tables with partition metadata, this guarantees that new partitions added to a table register to Unity Catalog and that queries against the table read all registered partitions.
Using path-based patterns for reads or writes can result in partitions being ignored or not registered to the Unity Catalog metastore. See Limitations.
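As a sketch of the difference, using placeholder names in the style of the examples above:

```sql
-- Table-name-based read: resolves partitions from the Unity Catalog
-- partition metadata, so only registered partitions are read
SELECT * FROM <catalog>.<schema>.<table-name>;

-- Path-based read: lists the directory directly, returning all partition
-- directories regardless of what is registered as partition metadata
SELECT * FROM <format>.`s3://<bucket-path>/<table-directory>`;
```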
List partitions
Use the following command to show all partitions registered to Unity Catalog as partition metadata:
SHOW PARTITIONS <table-name>
To check if a single partition is registered to Unity Catalog, use the following command:
SHOW PARTITIONS <table-name>
PARTITION (<partition-column-name> = <partition-column-value>)
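For instance, for a hypothetical table main.sales.events partitioned by year, checking whether a single partition is registered might look like this (table and column names are placeholders, not from this article):

```sql
-- Returns the partition only if it is registered in Unity Catalog
-- as partition metadata
SHOW PARTITIONS main.sales.events
PARTITION (year = 2021)
```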
Manually add, drop, or repair partition metadata
Unity Catalog requires that all partitions for external tables are contained within the directory registered using the LOCATION clause during table registration.
With partition metadata enabled, automatic discovery of partitions in the table location is disabled. If external systems write data to the table location or you use path-based writes to add or overwrite records in your table, you must manually repair the partition metadata.
Databricks uses Hive-style partitioning for storing tables backed by Parquet, ORC, CSV, and JSON. Hive-style partitions contain key-value pairs connected by an equal sign in the partition directory, for example year=2021/month=01/.
If your table uses Hive-style partitioning, you can use MSCK REPAIR to sync partition metadata in Unity Catalog with partitions that exist in the table location. The following syntax examples demonstrate common operations:
-- Add and remove partition metadata to match directories in table location
MSCK REPAIR TABLE <table_name> SYNC PARTITIONS;
-- Add partitions in the table location that are not registered as partition metadata
MSCK REPAIR TABLE <table_name> ADD PARTITIONS;
-- Drop partitions registered as partition metadata that are not in the table location
MSCK REPAIR TABLE <table_name> DROP PARTITIONS;
See REPAIR TABLE.
Manually specify paths for other partition types
If your table does not use Hive-style partitioning, you must manually specify partition locations when adding partitions. Manually specifying partitions can also reduce latency compared to MSCK REPAIR syntax, especially for tables with a large number of partitions. The following syntax example shows adding a partition:
ALTER TABLE <table-name>
ADD PARTITION (<partition-column-name> = <partition-column-value>)
LOCATION 's3://<bucket-path>/<table-directory>/<partition-directory>';
You can also use ALTER TABLE syntax to drop, rename, recover, and set locations for partitions. See ALTER TABLE … PARTITION.
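The drop and relocate operations might look like the following sketch, using the same placeholder conventions as the earlier examples:

```sql
-- Drop a partition from the partition metadata
-- (underlying data files are not deleted)
ALTER TABLE <table-name>
DROP PARTITION (<partition-column-name> = <partition-column-value>);

-- Point an existing partition at a different directory
ALTER TABLE <table-name>
PARTITION (<partition-column-name> = <partition-column-value>)
SET LOCATION 's3://<bucket-path>/<table-directory>/<new-partition-directory>';
```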
Limitations
The following limitations exist:
- You cannot read or write to tables with partition metadata enabled using Databricks Runtime 12.2 LTS or below.
- Reading a table using the directory path returns all partitions including any that have been manually added or dropped.
- If you insert or overwrite records in a table using a path instead of a table name, the partition metadata is not recorded.
- Avro file format is not supported.