This article provides an overview of how you can partition tables on Databricks and specific recommendations around when you should use partitioning for tables backed by Delta Lake. Because of built-in features and optimizations, most tables with less than 1 TB of data do not require partitions.
In Databricks Runtime 8.4 and above, Databricks uses Delta Lake for all tables by default. The following recommendations assume you are working with Delta Lake for all tables.
In Databricks Runtime 11.2 and above, Databricks automatically clusters data in unpartitioned tables by ingestion time. See Use ingestion time clustering.
Databricks recommends you do not partition tables that contains less than a terabyte of data.
Databricks recommends all partitions contain at least a gigabyte of data. Tables with fewer, larger partitions tend to outperform tables with many smaller partitions.
By using Delta Lake and Databricks Runtime 11.2 or above, unpartitioned tables you create benefit automatically from ingestion time clustering. Ingestion time provides similar query benefits to partitioning strategies based on datetime fields without any need to optimize or tune your data.
To maintain ingestion time clustering when you perform a large number of modifications using
MERGE statements on a table, Databricks recommends running
ZORDER BY using a column that matches the ingestion order. For instance, this could be a column containing an event timestamp or a creation date.
While Databricks and Delta Lake build upon open source technologies like Apache Spark, Parquet, Hive, and Hadoop, partitioning motivations and strategies useful in these technologies do not generally hold true for Databricks. If you do choose to partition your table, consider the following facts before choosing a strategy:
Transactions are not defined by partition boundaries. Delta Lake ensures ACID through transaction logs, so you do not need to separate a batch of data by a partition to ensure atomic discovery.
Databricks compute clusters do not have data locality tied to physical media. Data ingested into the lakehouse is stored in cloud object storage. While data is cached to local disk storage during data processing, Databricks uses file-based statistics to identify the minimal amount of data for parallel loading.
You can use Z-order indexes alongside partitions to speed up queries on large datasets.
Most tables can leverage ingestion time clustering to avoid needing to worry about Z-order and partition tuning.
The following rules are important to keep in mind while planning a query optimization strategy based on partition boundaries and Z-order:
Z-order works in tandem with the
OPTIMIZEcommand. You cannot combine files across partition boundaries, and so Z-order clustering can only occur within a partition. For unpartitioned tables, files can be combined across the entire table.
Partitioning works well only for low or known cardinality fields (for example, date fields or physical locations), but not for fields with high cardinality such as timestamps. Z-order works for all fields, including high cardinality fields and fields that may grow infinitely (for example, timestamps or the customer ID in a transactions or orders table).
You cannot Z-order on fields used for partitioning.
Partitions can be beneficial, especially for very large tables. Many performance enhancements around partitioning focus on very large tables (hundreds of terabytes or greater).
Many customers migrate to Delta Lake from Parquet-based data lakes. The
CONVERT TO DELTA statement allows you to convert an existing Parquet-based table to a Delta table without rewriting existing data. As such, many customers have large tables that inherit previous partitioning strategies. Some optimizations developed by Databricks seek to leverage these partitions when possible, mitigating some potential downsides for partitioning strategies not optimized for Delta Lake.
Delta Lake and Apache Spark are open-source technologies. While Databricks continues to introduce features that reduce reliance on partitioning, the open source community might continue to build new features that add complexity.
Some experienced users of Apache Spark and Delta Lake might be able to design and implement a pattern that provides better performance than ingestion time clustering. Implementing a bad partitioning stategy can have very negative repercussions on downstream performance and might require a full rewrite of data to fix. Databricks recommends that most users use default settings to avoid introducing expensive inefficiencies.