Dynamic file pruning

Dynamic file pruning (DFP) can significantly improve the performance of many queries on Delta tables. DFP is especially efficient for non-partitioned tables, or for joins on non-partitioned columns. The performance impact of DFP is often correlated with how well the data is clustered, so consider using Z-Ordering to maximize the benefit of DFP.
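
As a minimal sketch of this pattern (the `store_sales` and `item` table names are hypothetical, with `store_sales` standing in for a large, non-partitioned fact table), Z-Ordering the fact table on the join key co-locates matching rows in fewer files, so a selective join can skip more of them:

```sql
-- Z-Order the large fact table on the join key so matching rows are clustered in fewer files.
OPTIMIZE store_sales ZORDER BY (ss_item_sk);

-- A selective join like this lets DFP skip fact-table files whose ss_item_sk ranges
-- cannot match the filtered dimension rows.
SELECT sum(ss_quantity)
FROM store_sales
JOIN item ON ss_item_sk = i_item_sk
WHERE i_category = 'Music';
```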

For background and use cases for DFP, see Faster SQL Queries on Delta Lake with Dynamic File Pruning.

Note

Available in Databricks Runtime 6.1 and above.

DFP is controlled by the following Apache Spark configuration options:

  • spark.databricks.optimizer.dynamicPartitionPruning (default is true): The main flag that directs the optimizer to push down DFP filters. When set to false, DFP is not in effect.
  • spark.databricks.optimizer.deltaTableSizeThreshold (default is 10,000,000,000 bytes (10 GB)): Represents the minimum size in bytes of the Delta table on the probe side of the join required to trigger DFP. If the probe side is not very large, it is probably not worthwhile to push down the filters, and the whole table can simply be scanned. You can find the size of a Delta table by running the DESCRIBE DETAIL table_name command and then looking at the sizeInBytes column (see the example after this list).
  • spark.databricks.optimizer.deltaTableFilesThreshold (default is 1000): Represents the minimum number of files of the Delta table on the probe side of the join required to trigger DFP. When the probe side table contains fewer files than the threshold value, DFP is not triggered. If a table has only a few files, it is probably not worthwhile to enable DFP. You can find the number of files in a Delta table by running the DESCRIBE DETAIL table_name command and then looking at the numFiles column (see the example after this list).
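
The following sketch shows how you might inspect these table properties and adjust the thresholds for the current session; the `store_sales` table name is hypothetical, and the values shown are the defaults listed above:

```sql
-- Inspect the probe-side Delta table; sizeInBytes and numFiles appear in the output.
DESCRIBE DETAIL store_sales;

-- Adjust the thresholds for the current session if needed.
SET spark.databricks.optimizer.deltaTableSizeThreshold = 10000000000;
SET spark.databricks.optimizer.deltaTableFilesThreshold = 1000;
```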