CREATE BLOOM FILTER INDEX (Delta Lake on Databricks)

Creates a Bloom filter index for new or rewritten data only; it does not create Bloom filters for data that already exists. The command fails if the table or any of the listed columns does not exist. If Bloom filtering is already enabled for a column, its existing Bloom filter options are replaced by the new options.


CREATE BLOOMFILTER INDEX
ON [TABLE] table_identifier
[FOR COLUMNS(columnName1 [OPTIONS(..)], columnName2, ...)]
[OPTIONS(..)]
  • table_identifier
    • [database_name.] table_name: A table name, optionally qualified with a database name.
    • delta.`<path-to-table>`: The location of an existing Delta table.
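
As an illustration of the syntax above, the statements below sketch how an index might be declared. The table name `events`, the column names `device_id` and `session_id`, the path, and the option values are all hypothetical; the one-word keyword spelling `BLOOMFILTER` follows the Databricks SQL reference.

```sql
-- Hypothetical table and columns; option values are illustrative only.
-- device_id gets column-level options, session_id uses the defaults.
CREATE BLOOMFILTER INDEX
ON TABLE events
FOR COLUMNS(device_id OPTIONS (fpp=0.1, numItems=50000000), session_id);

-- The same kind of statement against a path-based Delta table
-- (placeholder path, not a real location).
CREATE BLOOMFILTER INDEX
ON TABLE delta.`/path/to/events`
FOR COLUMNS(device_id);
```

Only files written or rewritten after these statements run would carry Bloom filters for the listed columns.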

While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command updates Bloom filters for data that is reorganized. Therefore, you can backfill a Bloom filter by running OPTIMIZE on a table:

  • If you have not previously optimized the table.
  • With a different file size, requiring that the data files be re-written.
  • With a ZORDER (or a different ZORDER, if one is already present), requiring that the data files be re-written.
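
A minimal sketch of such a backfill, assuming a hypothetical table `events` that already has a Bloom filter index defined and a hypothetical column `event_date` to Z-order by:

```sql
-- Rewriting the data files is what triggers Bloom filter creation;
-- here the rewrite is forced by Z-ordering on a hypothetical column.
OPTIMIZE events
ZORDER BY (event_date);
```

If the table is already Z-ordered by that column, this may rewrite little or nothing, in which case a different ZORDER column or file size is needed, as the list above describes.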

You can tune the Bloom filter by defining options at the column level or at the table level:

  • fpp: False positive probability. The desired false positive rate per written Bloom filter. This determines the number of bits needed to store a single item in the Bloom filter and therefore the size of the Bloom filter. The value must be larger than 0 and smaller than or equal to 1. The default value is 0.1, which requires 5 bits per item.
  • numItems: Number of distinct items the file can contain. This setting is important for the quality of filtering as it influences the total number of bits used in the Bloom filter (number of items * number of bits per item). If this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing queries that must download this file, or it is too full and is less accurate (higher FPP). The value must be larger than 0. The default is 1 million items.
  • maxExpectedFpp: The maximum expected false positive probability at which a Bloom filter is still written to disk. If the expected FPP is larger than this threshold, the Bloom filter’s selectivity is too low: the time and resources it takes to use the Bloom filter outweigh its usefulness, so the filter is not written. The value must be between 0 and 1. The default is 1.0 (disabled).
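
As a rough check on the defaults above, the textbook sizing formula for an optimally configured Bloom filter reproduces the figure of 5 bits per item. This is a sketch under the assumption that the standard formula applies; the source does not state which formula is used internally. The symbols are introduced here for illustration: m is the number of bits in the filter, n is numItems, and p is fpp.

```latex
\[
\frac{m}{n} = -\frac{\ln p}{(\ln 2)^2}
\qquad\Rightarrow\qquad
\left.\frac{m}{n}\right|_{p = 0.1} = \frac{2.3026}{0.4805} \approx 4.79
\;\longrightarrow\; 5 \text{ bits per item (rounded up)}
\]
\[
m \approx n \times 5 = 10^{6} \times 5 = 5 \times 10^{6} \text{ bits} \approx 625\ \text{kB per filter}
\]
```

With the default fpp and numItems, each Bloom filter would therefore be on the order of a few hundred kilobytes. Raising numItems or lowering fpp increases that size proportionally to the product of the two terms.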

These options play a role only when writing the data. You can configure these properties at various hierarchical levels: write operation, table level, and column level. The column level takes precedence over the table and operation levels, and the table level takes precedence over the operation level.
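
The precedence between table-level and column-level options can be sketched in a single statement. The table name, column names, and values are hypothetical, and the trailing table-level `OPTIONS` clause is assumed to follow the column list as in the Databricks SQL reference.

```sql
-- Table-level options (trailing clause) act as the fallback;
-- the column-level fpp on device_id overrides the table-level fpp.
CREATE BLOOMFILTER INDEX
ON TABLE events
FOR COLUMNS(device_id OPTIONS (fpp=0.05), session_id)
OPTIONS (fpp=0.1, numItems=1000000);
```

Under the rule described above, `device_id` would be written with an fpp of 0.05, while `session_id`, which has no column-level options, would use the table-level fpp of 0.1.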

See Bloom filter indexes.