CREATE BLOOM FILTER INDEX

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

Creates a Bloom filter index for new or rewritten data; it does not create Bloom filters for existing data. The command fails if either the table name or one of the columns does not exist. If Bloom filtering is enabled for a column, existing Bloom filter options are replaced by the new options.

Syntax

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
FOR COLUMNS( { columnName1 [ options ] } [, ...] )
[ options ]

options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

Parameters

While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command updates Bloom filters for data that is reorganized. Therefore, you can backfill a Bloom filter by running OPTIMIZE on a table:

  • If you have not previously optimized the table.

  • With a different file size, requiring that the data files be re-written.

  • With a ZORDER (or a different ZORDER, if one is already present), requiring that the data files be re-written.

You can tune the Bloom filter by defining options at the column level or at the table level:

  • fpp: False positive probability. The desired false positive rate per written Bloom filter. This influences the number of bits needed to put a single item in the Bloom filter and influences the size of the Bloom filter. The value must be larger than 0 and smaller than or equal to 1. The default value is 0.1 which requires 5 bits per item.

  • numItems: Number of distinct items the file can contain. This setting is important for the quality of filtering as it influences the total number of bits used in the Bloom filter (number of items - number of bits per item). If this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing queries that must download this file, or it is too full and is less accurate (higher FPP). The value must be larger than 0. The default is 1 million items.

  • maxExpectedFpp: The maximum expected false positive probability at which a Bloom filter is written to disk. If the expected FPP is larger than this threshold, the Bloom filter’s selectivity is too low; the time and resources it takes to use the Bloom filter outweighs its usefulness. The value must be between 0 and 1. The default is 1.0 (disabled).

These options play a role only when writing the data. You can configure these properties at various hierarchical levels: write operation, table level, and column level. The column level takes precedence over the table and operation levels, and the table level takes precedence over the operation level.

See Bloom filter indexes.