Bloom filter indexes

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. The Bloom filter operates by either stating that data is definitively not in the file, or that it is probably in the file, with a defined false positive probability (FPP).

Databricks supports file level Bloom filters; each data file can have a single Bloom filter index file associated with it. Before reading a file Databricks checks the index file and the file is read only if the index indicates that the file might match a data filter. Databricks always reads the data file if an index does not exist or if a Bloom filter is not defined for a queried column.

The size of a Bloom filter depends on the number elements in the set for which the Bloom filter has been created and the required FPP. The lower the FPP, the higher the number of used bits per element and the more accurate it will be, at the cost of more disk space and slower downloads. For example, an FPP of 10% requires 5 bits per element.

A Bloom filter index is an uncompressed Parquet file that contains a single row. Indexes are stored in the _delta_index subdirectory relative to the data file and use the same name as the data file with the suffix index.v1.parquet. For example, the index for data file dbfs:/db1/data.0001.parquet.snappy would be named dbfs:/db1/_delta_index/data.0001.parquet.snappy.index.v1.parquet.

Bloom filters support columns with the following (input) data types: byte, short, int, long, float, double, date, timestamp, and string. Nulls are not added to the Bloom filter, so any null related filter requires reading the data file. Databricks supports the following data source filters: and, or, in, equals, and equalsnullsafe. Bloom filters are not supported on nested columns.

Configuration

Bloom filters are enabled by default. To disable Bloom filters, set the session level spark.databricks.io.skipping.bloomFilter.enabled configuration to false.

Create a Bloom filter index

To create a Bloom filter index on a table for all columns or a subset of columns for new or rewritten data, use the CREATE BLOOMFILTER INDEX DDL statement. For example, the following statement creates a Bloom filter index on the column sha with FPP 0.1 and 50,000,000 distinct items in the column.

CREATE BLOOMFILTER INDEX
ON TABLE bloom_test
FOR COLUMNS(sha OPTIONS (fpp=0.1, numItems=50000000))

See Create Bloom Filter Index.

Drop a Bloom filter index

To drop all Bloom filters from a table or for a set of columns within a table, use the DROP BLOOMFILTER INDEX DDL statement. For example:

DROP BLOOMFILTER INDEX ON TABLE bloom_test FOR COLUMNS(sha);

See Drop Bloom Filter Index.

Notebook

The following notebook demonstrates how defining an Bloom filter index speeds up “needle in a haystack” queries.

Bloom filter demo notebook

Open notebook in new tab