Bloom filter indexes

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text.

Note

In Databricks Runtime 13.2 and above, Databricks recommends using clustering for Delta table layout. See Use liquid clustering for Delta tables.

How Bloom filter indexes work

The Bloom filter operates by either stating that data is definitively not in the file, or that it is probably in the file, with a defined false positive probability (FPP).

Databricks supports file level Bloom filters; each data file can have a single Bloom filter index file associated with it. Before reading a file Databricks checks the index file and the file is read only if the index indicates that the file might match a data filter. Databricks always reads the data file if an index does not exist or if a Bloom filter is not defined for a queried column.

The size of a Bloom filter depends on the number elements in the set for which the Bloom filter has been created and the required FPP. The lower the FPP, the higher the number of used bits per element and the more accurate it will be, at the cost of more disk space and slower downloads. For example, an FPP of 10% requires 5 bits per element.

A Bloom filter index is an uncompressed Parquet file that contains a single row. Indexes are stored in the _delta_index subdirectory relative to the data file and use the same name as the data file with the suffix index.v1.parquet. For example, the index for data file dbfs:/db1/data.0001.parquet.snappy would be named dbfs:/db1/_delta_index/data.0001.parquet.snappy.index.v1.parquet.

Bloom filters support columns with the following (input) data types: byte, short, int, long, float, double, date, timestamp, and string. Nulls are not added to the Bloom filter, so any null related filter requires reading the data file. Databricks supports the following data source filters: and, or, in, equals, and equalsnullsafe. Bloom filters are not supported on nested columns.

Configuration and reference

Bloom filters are enabled by default. To disable Bloom filters, set the session level spark.databricks.io.skipping.bloomFilter.enabled configuration to false.

Display the list of Bloom filter indexes

To display the list of indexes, run:

spark.table("<table-with-indexes>").schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))

For example:

Show indexes

Example notebook: Bloom filter indexing

The following notebook demonstrates defining a Bloom filter index to speed up “needle in a haystack” queries.

Bloom filter example notebook

Open notebook in new tab