A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. The Bloom filter operates by either stating that data is definitively not in the file, or that it is probably in the file, with a defined false positive probability (FPP).
Databricks supports file level Bloom filters; each data file can have a single Bloom filter index file associated with it. Before reading a file Databricks checks the index file and the file is read only if the index indicates that the file might match a data filter. Databricks always reads the data file if an index does not exist or if a Bloom filter is not defined for a queried column.
The size of a Bloom filter depends on the number elements in the set for which the Bloom filter has been created and the required FPP. The lower the FPP, the higher the number of used bits per element and the more accurate it will be, at the cost of more disk space and slower downloads. For example, an FPP of 10% requires 5 bits per element.
A Bloom filter index is an uncompressed Parquet file that contains a single row. Indexes are stored in the
_delta_index subdirectory relative to the data file and use the same name as the data file with the suffix
index.v1.parquet. For example, the index for data file
dbfs:/db1/data.0001.parquet.snappy would be named
Bloom filters support columns with the following (input) data types:
string. Nulls are not added to the Bloom filter, so any null related filter requires reading the data file. Databricks supports the following data source filters:
equalsnullsafe. Bloom filters are not supported on nested columns.
Bloom filters are enabled by default. To disable Bloom filters, set the session level
spark.databricks.io.skipping.bloomFilter.enabled configuration to
To create a Bloom filter index on a table for all columns or a subset of columns for new or rewritten data, use the
CREATE BLOOMFILTER INDEX DDL statement. For example, the following statement creates a Bloom filter index on the column
sha with FPP
50,000,000 distinct items in the column.
CREATE BLOOMFILTER INDEX ON TABLE bloom_test FOR COLUMNS(sha OPTIONS (fpp=0.1, numItems=50000000))
To drop all Bloom filters from a table or for a set of columns within a table, use the
DROP BLOOMFILTER INDEX DDL statement. For example:
DROP BLOOMFILTER INDEX ON TABLE bloom_test FOR COLUMNS(sha);