lakebase_text

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

The lakebase_text extension adds BM25 full-text search to Lakebase via the lakebase_bm25 index type. It is compatible with PostgreSQL's standard tsvector type and query operators.

Install

First, enable Lakebase Search in your project settings. Then install the extension:

PostgreSQL
CREATE EXTENSION IF NOT EXISTS lakebase_text;

Why lakebase_text instead of standard GIN full-text search

PostgreSQL's built-in full-text search uses GIN indexes and ts_rank for relevance scoring. ts_rank does not use global corpus statistics, so scores degrade as data grows. lakebase_text improves on this in two ways:

BM25 ranking accounts for term frequency, document length, and corpus-wide statistics simultaneously, producing more accurate relevance scores than TF-IDF.
Top-K pushdown uses Block-Max WAND to return only the K most relevant results from the index, without scoring every match in the result set.

Quick start

Build the lakebase_bm25 index after inserting data. BM25 computes corpus-wide statistics at index build time, not incrementally, so the index must be created on a populated table.

PostgreSQL
-- Create a table with a generated tsvector column
CREATE TABLE documents (
  id      SERIAL PRIMARY KEY,
  passage TEXT,
  vector  TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', passage)) STORED
);

-- Insert data before building the BM25 index
INSERT INTO documents (passage) VALUES
  ('Postgres is a powerful open-source relational database.'),
  ('Vector search finds semantically similar results.'),
  ('BM25 ranking improves full-text search relevance scores.');

-- Create the BM25 index on the populated table
CREATE INDEX documents_passage_bm25 ON documents USING lakebase_bm25 (vector);

-- Query: lower score means more relevant
SELECT id, passage,
  vector <@> to_bm25query(to_tsvector('english', 'database'), 'documents_passage_bm25') AS score
FROM documents
ORDER BY score
LIMIT 5;

The <@> operator returns a negative BM25 score. Ordering by ascending score returns the most relevant results first.

Keep the index accurate

BM25 statistics are computed at index build time and updated by VACUUM. For most workloads, regular VACUUM keeps scores accurate. After bulk-loading a large amount of new data, run VACUUM manually:

PostgreSQL
VACUUM documents;

Tune search

Session-level GUCs

Parameter	Type	Default	Description
`lakebase_bm25.default_limit`	integer	`1000`	Maximum number of results returned from the index.
`lakebase_bm25.prefilter`	boolean	`false`	When `true`, evaluates `WHERE` conditions before computing BM25 scores. Use when filters eliminate many rows and are cheap to evaluate.
`lakebase_bm25.enable_scan`	boolean	`true`	Set to `false` to force a sequential scan, bypassing the index. Useful for testing.

Parameter	Type	Default	Description
`lakebase_bm25.default_limit`	integer	`1000`	Maximum number of results returned from the index.
`lakebase_bm25.prefilter`	boolean	`false`	When `true`, evaluates `WHERE` conditions before computing BM25 scores. Use when filters eliminate many rows and are cheap to evaluate.
`lakebase_bm25.enable_scan`	boolean	`true`	Set to `false` to force a sequential scan, bypassing the index. Useful for testing.

PostgreSQL
SET lakebase_bm25.default_limit TO 20;
SET lakebase_bm25.prefilter = on;

GUCs take precedence over index storage parameters when both are set.

Index storage parameters

Set these options at index creation time or with ALTER INDEX:

Parameter	Type	Default	Range	Description
`k1`	real	`1.2`	1.2 to 2.0	Term frequency saturation. Higher values give more weight to repeated terms.
`b`	real	`0.75`	0.0 to 1.0	Document length normalization. `0.0` disables length normalization; `1.0` applies full normalization.
`default_limit`	integer	`1000`	1 to 65535	Fallback limit when the session GUC is not set.
`prefilter`	boolean	`false`	N/A	Fallback prefilter setting when the session GUC is not set.

Parameter	Type	Default	Range	Description
`k1`	real	`1.2`	1.2 to 2.0	Term frequency saturation. Higher values give more weight to repeated terms.
`b`	real	`0.75`	0.0 to 1.0	Document length normalization. `0.0` disables length normalization; `1.0` applies full normalization.
`default_limit`	integer	`1000`	1 to 65535	Fallback limit when the session GUC is not set.
`prefilter`	boolean	`false`	N/A	Fallback prefilter setting when the session GUC is not set.

PostgreSQL
-- Set parameters at index creation (use a new name — the Quick start already created documents_passage_bm25)
CREATE INDEX documents_passage_bm25_tuned ON documents USING lakebase_bm25 (vector)
  WITH (default_limit = 20, k1 = 1.5);

-- Update parameters on an existing index
ALTER INDEX documents_passage_bm25_tuned SET (default_limit = 50);

API reference

Types

bm25query_tsvector: combines a query tsvector with the target index identifier. Used as the right operand of <@>.

Operators

Operator	Signature	Returns	Description
`<@>`	`tsvector <@> bm25query_tsvector`	`double precision`	Returns a negative BM25 score. Order ascending to get the most relevant results first.

Functions

Function	Returns	Description
`to_bm25query(query tsvector, index regclass)`	`bm25query_tsvector`	Constructs a BM25 query object from a `tsvector` and the index's object identifier.

Operator classes

Class	Default for	Description
`tsvector_bm25_ops`	`tsvector`	Maps `tsvector` columns to the `<@>` operator for BM25 scoring. This is the default operator class for `tsvector` with `lakebase_bm25`; you do not need to specify it explicitly.

Class	Default for	Description
`tsvector_bm25_ops`	`tsvector`	Maps `tsvector` columns to the `<@>` operator for BM25 scoring. This is the default operator class for `tsvector` with `lakebase_bm25`; you do not need to specify it explicitly.

Install​

Why lakebase_text instead of standard GIN full-text search​

Quick start​

Keep the index accurate​

Tune search​

Session-level GUCs​

Index storage parameters​

API reference​

Types​

Operators​

Functions​

Operator classes​

Next steps​