lakebase_text
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
The lakebase_text extension adds BM25 full-text search to Lakebase via the lakebase_bm25 index type. It is compatible with PostgreSQL's standard tsvector type and query operators.
Install
First, enable Lakebase Search in your project settings. Then install the extension:
CREATE EXTENSION IF NOT EXISTS lakebase_text;
Why lakebase_text instead of standard GIN full-text search
PostgreSQL's built-in full-text search uses GIN indexes and ts_rank for relevance scoring. ts_rank does not use global corpus statistics, so scores degrade as data grows. lakebase_text improves on this in two ways:
- BM25 ranking accounts for term frequency, document length, and corpus-wide statistics simultaneously, producing more accurate relevance scores than TF-IDF.
- Top-K pushdown uses Block-Max WAND to return only the K most relevant results from the index, without scoring every match in the result set.
Quick start
Build the lakebase_bm25 index after inserting data. BM25 computes corpus-wide statistics at index build time, not incrementally, so the index must be created on a populated table.
-- Create a table with a generated tsvector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
passage TEXT,
vector TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', passage)) STORED
);
-- Insert data before building the BM25 index
INSERT INTO documents (passage) VALUES
('Postgres is a powerful open-source relational database.'),
('Vector search finds semantically similar results.'),
('BM25 ranking improves full-text search relevance scores.');
-- Create the BM25 index on the populated table
CREATE INDEX documents_passage_bm25 ON documents USING lakebase_bm25 (vector);
-- Query: lower score means more relevant
SELECT id, passage,
vector <@> to_bm25query(to_tsvector('english', 'database'), 'documents_passage_bm25') AS score
FROM documents
ORDER BY score
LIMIT 5;
The <@> operator returns a negative BM25 score. Ordering by ascending score returns the most relevant results first.
Keep the index accurate
BM25 statistics are computed at index build time and updated by VACUUM. For most workloads, regular VACUUM keeps scores accurate. After bulk-loading a large amount of new data, run VACUUM manually:
VACUUM documents;
Tune search
Session-level GUCs
Parameter | Type | Default | Description |
|---|---|---|---|
| integer |
| Maximum number of results returned from the index. |
| boolean |
| When |
| boolean |
| Set to |
SET lakebase_bm25.default_limit TO 20;
SET lakebase_bm25.prefilter = on;
GUCs take precedence over index storage parameters when both are set.
Index storage parameters
Set these options at index creation time or with ALTER INDEX:
Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| real |
| 1.2 to 2.0 | Term frequency saturation. Higher values give more weight to repeated terms. |
| real |
| 0.0 to 1.0 | Document length normalization. |
| integer |
| 1 to 65535 | Fallback limit when the session GUC is not set. |
| boolean |
| N/A | Fallback prefilter setting when the session GUC is not set. |
-- Set parameters at index creation
CREATE INDEX documents_passage_bm25 ON documents USING lakebase_bm25 (vector)
WITH (default_limit = 20, k1 = 1.5);
-- Update parameters on an existing index
ALTER INDEX documents_passage_bm25 SET (default_limit = 50);
API reference
Types
bm25query_tsvector: combines a query tsvector with the target index identifier. Used as the right operand of <@>.
Operators
Operator | Signature | Returns | Description |
|---|---|---|---|
|
|
| Returns a negative BM25 score. Order ascending to get the most relevant results first. |
Functions
Function | Returns | Description |
|---|---|---|
|
| Constructs a BM25 query object from a |
Operator classes
Class | Default for | Description |
|---|---|---|
|
| Maps |