lakebase_vector

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.

Install

First, enable Lakebase Search in your project settings. Then install the extension:

PostgreSQL
CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;

The CASCADE keyword automatically installs pgvector as a dependency.
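
To confirm that both extensions are present, you can query PostgreSQL's built-in pg_extension catalog. This is a minimal check that assumes pgvector is registered under its usual extension name, vector:

```sql
-- List the installed versions of lakebase_vector and its pgvector dependency.
-- pgvector's extension name is "vector".
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('lakebase_vector', 'vector');
```

If the install succeeded, the query should return one row per extension.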

Quick start

PostgreSQL
-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));

-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);

-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);

-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
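
To see whether the planner picks the new index, you can prefix the query with the standard EXPLAIN command. This is a diagnostic sketch only: on a table this small the planner may still choose a sequential scan, and the exact plan text for lakebase_ann is not documented here.

```sql
-- Show the chosen plan without running the query.
EXPLAIN
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```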

Configure the index

Set build_mode at index creation to control the accuracy/speed tradeoff:

standard (default): optimizes for recall. Use for most workloads.
fast: builds faster at lower recall. Use when build time matters more than search quality.

PostgreSQL
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');

Build indexes concurrently

Use CREATE INDEX CONCURRENTLY to build the index without blocking writes to the table. To rebuild an existing index without downtime, use REINDEX INDEX CONCURRENTLY:

PostgreSQL
CREATE INDEX CONCURRENTLY items_embedding_ann ON items
  USING lakebase_ann (embedding vector_l2_ops);

REINDEX INDEX CONCURRENTLY items_embedding_ann;
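
In standard PostgreSQL, a concurrent build that fails or is cancelled leaves behind an invalid index that queries ignore. Assuming lakebase_ann indexes follow the same behavior, a catalog query along these lines shows whether the build completed:

```sql
-- Show each index on items and whether it is valid (usable by queries).
SELECT c.relname AS index_name, i.indisvalid
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
WHERE i.indrelid = 'items'::regclass;

-- If a concurrent build left an invalid index behind, drop it before retrying:
-- DROP INDEX CONCURRENTLY items_embedding_ann;
```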

Tune search accuracy

Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.

Before setting lakebase_ann.probes, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values. Set one probe value per entry in the lists array:

`lists` from index info	`probes` to set
`[]` (empty)	(don't set)
`[222]`	`'22'`
`[3333, 33333]`	`'33, 333'`

Note

The lakebase_ann.probes parameter requires one value per entry in lists. When the lists array is empty (which happens on small tables where the index builder creates no IVF partitions), don't set probes. Setting a value when the lists array is empty causes an error. IVF partitions appear once your dataset is large enough for the index builder to partition it.

PostgreSQL
-- Check your index's lists length first
SELECT lakebase_ann_index_info('items_embedding_ann');

-- Set probes matching the lists array (example: one partition)
SET lakebase_ann.probes TO '22';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
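
A plain SET lasts for the rest of the session. If only one query should use a particular probes value, standard PostgreSQL scoping applies, assuming lakebase_ann.probes behaves like an ordinary session parameter. A sketch:

```sql
-- Scope the probes setting to a single transaction.
BEGIN;
SET LOCAL lakebase_ann.probes TO '22';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;  -- the SET LOCAL value reverts here

-- Or clear a session-level value, for example when lists is empty.
RESET lakebase_ann.probes;
```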

lakebase_ann.epsilon controls the re-ranking margin. Valid values range from 0.0 to 4.0, and the default of 1.9 works well for most workloads. To override it for the current session:

PostgreSQL
SET lakebase_ann.epsilon TO '1.5';
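
To judge whether a probes or epsilon change helps, one common pgvector technique is to compare the approximate results against an exact scan of the same query. The sketch below uses the standard enable_indexscan planner setting to get the exact baseline. It assumes the planner treats lakebase_ann like other index access methods, which you can confirm with EXPLAIN.

```sql
-- Exact baseline: discourage index scans for this transaction only.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;

-- Approximate results with the current probes and epsilon settings.
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
```

The fraction of baseline ids that also appear in the approximate result is the recall for that query. Averaging over a sample of representative queries gives a more reliable figure.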

Operator classes

Distance metric	Operator class	Query operator
L2 (Euclidean)	`vector_l2_ops`	`<->`
Negative inner product	`vector_ip_ops`	`<#>`
Cosine distance	`vector_cosine_ops`	`<=>`

Choose the operator class that matches how your embeddings were trained, and use the same metric for the index and the query:

vector_cosine_ops (<=>) is cosine distance (1 minus cosine similarity). It is the most common choice and suits most text embeddings.
vector_l2_ops (<->) is Euclidean (L2) distance. Use it when absolute spatial distance matters and vectors are not normalized.
vector_ip_ops (<#>) is negative inner product. Use it when vectors are pre-normalized to unit length. For unit vectors, inner product equals cosine similarity and is typically faster.
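
The operators return distances, not similarities, so scores usually need a small conversion before display. The following sketch reuses the items table from the quick start. The index name items_embedding_cos is illustrative, and the conversions follow pgvector's documented operator semantics:

```sql
-- Index for cosine distance (the index name is illustrative).
CREATE INDEX items_embedding_cos ON items
  USING lakebase_ann (embedding vector_cosine_ops);

-- <=> returns cosine distance, so similarity is 1 - distance.
SELECT id, 1 - (embedding <=> '[3,1,2]') AS cosine_similarity
FROM items
ORDER BY embedding <=> '[3,1,2]'
LIMIT 5;

-- <#> returns the negative inner product, so multiply by -1 to recover it.
SELECT id, (embedding <#> '[3,1,2]') * -1 AS inner_product
FROM items
ORDER BY embedding <#> '[3,1,2]'
LIMIT 5;
```

Keep the ORDER BY on the raw operator in ascending order. Sorting by the converted score instead would likely prevent the index from being used. The second query needs a vector_ip_ops index to benefit from ANN search.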

Index options reference

Option	Type	Default	Description
`build_mode`	string	`'standard'`	Controls the accuracy/speed tradeoff at index build time. `'standard'` optimizes for recall; `'fast'` builds faster with lower recall.

GUC reference

Parameter	Type	Default	Description
`lakebase_ann.probes`	integer[]	(unset)	Array of per-partition probe counts, one value per entry in `lists`. Higher values improve recall at the cost of query speed. Check `lakebase_ann_index_info` for the `lists` length to determine how many values to set.
`lakebase_ann.epsilon`	float	`1.9`	Re-ranking margin. Valid range: `0.0` to `4.0`.

Utility functions

Function	Returns	Description
`lakebase_ann_prewarm(regclass)`	void	Loads an index into memory to eliminate cold-start latency on the first query.
`lakebase_ann_index_info(regclass)`	text	Returns index metadata as text, including `lists`, `default_probes`, and `default_epsilon`.
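
Both functions take the index name, as in this short usage sketch. It assumes the items_embedding_ann index created earlier:

```sql
-- Load the index into memory, for example before serving traffic.
SELECT lakebase_ann_prewarm('items_embedding_ann');

-- Inspect lists, default_probes, and default_epsilon before tuning.
SELECT lakebase_ann_index_info('items_embedding_ann');
```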
