lakebase_vector
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.
Install
First, enable Lakebase Search in your project settings. Then install the extension:
CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;
The CASCADE keyword automatically installs pgvector as a dependency.
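To confirm that both extensions are present after installation, you can query the standard PostgreSQL pg_extension catalog. This is a minimal sketch; it assumes pgvector is registered under its usual extension name, vector.

```sql
-- List the installed extension and its pgvector dependency.
-- Assumption: pgvector is registered under its usual name, 'vector'.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('lakebase_vector', 'vector')
ORDER BY extname;
```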
Quick start
-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));
-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);
-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);
-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
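The query above returns the nearest rows without showing how close they are. As a small sketch against the same items table, the distance can be selected alongside each row, and EXPLAIN can be used to check whether the planner chooses the index. The plan output depends on your data and settings, so it is not shown here.

```sql
-- Return each neighbor together with its L2 distance to the query vector.
SELECT id, embedding <-> '[3,1,2]' AS distance
FROM items
ORDER BY distance
LIMIT 5;

-- Inspect the plan to see whether the lakebase_ann index is used.
EXPLAIN
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```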
Configure the index
Set build_mode at index creation to control the accuracy/speed tradeoff:
standard (default): optimizes for recall. Use for most workloads.
fast: builds faster at lower recall. Use when build time matters more than search quality.
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');
Build indexes concurrently
Use CREATE INDEX CONCURRENTLY to build the index without blocking writes to the table. To rebuild an existing index later without downtime, use REINDEX INDEX CONCURRENTLY:
CREATE INDEX CONCURRENTLY items_embedding_ann ON items
USING lakebase_ann (embedding vector_l2_ops);
REINDEX INDEX CONCURRENTLY items_embedding_ann;
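In standard PostgreSQL, an interrupted concurrent build leaves an invalid index behind, which must be rebuilt or dropped. The following sketch uses only the pg_index catalog and assumes lakebase_ann indexes follow that standard behavior.

```sql
-- indisvalid = false means a concurrent build did not complete.
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indrelid = 'items'::regclass;

-- To recover, either rebuild or drop the invalid index:
-- REINDEX INDEX CONCURRENTLY items_embedding_ann;
-- DROP INDEX CONCURRENTLY items_embedding_ann;
```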
Tune search accuracy
Before tuning, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values.
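As a sketch, the call might look like the following. It assumes an index named items_embedding_ann, as created in the concurrent build example, and that the index name can be passed as a string literal.

```sql
-- Assumption: the index name is passed as a string literal.
SELECT lakebase_ann_index_info('items_embedding_ann');
```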
Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.
SET lakebase_ann.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
lakebase_ann.epsilon controls the re-ranking margin. The default value of 1.9 works well for most workloads.
SET lakebase_ann.epsilon TO '1.5';
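A plain SET lasts for the rest of the session. The sketch below shows two standard PostgreSQL patterns that may help when tuning: SET LOCAL to scope the settings to one transaction, and disabling index scans to get an exact baseline for comparison. The values are illustrative. That the planner falls back to an exact scan when enable_indexscan is off is an assumption based on standard planner behavior.

```sql
-- Scope tuning to one transaction; the settings revert at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL lakebase_ann.probes TO '10';
SET LOCAL lakebase_ann.epsilon TO '1.5';
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;

-- Exact baseline: discourage index scans so the query sorts all rows.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;
```

Comparing the two id lists gives a rough recall estimate for the chosen settings.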
Operator classes
| Distance metric | Operator class | Query operator |
|---|---|---|
| L2 (Euclidean) | vector_l2_ops | <-> |
| Inner product | vector_ip_ops | <#> |
| Cosine distance | vector_cosine_ops | <=> |
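For a metric other than L2, the index operator class and the query operator need to match. A minimal sketch for cosine distance follows; it assumes the pgvector names vector_cosine_ops and <=> carry over unchanged, and the index name items_embedding_cosine is hypothetical.

```sql
-- Assumption: vector_cosine_ops and <=> carry over from pgvector unchanged.
CREATE INDEX items_embedding_cosine ON items
USING lakebase_ann (embedding vector_cosine_ops);

-- <=> returns cosine distance; 1 - distance gives cosine similarity.
SELECT id, 1 - (embedding <=> '[3,1,2]') AS cosine_similarity
FROM items
ORDER BY embedding <=> '[3,1,2]'
LIMIT 5;
```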
Index options reference
| Option | Type | Default | Description |
|---|---|---|---|
| build_mode | string | standard | Controls the accuracy/speed tradeoff at index build time. |
GUC reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| lakebase_ann.probes | integer | (unset) | Number of IVF partitions to scan at query time. Higher values improve recall at the cost of query speed. |
| lakebase_ann.epsilon | float | 1.9 | Re-ranking margin. Valid range: |
Utility functions
| Function | Returns | Description |
|---|---|---|
| | void | Loads an index into memory to eliminate cold-start latency on the first query. |
| lakebase_ann_index_info(index_name) | text | Returns index metadata as text, including lists, default_probes, and default_epsilon. |