lakebase_vector

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.

Install

First, enable Lakebase Search in your project settings. Then install the extension:

PostgreSQL
CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;

The CASCADE keyword automatically installs pgvector as a dependency.
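To confirm that both extensions are present after installation, you can query the standard pg_extension catalog. This is a minimal sketch; it assumes pgvector is registered under its usual extension name, vector.

```sql
-- List the installed versions of lakebase_vector and its pgvector dependency.
-- Assumption: pgvector is registered under its usual extension name, "vector".
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('lakebase_vector', 'vector');
```

If either row is missing, the CREATE EXTENSION statement above did not complete for that extension.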

Quick start

PostgreSQL
-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));

-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);

-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);

-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
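To check whether the planner uses the new index for the query above, you can inspect the plan with the standard EXPLAIN command. This is a sketch only: the exact plan text depends on your Lakebase version, and on a table as small as this 1,000-row sample the planner may still prefer a sequential scan.

```sql
-- Show the plan for the quick-start query without executing it.
-- Look for an index scan on the lakebase_ann index rather than a Seq Scan on items.
EXPLAIN
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```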

Configure the index

Set build_mode at index creation to control the accuracy/speed tradeoff:

  • standard (default): optimizes for recall. Use for most workloads.
  • fast: builds faster at lower recall. Use when build time matters more than search quality.

PostgreSQL
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');

Build indexes concurrently

Use CREATE INDEX CONCURRENTLY to build an index without blocking writes to the table. To rebuild an existing index later without downtime, use REINDEX INDEX CONCURRENTLY:

PostgreSQL
CREATE INDEX CONCURRENTLY items_embedding_ann ON items
USING lakebase_ann (embedding vector_l2_ops);

REINDEX INDEX CONCURRENTLY items_embedding_ann;
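In standard PostgreSQL, a concurrent build that fails or is cancelled leaves behind an index marked invalid, which queries ignore but which still adds write overhead. The following sketch lists such indexes through the pg_index catalog; it assumes lakebase_ann indexes behave like built-in index types in this respect.

```sql
-- Find indexes on items that a failed or cancelled concurrent build left invalid.
-- Assumption: lakebase_ann indexes follow standard PostgreSQL CONCURRENTLY semantics.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE indrelid = 'items'::regclass
  AND NOT indisvalid;
```

An invalid index can be dropped with DROP INDEX and created again, or rebuilt in place with REINDEX INDEX CONCURRENTLY.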

Tune search accuracy

Before tuning, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values.
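As a brief illustration, the call looks like the following. The index name items_embedding_ann is taken from the concurrent-build example and is only an assumption about what your index is called; the quoted name is resolved to the function's regclass argument.

```sql
-- Inspect index metadata before changing probes or epsilon.
-- Assumption: the index is named items_embedding_ann.
SELECT lakebase_ann_index_info('items_embedding_ann');
```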

Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.

PostgreSQL
SET lakebase_ann.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
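A plain SET stays in effect for the rest of the session. If you only want a higher probes value for one query, one option is standard PostgreSQL SET LOCAL, which reverts when the transaction ends. The value 20 below is purely illustrative, not a recommendation.

```sql
-- Scope the setting to a single transaction.
BEGIN;
SET LOCAL lakebase_ann.probes TO '20';  -- illustrative value
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;  -- lakebase_ann.probes reverts to its previous value here

-- To undo a session-level SET instead, reset the parameter to its default.
RESET lakebase_ann.probes;
```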

lakebase_ann.epsilon controls the re-ranking margin. Valid values range from 0.0 to 4.0; the default of 1.9 works well for most workloads.

PostgreSQL
SET lakebase_ann.epsilon TO '1.5';
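One way to judge whether a given probes or epsilon setting is adequate is to compare the approximate results against an exact scan for the same query vector. The sketch below does this for a single query. It assumes that turning off the standard enable_indexscan planner setting steers the planner away from the lakebase_ann index, and that the planner uses the index again once it is re-enabled; exact_ids is a hypothetical temporary table name.

```sql
BEGIN;

-- Step 1: exact top 10, with index scans discouraged for this transaction.
SET LOCAL enable_indexscan = off;
CREATE TEMP TABLE exact_ids ON COMMIT DROP AS
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;

-- Step 2: approximate top 10 with the settings under test.
SET LOCAL enable_indexscan = on;
SET LOCAL lakebase_ann.probes TO '10';

-- Step 3: fraction of the exact neighbors that the approximate query also returned.
SELECT count(*) / 10.0 AS recall_at_10
FROM (
    SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10
) AS approx
JOIN exact_ids USING (id);

COMMIT;
```

A single query vector gives a noisy estimate, so in practice you would repeat this over a sample of representative queries and average the results.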

Operator classes

| Distance metric | Operator class | Query operator |
| --- | --- | --- |
| L2 (Euclidean) | vector_l2_ops | <-> |
| Inner product | vector_ip_ops | <#> |
| Cosine distance | vector_cosine_ops | <=> |
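The following sketch shows the non-L2 operator classes in use. It relies on pgvector behavior: an index is only considered when the query operator matches the index's operator class, and <#> returns the negative inner product so that ascending order puts the most similar rows first. That lakebase_ann follows the same matching rule is an assumption here.

```sql
-- Cosine distance: index and query must both use the cosine operator.
CREATE INDEX ON items USING lakebase_ann (embedding vector_cosine_ops);

SELECT id, embedding <=> '[3,1,2]' AS cosine_distance
FROM items
ORDER BY embedding <=> '[3,1,2]'
LIMIT 5;

-- Inner product: <#> yields the negative inner product, so multiply by -1
-- to display the actual inner product while still ordering by the raw operator.
CREATE INDEX ON items USING lakebase_ann (embedding vector_ip_ops);

SELECT id, (embedding <#> '[3,1,2]') * -1 AS inner_product
FROM items
ORDER BY embedding <#> '[3,1,2]'
LIMIT 5;
```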

Index options reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| build_mode | string | 'standard' | Controls the accuracy/speed tradeoff at index build time. 'standard' optimizes for recall; 'fast' builds faster with lower recall. |

GUC reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| lakebase_ann.probes | integer | (unset) | Number of IVF partitions to scan at query time. Higher values improve recall at the cost of query speed. |
| lakebase_ann.epsilon | float | 1.9 | Re-ranking margin. Valid range: 0.0 to 4.0. |

Utility functions

| Function | Returns | Description |
| --- | --- | --- |
| lakebase_ann_prewarm(regclass) | void | Loads an index into memory to eliminate cold-start latency on the first query. |
| lakebase_ann_index_info(regclass) | text | Returns index metadata as text, including lists, default_probes, and default_epsilon. |
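As a short usage sketch, prewarming is a single call that takes the index name. The name items_embedding_ann comes from the concurrent-build example and is an assumption about your schema; when to call it (for example after a restart or an index rebuild) is a judgment call rather than documented guidance.

```sql
-- Load the index into memory ahead of the first query.
-- Assumption: the index is named items_embedding_ann.
SELECT lakebase_ann_prewarm('items_embedding_ann');
```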

Next steps