
Lakehouse Sync

info

Lakebase Autoscaling is available in the following regions: us-east-1, us-east-2, us-west-2, ca-central-1, sa-east-1, eu-central-1, eu-west-1, eu-west-2, ap-south-1, ap-southeast-1, ap-southeast-2.

Lakebase Autoscaling is the latest version of Lakebase, with autoscaling compute, scale-to-zero, branching, and instant restore. If you are a Lakebase Provisioned user, see Lakebase Provisioned.

note

Lakehouse Sync is in Beta.

What is Lakehouse Sync?

Lakehouse Sync enables continuous, low-latency replication of your Lakebase Postgres tables into Unity Catalog managed Delta tables by capturing row-level changes and writing them as SCD Type 2 history. Each change is appended as a new row, so you keep a full history of how rows changed over time. This sync does not require any external compute, pipelines, or jobs. It is a native Lakebase feature.

Lakehouse Sync data flow from applications through Lakebase to Lakehouse tables in Unity Catalog

Lakehouse Sync uses Change Data Capture (CDC) to stream changes from your Lakebase Postgres database into Unity Catalog. Delta tables follow the naming format lb_<table_name>_history in your chosen catalog and schema, and every insert, update, and delete is appended as a row.

Example use cases

The following are example use cases for Lakehouse Sync, where you stream change data from your Lakebase transactional database into the lakehouse.

| Use case | Description |
| --- | --- |
| Fast analytics | Run aggregates and analytics on Lakebase data. |
| Medallion source | Use Lakebase as the source of your medallion architecture. Delta tables can be processed with Databricks pipelines, Spark Declarative Pipelines (SDP), or Delta Live Tables (DLT) to build downstream tables. |
| Full history in the lakehouse | Preserve the complete history of all changes in the lakehouse while optionally keeping only a subset of data in Lakebase. |

Requirements

  • Lakebase Autoscaling: A Lakebase Autoscaling project running Postgres 17.
  • Source database: Tables must reside in the databricks_postgres database in Lakebase (beta limitation). Each project is created with this default database.
  • Data types: Tables must use only supported column data types. Unsupported types cause the sync to fail for that table.
  • Unity Catalog: The identity configuring the sync needs USE CATALOG, USE SCHEMA, and CREATE TABLE on the destination catalog and schema. See Grant permissions on an object.
  • Lakebase project: Your Postgres role requires CAN MANAGE permissions on the Lakebase project that you are syncing from. If your identity owns the Lakebase project, it has CAN MANAGE permissions by default. See Manage project permissions.

To get started: set replica identity full on the tables you want to sync (Step 1), then start the sync in the Lakebase app (Step 2). Your data appears as lb_<table_name>_history tables in the Unity Catalog catalog and schema you choose. If you want to learn more about how Lakehouse Sync works before you start, see How Lakehouse Sync works.

Configuring the sync

Step 1: Set replica identity full

For a Lakebase table to sync successfully, it must have replica identity set to full. You can configure the sync on an empty schema or one that already contains tables. Partitioned tables are not supported.

By default, Postgres logs only the primary key when a row is updated or deleted. Setting REPLICA IDENTITY FULL tells Postgres to record the full before-and-after row state in the write-ahead log. This is required so the Lakehouse Sync can build a complete update history.

You can run the following commands in the Lakebase SQL Editor or from any Postgres client. This example uses the Lakebase SQL Editor. To open it: in your Databricks workspace, open Lakebase Postgres from the app switcher (top right), select your project and the branch you want to sync (for example, production or main), then select SQL Editor from the sidebar and choose the branch and database. See Query from Lakebase SQL Editor for details.

To set replica identity to full on a single table, run:

SQL
ALTER TABLE <table_name> REPLICA IDENTITY FULL;

Replace <table_name> with your table name.

  • Existing tables: Run this on all existing tables in the schema before starting the sync.
  • New tables: For tables created after the sync is configured, run this before inserting any data. Rows inserted before this property is set are not synced.
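If a schema contains many existing tables, you can set replica identity on all of them in one pass. The following PL/pgSQL block is a sketch: it loops over every ordinary table in the public schema (adjust the schema name to match yours) and runs the ALTER TABLE statement for each.

```sql
-- Sketch: set REPLICA IDENTITY FULL on every ordinary table in a schema.
-- Adjust 'public' to your schema name before running.
DO $$
DECLARE
  tbl text;
BEGIN
  FOR tbl IN
    SELECT format('%I.%I', schemaname, tablename)
    FROM pg_tables
    WHERE schemaname = 'public'
  LOOP
    EXECUTE format('ALTER TABLE %s REPLICA IDENTITY FULL', tbl);
  END LOOP;
END $$;
```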

Check which tables have replica identity set

To see which tables have replica identity set (and which are full), run in Lakebase:

SQL
SELECT n.nspname AS table_schema,
       c.relname AS table_name,
       CASE c.relreplident
         WHEN 'd' THEN 'default'
         WHEN 'n' THEN 'nothing'
         WHEN 'f' THEN 'full'
         WHEN 'i' THEN 'index'
       END AS replica_identity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
ORDER BY n.nspname, c.relname;

Change n.nspname = 'public' to your schema name if different. Only rows with replica_identity = full are ready for the sync.

Step 2: Start the Lakehouse Sync

Lakehouse Sync is configured at the schema level. Once set up, all current and future tables in that schema sync to Unity Catalog.

  1. In your Databricks workspace, open Lakebase Postgres from the app switcher (top right).
  2. Select your Lakebase project and the branch you want to sync (for example, production or main).
  3. Open Branch overview and click the Lakehouse sync tab.
  4. Click Start sync.
  5. In the configuration dialog:
    • Database: Defaults to databricks_postgres.
    • Schema: Select the source Postgres schema to sync.
    • To Catalog: Select the destination Unity Catalog catalog.
    • Schema: Select the destination Unity Catalog schema.
  6. Click Start sync to begin syncing data.

Branch overview with Lakehouse sync tab showing Start sync and schema configuration

Tables appear in your chosen Unity Catalog catalog and schema as lb_<table_name>_history. In the Lakehouse, open Catalog in the sidebar, go to your destination catalog and schema, then open the Tables tab on the schema's Overview to see the Delta tables. On the Lakehouse sync tab in Lakebase you can confirm status and inspect what is syncing.

What you see on the Lakehouse sync tab

When sync is enabled, the top of the tab shows Status: Syncing and that changes are being captured and synced to Delta tables.


Two sub-tabs show the mapping and per-table progress:

  • Schemas: Lists each source schema and its destination catalog and schema in Unity Catalog, with a Status (for example, Syncing) for that schema.
  • Tables: Lists each source table, its destination lb_<table_name>_history table in Unity Catalog, Status (Syncing or Snapshotting), Committed LSN (how far the sync has written to Delta; shown as - while a table is still in initial snapshot), and Last Sync (when the table was last synced).

You can also run SELECT * FROM wal2delta.tables; from the Lakebase SQL Editor (or any Postgres client) to inspect sync status. The result includes table_oid, status (for example, STREAMING or SNAPSHOTTING), committed_lsn, and last_write_time for each table.

info

What is wal2delta? Lakehouse Sync is powered by the wal2delta Postgres extension, which runs inside the Lakebase compute. It uses logical decoding to capture write-ahead log (WAL) changes and writes them to Delta tables in Unity Catalog.

Disabling the sync

Disabling the sync stops replication for all Lakebase schemas that were syncing.

  1. In your Databricks workspace, open Lakebase Postgres from the app switcher (top right).
  2. Select your Lakebase project and the branch where you configured the sync (for example, production or main).
  3. Open Branch overview and click the Lakehouse sync tab.
  4. Click Disable sync. In the confirmation dialog, review the warning that changes will stop syncing to Delta tables, then click Disable again to confirm.

Disabling the sync does not restart your compute.

warning

If you re-enable the sync later, the system does not perform a full re-snapshot. Any changes that occurred while the sync was disabled are permanently missing from the destination Delta table.

How Lakehouse Sync works

Instead of overwriting the destination Unity Catalog table to mirror the current state of your data in Lakebase, the sync appends a new row for every change event. This provides a complete, immutable log of how your data has evolved over time.

  • Destination naming: Delta tables are created in Unity Catalog and use the naming pattern lb_<table_name>_history. These are UC managed Delta tables.
  • Schema-level sync: When you configure the sync for a Lakebase schema, all current and future tables in that schema are synced. Empty tables are not synced. There must be at least one row in the Lakebase table for it to appear in the sync.
  • Dropped tables: If you drop a table in Lakebase, the destination Delta table in Unity Catalog is preserved (not dropped).

You can monitor sync status in the Lakehouse sync tab in the branch overview or by running SELECT * FROM wal2delta.tables; in Lakebase.

Destination Delta table schema

In addition to your data columns, the sync adds these system columns to each destination Delta table in Unity Catalog:

| Column | Type | Description |
| --- | --- | --- |
| _change_type | TEXT | Operation type: insert, delete, update_preimage, or update_postimage. |
| _timestamp | TIMESTAMP | Transaction commit time in Postgres (without timezone). |
| _lsn | BIGINT | Postgres Log Sequence Number. |
| _xid | INTEGER | Postgres Transaction ID. |

Common change patterns

These patterns appear in the destination Delta tables in Unity Catalog:

  • Initial load: The first time the sync runs on an existing Lakebase table, each existing row is written with _change_type = insert.
  • Updates: An update produces two rows: one with _change_type = update_preimage (old row) and one with _change_type = update_postimage (new row).
  • Deletes: A delete produces one row with _change_type = delete.
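As an illustration of these patterns, assuming a source table named users with columns id and name (hypothetical names), you could inspect the change events for a single key with a Databricks SQL query like this, with catalog and schema as placeholders:

```sql
-- Sketch: view the full change history for one key, oldest event first.
SELECT id, name, _change_type, _lsn, _timestamp
FROM <catalog>.<schema>.lb_users_history
WHERE id = 42
ORDER BY _lsn;
```

Per the patterns above, an insert followed by one update and a delete would produce four rows for that key: insert, update_preimage, update_postimage, and delete.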

Limitations and troubleshooting

You can see the table status (which tables are snapshotting, skipped, or streaming) in the Lakehouse sync tab or by running in Lakebase:

SQL
SELECT * FROM wal2delta.tables;

Common reasons a table does not sync:

  • REPLICA IDENTITY FULL not set: Ensure you ran ALTER TABLE <table_name> REPLICA IDENTITY FULL; for each table.
  • Partitioned tables: Lakebase partitioned tables are not supported. Syncing a schema that contains partitioned tables causes those tables to fail to sync.
  • Naming collisions: Delta tables are named lb_<table_name>_history without the source Postgres schema prefix. If you sync two different Postgres schemas (for example, sales.users and marketing.users) to the same Unity Catalog schema, the first table syncs and the second fails due to the name collision. Map Postgres schemas that share table names to different Unity Catalog schemas.
  • Unsupported data types: If a table has an unsupported type, creation of that Delta table fails and the table is not synced. See Data type mapping.

Data type mapping

The sync supports most standard PostgreSQL primitive types. Unsupported types cause the Delta table creation to fail for that table.

| PostgreSQL type | Databricks Delta type | Notes |
| --- | --- | --- |
| BOOLEAN | BOOLEAN | |
| INT, SMALLINT, BIGINT | INT, SMALLINT, BIGINT | |
| TEXT, VARCHAR, CHAR | STRING | |
| JSONB | STRING | Stored as a JSON string. |
| ENUM | STRING | Stored as the enum label. |
| NUMERIC / DECIMAL | DECIMAL | Uses source precision/scale. Defaults to DECIMAL(38, 18) if undefined. |
| DATE | DATE | |
| TIMESTAMP | TIMESTAMP_NTZ | |
| TIMESTAMPTZ | TIMESTAMP | |
| FLOAT, DOUBLE | FLOAT, DOUBLE | |
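For example, a table declared with only types from the mapping above syncs cleanly. This is a sketch with hypothetical table and column names:

```sql
-- Sketch: every column uses a supported type from the mapping above.
CREATE TABLE orders (
  id BIGINT PRIMARY KEY,    -- maps to BIGINT
  payload JSONB,            -- maps to STRING (stored as a JSON string)
  amount NUMERIC(10, 2),    -- maps to DECIMAL(10, 2)
  ordered_at TIMESTAMPTZ,   -- maps to TIMESTAMP
  is_paid BOOLEAN           -- maps to BOOLEAN
);
ALTER TABLE orders REPLICA IDENTITY FULL;
```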

Unsupported types:

  • Geography/Geometry (PostGIS): Types from the PostGIS extension (for example, geometry, geography).
  • Vector (pgvector): The vector type from the pgvector extension.
  • Composite/struct types: Custom types defined with CREATE TYPE ... AS (field_name type, ...). These are row-like types with named fields (sometimes called structs). Columns whose type is a composite type cannot be synced.
  • Map: Map-like key-value types such as hstore (from the hstore extension). Postgres has no built-in map type; hstore is the usual way to store key-value pairs in a column.

Find columns that may block sync

To find columns that may block sync, run the following query in Lakebase. It lists every column whose type is not in the supported set above (including enums). Any table that has one or more such columns does not sync.

SQL
SELECT c.table_schema, c.table_name, c.column_name, c.udt_name AS data_type
FROM information_schema.columns c
JOIN pg_catalog.pg_type t ON t.typname = c.udt_name
WHERE c.table_schema = 'public'
  AND c.table_name IN (SELECT tablename FROM pg_tables WHERE schemaname = c.table_schema)
  AND NOT (
    c.udt_name IN ('bool', 'int2', 'int4', 'int8', 'text', 'varchar', 'bpchar',
                   'jsonb', 'numeric', 'date', 'timestamp', 'timestamptz',
                   'real', 'float4', 'float8')
    OR t.typcategory = 'E'
  )
ORDER BY c.table_schema, c.table_name, c.ordinal_position;

Managing schema changes

Supported: Renaming a table in Postgres (for example, ALTER TABLE users RENAME TO customers) lets the sync continue. The destination Delta table name does not change. It remains lb_users_history.

Breaking: Adding a column, dropping a column, or changing a column's data type is not applied automatically to the destination. Such changes cause the sync for that table to stop receiving new events.

Workaround: To apply schema changes safely without breaking the sync, use the following steps to create a new table and swap it into place. This triggers the sync to recognize a new table and start a fresh history.

  1. Create the new table with the updated schema.

    SQL
    CREATE TABLE users_v2 (
      id INT PRIMARY KEY,
      name TEXT,
      new_column TEXT  -- the new schema change
    );
  2. Set replica identity full.

    SQL
    ALTER TABLE users_v2 REPLICA IDENTITY FULL;
  3. Backfill data from the old table.

    SQL
    -- NULL fills new_column; this relies on users_v2 listing the old columns in the same order.
    INSERT INTO users_v2 SELECT *, NULL FROM users;
  4. Swap the table names.

    SQL
    BEGIN;
    ALTER TABLE users RENAME TO users_backup;
    ALTER TABLE users_v2 RENAME TO users;
    COMMIT;
note

With this approach, you lose the history of the previous table. The new table begins its sync history from scratch.

Deduplicating to create a current-state mirror

For analytics that need the current state of the Lakebase table (a mirror) rather than full history, you can deduplicate in Databricks SQL (for example, in the Lakehouse SQL editor or a notebook attached to a SQL warehouse) using a window function:

SQL
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY _lsn DESC) AS rn
  FROM <catalog>.<schema>.lb_<table_name>_history
  WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
  AND _change_type != 'delete';

Replace id with your table's primary key column. The query partitions by that key, orders by _lsn to get the latest event, and keeps only the latest row per key while excluding rows whose latest event was a delete.
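To make the current-state mirror reusable, you could wrap the deduplication logic in a view. This is a sketch: the view name users_current and the key column id are assumptions, and catalog and schema remain placeholders.

```sql
-- Sketch: persist the current-state mirror as a view over the history table.
CREATE OR REPLACE VIEW <catalog>.<schema>.users_current AS
SELECT * EXCEPT (rn)  -- drop the helper column from the result
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY _lsn DESC) AS rn
  FROM <catalog>.<schema>.lb_users_history
  WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
  AND _change_type != 'delete';
```

Downstream queries can then read the view as if it were the live table, while the underlying history table keeps every change event.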

Next steps

Depending on your goal, use the lb_<table_name>_history Delta tables with other Databricks features: