Configure PostgreSQL for ingestion into Databricks
The PostgreSQL connector for Lakeflow Connect is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.
This page describes the source setup tasks for ingestion from PostgreSQL into Databricks using Lakeflow Connect.
Logical replication for change data capture
The PostgreSQL connector uses logical replication to track changes in the source tables. Logical replication allows the connector to capture data modifications (inserts, updates, and deletes) without requiring triggers or significant overhead on the source database.
Logical replication with Lakeflow Connect requires the following:
- Lakeflow Connect supports data replication from PostgreSQL version 13 and later.
- Configure the database for logical replication: the PostgreSQL parameter `wal_level` must be set to `logical`.
- Create publications that include all tables you want to replicate.
- Create replication slots for each catalog that will be replicated. Publications must be created before creating replication slots.
For more information about logical replication, see the Logical Replication documentation on the PostgreSQL website.
Overview of source setup tasks
Complete the following tasks in PostgreSQL before ingesting data into Databricks:
- Verify that the source is PostgreSQL 13 or above.
- Configure network access (security groups, firewall rules, or VPN).
- Configure logical replication:
  - Enable logical replication (`wal_level = logical`).
  - Create a replication user with the required privileges. See PostgreSQL database user requirements.
  - Set the replica identity for tables. See Set replica identity for tables.
  - Create publications and replication slots.
Optional: Configure inline DDL tracking for automatic schema change detection. To opt in to inline DDL tracking, reach out to Databricks Support.
If you plan to replicate from multiple PostgreSQL databases, you must create a separate publication and replication slot for each database. The inline DDL tracking script (if used) must also be executed in each database.
Configure logical replication
To enable logical replication in PostgreSQL, configure the database settings and set up the necessary objects.
Set the WAL level to logical
The Write-Ahead Log (WAL) must be configured for logical replication. This setting typically requires a database restart.
1. Check the current `wal_level` setting:

   ```sql
   SHOW wal_level;
   ```

2. If the value is not `logical`, set `wal_level = logical` in the server configuration and restart the PostgreSQL service.
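On self-managed PostgreSQL, you can make this change from a SQL session instead of editing `postgresql.conf` directly; a minimal sketch (`ALTER SYSTEM` requires superuser privileges, and the restart is still required):

```sql
-- Writes the setting to postgresql.auto.conf; requires superuser
ALTER SYSTEM SET wal_level = 'logical';
-- wal_level only takes effect after a server restart
```

On managed services, set the equivalent parameter through the provider's parameter group or flags instead (see Cloud-specific configuration notes).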
Create a replication user
Create a dedicated user for Databricks ingestion with replication privileges:
```sql
CREATE USER databricks_replication WITH PASSWORD 'your_secure_password';
GRANT CONNECT ON DATABASE your_database TO databricks_replication;
ALTER USER databricks_replication WITH REPLICATION;
```
For detailed privilege requirements, see PostgreSQL database user requirements.
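In addition to the `REPLICATION` attribute, the ingestion user typically needs read access to the schemas and tables it will snapshot. The following is a sketch only, assuming your tables live in a schema named `schema_name`; see PostgreSQL database user requirements for the authoritative privilege list:

```sql
-- Hypothetical schema name; adjust to your environment
GRANT USAGE ON SCHEMA schema_name TO databricks_replication;
GRANT SELECT ON ALL TABLES IN SCHEMA schema_name TO databricks_replication;
-- Cover tables the current role creates in this schema later
ALTER DEFAULT PRIVILEGES IN SCHEMA schema_name GRANT SELECT ON TABLES TO databricks_replication;
```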
Set replica identity for tables
For each table you want to replicate, configure the replica identity. The correct setting depends on the table structure:
| Table structure | Required REPLICA IDENTITY | Command |
|---|---|---|
| Table has a primary key and does not contain TOASTable columns (for example, `TEXT`, `JSONB`, or `BYTEA`) | `DEFAULT` | `ALTER TABLE schema_name.table_name REPLICA IDENTITY DEFAULT;` |
| Table has a primary key but includes large variable-length (TOASTable) columns | `FULL` | `ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL;` |
| Table does not have a primary key | `FULL` | `ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL;` |
For more information about replica identity settings, see Replica Identity in the PostgreSQL documentation.
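To inspect what a table is currently set to before changing it, you can query `pg_class` directly (`schema_name.table_name` below is a placeholder):

```sql
-- relreplident codes: 'd' = DEFAULT, 'f' = FULL, 'n' = NOTHING, 'i' = USING INDEX
SELECT relreplident
FROM pg_class
WHERE oid = 'schema_name.table_name'::regclass;
```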
Create a publication
Create a publication in each database that includes the tables you want to replicate:
```sql
-- Create a publication for specific tables
CREATE PUBLICATION databricks_publication FOR TABLE schema_name.table1, schema_name.table2;

-- Or create a publication for all tables in a database
CREATE PUBLICATION databricks_publication FOR ALL TABLES;
```
You must create a separate publication in each PostgreSQL database that you want to replicate.
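After creating a publication, you can confirm which tables it covers and extend it later if needed; for example (`schema_name.table3` is a placeholder):

```sql
-- List the tables included in the publication
SELECT * FROM pg_publication_tables WHERE pubname = 'databricks_publication';

-- Add a table to an existing publication (not applicable to FOR ALL TABLES publications)
ALTER PUBLICATION databricks_publication ADD TABLE schema_name.table3;
```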
Configure replication slot parameters
Before creating replication slots, configure the following server parameters:
Limit WAL retention for replication slots
Parameter: `max_slot_wal_keep_size`
Do not leave `max_slot_wal_keep_size` at -1 (the default), because that allows unbounded WAL growth when a replication slot lags or goes inactive. Set this parameter to a finite value appropriate for your workload.
Learn more about the `max_slot_wal_keep_size` parameter in the official PostgreSQL documentation.
Some managed cloud providers do not allow modification of this parameter and instead rely on built-in slot monitoring and auto-cleanup. Review the platform behavior before setting operational alerts.
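When tuning this parameter, it helps to watch how much WAL each slot is currently retaining relative to the server's write position. On PostgreSQL 13 and later:

```sql
-- Bytes of WAL retained per slot; a large or growing value indicates a lagging consumer
SELECT slot_name, active, wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```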
Configure replication slot capacity
Parameter: `max_replication_slots`
Each PostgreSQL database being replicated requires one logical replication slot. Set this parameter to at least the number of databases being replicated, plus any existing replication needs.
Configure WAL senders
Parameter: `max_wal_senders`
This parameter defines the maximum number of concurrent WAL sender processes that stream WAL data to subscribers. In most cases, you should have one WAL sender process for each replication slot to ensure efficient and consistent data replication.
Configure max_wal_senders to be at least equal to the number of replication slots in use, accounting for any other existing usage. It is recommended to set it slightly higher to provide operational flexibility.
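A sketch of checking and setting all three parameters on a self-managed server (the values below are placeholders; size them to your workload):

```sql
SHOW max_slot_wal_keep_size;
SHOW max_replication_slots;
SHOW max_wal_senders;

-- Placeholder values
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 12;

-- max_slot_wal_keep_size takes effect on configuration reload;
-- max_replication_slots and max_wal_senders require a server restart
SELECT pg_reload_conf();
```

On managed services, set these through the provider's parameter group or flags instead.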
Create a replication slot
Create a replication slot in each database that the Databricks ingestion gateway will use to track changes:
```sql
-- Create a replication slot with the pgoutput plugin
SELECT pg_create_logical_replication_slot('databricks_slot', 'pgoutput');
```
- Replication slots hold WAL data until it is consumed by the connector. Configure the `max_slot_wal_keep_size` parameter to limit WAL retention and prevent unbounded WAL growth. See Configure replication slot parameters for details.
- When you delete an ingestion pipeline, you must manually drop the associated replication slot. See Clean up replication slots.
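When you do need to remove a slot (for example, after deleting a pipeline), verify that it is inactive and drop it:

```sql
-- Confirm the slot is no longer in use
SELECT slot_name, active FROM pg_replication_slots WHERE slot_name = 'databricks_slot';

-- Release the WAL the slot retains; fails if the slot is still active
SELECT pg_drop_replication_slot('databricks_slot');
```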
Optional: Configure inline DDL tracking
Inline DDL tracking is an optional feature that allows the connector to automatically detect and apply schema changes from the source database. This feature is disabled by default.
Inline DDL tracking is currently in preview and requires contacting Databricks Support to enable it for your workspace.
For information about which schema changes are handled automatically and which require a full refresh, see How do managed connectors handle schema evolution? and Schema evolution.
Set up inline DDL tracking
If inline DDL tracking has been enabled for your workspace, complete these steps in each PostgreSQL database:
1. Download and run the lakeflow_pg_ddl_change_tracking.sql script:

   ```sql
   \i lakeflow_pg_ddl_change_tracking.sql
   ```

2. Verify that the triggers and audit table were created successfully:

   ```sql
   -- Check for the DDL audit table
   SELECT * FROM pg_tables WHERE tablename = 'lakeflow_ddl_audit';

   -- Check for the event triggers
   SELECT * FROM pg_event_trigger WHERE evtname LIKE 'lakeflow%';
   ```

3. Add the DDL audit table to your publication:

   ```sql
   ALTER PUBLICATION databricks_publication ADD TABLE public.lakeflow_ddl_audit;
   ```
Cloud-specific configuration notes
AWS RDS and Aurora
- Ensure that the `rds.logical_replication` parameter is set to `1` in the parameter group.
- Configure security groups to allow connections from the Databricks workspace.
- The replication user requires the `rds_replication` role:

  ```sql
  GRANT rds_replication TO databricks_replication;
  ```
Azure Database for PostgreSQL
- Enable logical replication in the server parameters through the Azure portal or CLI.
- Configure firewall rules to allow connections from the Databricks workspace.
- For Flexible Server, logical replication is supported. For Single Server, ensure you're using a supported tier.
GCP Cloud SQL for PostgreSQL
- Enable the `cloudsql.logical_decoding` flag in the instance settings.
- Configure authorized networks to allow connections from the Databricks workspace.
- Ensure that the `cloudsql.enable_pglogical` flag is set to `on` if you use pglogical extensions.
Verify the configuration
After completing the setup tasks, verify that logical replication is properly configured:
1. Check that `wal_level` is set to `logical`:

   ```sql
   SHOW wal_level;
   ```

2. Verify that the replication user has the `replication` privilege:

   ```sql
   SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'databricks_replication';
   ```

3. Confirm that the publication exists:

   ```sql
   SELECT * FROM pg_publication WHERE pubname = 'databricks_publication';
   ```

4. Verify that the replication slot exists:

   ```sql
   SELECT slot_name, slot_type, active, restart_lsn
   FROM pg_replication_slots
   WHERE slot_name = 'databricks_slot';
   ```

5. Check the replica identity for your tables:

   ```sql
   -- Join pg_class to pg_namespace so same-named tables in other schemas don't match
   SELECT n.nspname AS schemaname, c.relname AS tablename, c.relreplident
   FROM pg_class c
   JOIN pg_namespace n ON c.relnamespace = n.oid
   WHERE n.nspname = 'your_schema' AND c.relkind = 'r';
   ```

   The `relreplident` column shows `f` for FULL replica identity and `d` for DEFAULT.
Next steps
After completing the source setup, you can create an ingestion gateway and pipeline to ingest data from PostgreSQL. See Ingest data from PostgreSQL.