Configure PostgreSQL for ingestion into Databricks
The PostgreSQL connector for Lakeflow Connect is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.
This page describes the source setup tasks for ingestion from PostgreSQL into Databricks using Lakeflow Connect.
Logical replication for change data capture
The PostgreSQL connector uses logical replication to track changes in the source tables. Logical replication allows the connector to capture data modifications (inserts, updates, and deletes) without requiring triggers or significant overhead on the source database.
Logical replication with Lakeflow Connect requires the following:
- Lakeflow Connect supports data replication from PostgreSQL version 13 and later.
- Configure the database for logical replication: the PostgreSQL parameter `wal_level` must be set to `logical`.
- Create publications that include all tables you want to replicate.
- Create replication slots for each catalog that will be replicated. Publications must be created before creating replication slots.
For more information about logical replication, see the Logical Replication documentation on the PostgreSQL website.
Overview of source setup tasks
Complete the following tasks in PostgreSQL before ingesting data into Databricks:
- Verify that the source is PostgreSQL 13 or above.
- Configure network access (security groups, firewall rules, or VPN).
- Configure logical replication:
  - Enable logical replication (`wal_level = logical`).
  - Create a replication user with the required privileges. See PostgreSQL database user requirements.
  - Set the replica identity for tables. See Set replica identity for tables.
  - Create publications and replication slots.
Optional: Configure inline DDL tracking for automatic schema change detection. To opt in to inline DDL tracking, reach out to Databricks Support.
If you plan to replicate from multiple PostgreSQL databases, you must create a separate publication and replication slot for each database. The inline DDL tracking script (if used) must also be executed in each database.
Configure logical replication
To enable logical replication in PostgreSQL, configure the database settings and set up the necessary objects.
Set the WAL level to logical
The Write-Ahead Log (WAL) must be configured for logical replication. This setting typically requires a database restart.
1. Check the current `wal_level` setting:

   ```sql
   SHOW wal_level;
   ```

2. If the value is not `logical`, set `wal_level = logical` in the server configuration and restart the PostgreSQL service.
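On self-managed PostgreSQL, you can make this change from a SQL session instead of editing `postgresql.conf` directly; a minimal sketch (`ALTER SYSTEM` requires superuser privileges, and the restart is still required):

```sql
-- Writes the setting to postgresql.auto.conf; requires superuser
ALTER SYSTEM SET wal_level = 'logical';
-- wal_level only takes effect after a server restart
```

On managed services, set the equivalent parameter through the provider's parameter group or flags instead (see Cloud-specific configuration notes).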
Create a replication user
Create a dedicated user for Databricks ingestion with replication privileges:
```sql
CREATE USER databricks_replication WITH PASSWORD 'your_secure_password';
GRANT CONNECT ON DATABASE your_database TO databricks_replication;
ALTER USER databricks_replication WITH REPLICATION;
```
For detailed privilege requirements, see PostgreSQL database user requirements.
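In addition to the `REPLICATION` attribute, the ingestion user typically needs read access to the schemas and tables it will snapshot. The following is a sketch only, assuming your tables live in a schema named `schema_name`; see PostgreSQL database user requirements for the authoritative privilege list:

```sql
-- Hypothetical schema name; adjust to your environment
GRANT USAGE ON SCHEMA schema_name TO databricks_replication;
GRANT SELECT ON ALL TABLES IN SCHEMA schema_name TO databricks_replication;
-- Cover tables the current role creates in this schema later
ALTER DEFAULT PRIVILEGES IN SCHEMA schema_name GRANT SELECT ON TABLES TO databricks_replication;
```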
Set replica identity for tables
For each table you want to replicate, configure the replica identity. The correct setting depends on the table structure:
| Table structure | Required REPLICA IDENTITY | Command |
|---|---|---|
| Table has a primary key and does not contain TOASTable columns (for example, `TEXT`, `JSONB`, or `BYTEA`) | `DEFAULT` | `ALTER TABLE schema_name.table_name REPLICA IDENTITY DEFAULT;` |
| Table has a primary key but includes large variable-length (TOASTable) columns | `FULL` | `ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL;` |
| Table does not have a primary key | `FULL` | `ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL;` |
For more information about replica identity settings, see Replica Identity in the PostgreSQL documentation.
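To inspect what a table is currently set to before changing it, you can query `pg_class` directly (`schema_name.table_name` below is a placeholder):

```sql
-- relreplident codes: 'd' = DEFAULT, 'f' = FULL, 'n' = NOTHING, 'i' = USING INDEX
SELECT relreplident
FROM pg_class
WHERE oid = 'schema_name.table_name'::regclass;
```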
Create a publication
Create a publication in each database that includes the tables you want to replicate:
```sql
-- Create a publication for specific tables
CREATE PUBLICATION databricks_publication FOR TABLE schema_name.table1, schema_name.table2;

-- Or create a publication for all tables in a database
CREATE PUBLICATION databricks_publication FOR ALL TABLES;
```
You must create a separate publication in each PostgreSQL database that you want to replicate.
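After creating a publication, you can confirm which tables it covers and extend it later if needed; for example (`schema_name.table3` is a placeholder):

```sql
-- List the tables included in the publication
SELECT * FROM pg_publication_tables WHERE pubname = 'databricks_publication';

-- Add a table to an existing publication (not applicable to FOR ALL TABLES publications)
ALTER PUBLICATION databricks_publication ADD TABLE schema_name.table3;
```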
Configure replication slot parameters
Before creating replication slots, configure the following server parameters:
Limit WAL retention for replication slots
Parameter: `max_slot_wal_keep_size`
Do not leave `max_slot_wal_keep_size` at -1 (the default), because that allows unbounded WAL growth when a replication slot lags or goes inactive. Set this parameter to a finite value appropriate for your workload.
Learn more about the `max_slot_wal_keep_size` parameter in the official PostgreSQL documentation.
Some managed cloud providers do not allow modification of this parameter and instead rely on built-in slot monitoring and auto-cleanup. Review the platform behavior before setting operational alerts.
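When tuning this parameter, it helps to watch how much WAL each slot is currently retaining relative to the server's write position. On PostgreSQL 13 and later:

```sql
-- Bytes of WAL retained per slot; a large or growing value indicates a lagging consumer
SELECT slot_name, active, wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```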
Configure replication slot capacity
Parameter: `max_replication_slots`
Each PostgreSQL database being replicated requires one logical replication slot. Set this parameter to at least the number of databases being replicated, plus any existing replication needs.
Configure WAL senders
Parameter: `max_wal_senders`
This parameter defines the maximum number of concurrent WAL sender processes that stream WAL data to subscribers. In most cases, you should have one WAL sender process for each replication slot to ensure efficient and consistent data replication.
Configure max_wal_senders to be at least equal to the number of replication slots in use, accounting for any other existing usage. It is recommended to set it slightly higher to provide operational flexibility.
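A sketch of checking and setting all three parameters on a self-managed server (the values below are placeholders; size them to your workload):

```sql
SHOW max_slot_wal_keep_size;
SHOW max_replication_slots;
SHOW max_wal_senders;

-- Placeholder values
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 12;

-- max_slot_wal_keep_size takes effect on configuration reload;
-- max_replication_slots and max_wal_senders require a server restart
SELECT pg_reload_conf();
```

On managed services, set these through the provider's parameter group or flags instead.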
Create a replication slot
Create a replication slot in each database that the Databricks ingestion gateway will use to track changes:
```sql
-- Create a replication slot with the pgoutput plugin
SELECT pg_create_logical_replication_slot('databricks_slot', 'pgoutput');
```
- Replication slots hold WAL data until it is consumed by the connector. Configure the `max_slot_wal_keep_size` parameter to limit WAL retention and prevent unbounded WAL growth. See Configure replication slot parameters for details.
- When you delete an ingestion pipeline, you must manually drop the associated replication slot. See Clean up replication slots.
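When you do need to remove a slot (for example, after deleting a pipeline), verify that it is inactive and drop it:

```sql
-- Confirm the slot is no longer in use
SELECT slot_name, active FROM pg_replication_slots WHERE slot_name = 'databricks_slot';

-- Release the WAL the slot retains; fails if the slot is still active
SELECT pg_drop_replication_slot('databricks_slot');
```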
Optional: Configure inline DDL tracking
Inline DDL tracking is an optional feature that allows the connector to automatically detect and apply schema changes from the source database. This feature is disabled by default.
Inline DDL tracking is currently in preview and requires contacting Databricks Support to enable it for your workspace.
For information about which schema changes are handled automatically and which require a full refresh, see How do managed connectors handle schema evolution? and Schema evolution.
Set up inline DDL tracking
If inline DDL tracking has been enabled for your workspace, complete these steps in each PostgreSQL database:
1. Download and run the lakeflow_pg_ddl_change_tracking.sql script:

   ```sql
   \i lakeflow_pg_ddl_change_tracking.sql
   ```

2. Verify that the triggers and audit table were created successfully:

   ```sql
   -- Check for the DDL audit table
   SELECT * FROM pg_tables WHERE tablename = 'lakeflow_ddl_audit';

   -- Check for the event triggers
   SELECT * FROM pg_event_trigger WHERE evtname LIKE 'lakeflow%';
   ```

3. Add the DDL audit table to your publication:

   ```sql
   ALTER PUBLICATION databricks_publication ADD TABLE public.lakeflow_ddl_audit;
   ```
Cloud-specific configuration notes
AWS RDS and Aurora
- Ensure that the `rds.logical_replication` parameter is set to `1` in the parameter group.
- Configure security groups to allow connections from the Databricks workspace.
- The replication user requires the `rds_replication` role:

  ```sql
  GRANT rds_replication TO databricks_replication;
  ```
Azure Database for PostgreSQL
- Enable logical replication in the server parameters through the Azure portal or CLI.
- Configure firewall rules to allow connections from the Databricks workspace.
- For Flexible Server, logical replication is supported. For Single Server, ensure you're using a supported tier.
GCP Cloud SQL for PostgreSQL
- Enable the `cloudsql.logical_decoding` flag in the instance settings.
- Configure authorized networks to allow connections from the Databricks workspace.
- Ensure that the `cloudsql.enable_pglogical` flag is set to `on` if you use pglogical extensions.
Verify the configuration
After completing the setup tasks, verify that logical replication is properly configured:
1. Check that `wal_level` is set to `logical`:

   ```sql
   SHOW wal_level;
   ```

2. Verify that the replication user has the `replication` privilege:

   ```sql
   SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'databricks_replication';
   ```

3. Confirm that the publication exists:

   ```sql
   SELECT * FROM pg_publication WHERE pubname = 'databricks_publication';
   ```

4. Verify that the replication slot exists:

   ```sql
   SELECT slot_name, slot_type, active, restart_lsn
   FROM pg_replication_slots
   WHERE slot_name = 'databricks_slot';
   ```

5. Check the replica identity for your tables:

   ```sql
   -- Join pg_class to pg_namespace so same-named tables in other schemas don't match
   SELECT n.nspname AS schemaname, c.relname AS tablename, c.relreplident
   FROM pg_class c
   JOIN pg_namespace n ON c.relnamespace = n.oid
   WHERE n.nspname = 'your_schema' AND c.relkind = 'r';
   ```

   The `relreplident` column shows `f` for FULL replica identity and `d` for DEFAULT.
Next steps
After completing the source setup, you can create an ingestion gateway and pipeline to ingest data from PostgreSQL. See Ingest data from PostgreSQL.