PostgreSQL connector limitations
Preview
The PostgreSQL connector for Lakeflow Connect is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.
This page lists the limitations and considerations for PostgreSQL ingestion using Databricks Lakeflow Connect.
General database connector limitations
The limitations in this section apply to all database connectors in Lakeflow Connect. Keep reading for connector-specific limitations.
- When you run a scheduled pipeline, alerts don't trigger immediately. Instead, they trigger when the next update runs.
- When a source table is deleted, the destination table is not automatically deleted. You must delete the destination table manually. This differs from the behavior of Lakeflow Spark Declarative Pipelines.
- During source maintenance periods, Databricks might not be able to access your data.
- If a source table name conflicts with an existing destination table name, the pipeline update fails.
- Multi-destination pipeline support is API-only.
- You can optionally rename a table that you ingest. If you rename a table in your pipeline, it becomes an API-only pipeline, and you can no longer edit the pipeline in the UI.
- If you select a column after a pipeline has already started, the connector does not automatically backfill data for the new column. To ingest historical data, manually run a full refresh on the table.
- Databricks can't ingest two or more tables with the same name in the same pipeline, even if they come from different source schemas.
- The connector assumes that cursor columns in the source system are monotonically increasing.
- With SCD type 1 enabled, deletes don't produce an explicit delete event in the change data feed. For auditable deletions, use SCD type 2 if the connector supports it. For details, see Example: SCD type 1 and SCD type 2 processing with CDF source data.
- The connector only supports replication from primary PostgreSQL instances.
Authentication
- The connector only supports username and password authentication.
Database variations
- The connector supports PostgreSQL 13 or above.
- The connector supports AWS RDS PostgreSQL, Aurora PostgreSQL, Amazon EC2, Azure Database for PostgreSQL, Azure virtual machines, and GCP Cloud SQL for PostgreSQL. The connector also supports on-premises PostgreSQL using Azure ExpressRoute, AWS Direct Connect, and VPN networking.
Pipelines
- Each ingestion pipeline must be associated with exactly one ingestion gateway.
- Although the ingestion pipeline runs on serverless compute, the ingestion gateway must run on classic compute.
- The ingestion gateway must run in continuous mode to prevent Write-Ahead Log (WAL) bloat and replication slot accumulation.
Replication
- Logical replication requires PostgreSQL 13 or above with the wal_level parameter set to logical.
- Each table that you replicate must have its replica identity set to FULL or DEFAULT. Databricks recommends using FULL replica identity for tables without primary keys or with TOASTable columns.
- Replication slots are not automatically removed when you delete pipelines. You must manually clean up replication slots to prevent WAL accumulation. See Clean up replication slots. A verification sketch follows this list.
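The following is a minimal sketch of how you might verify these settings on the source database, assuming a hypothetical table public.orders and a hypothetical replication slot named my_slot:

```sql
-- Confirm that logical replication is enabled (must return 'logical').
SHOW wal_level;

-- Set the replica identity for a table without a primary key (hypothetical table name).
ALTER TABLE public.orders REPLICA IDENTITY FULL;

-- List replication slots and how much WAL each one is retaining.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Drop a slot that is no longer needed after deleting a pipeline (hypothetical slot name).
SELECT pg_drop_replication_slot('my_slot');
```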
Schema evolution
The connector automatically handles new and deleted columns.
- When a new column appears in the source, Databricks automatically ingests it on the next pipeline run.
- When a column is deleted from the source, Databricks does not delete it automatically. Instead, the connector uses a table property to set the deleted column to inactive in the destination. If another column later appears with a name that conflicts with the inactive column, the pipeline fails. In this case, run a full refresh of the table or manually drop the inactive column.
The connector can handle the following DDL changes (for example, column renames), but they require a full refresh of the affected target tables. Example statements appear after the list.
DDLs requiring a full refresh
- Changing the data type of a column
- Renaming a column
- Changing the primary key of a table
- Converting a table from unlogged to logged or vice versa
- Adding or removing partitions (for partitioned tables)
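For illustration, source-side statements like the following (hypothetical table, column, and constraint names) fall into these categories and therefore require a full refresh of the corresponding target table:

```sql
-- Change a column's data type.
ALTER TABLE public.orders ALTER COLUMN amount TYPE numeric(12, 2);

-- Rename a column.
ALTER TABLE public.orders RENAME COLUMN amount TO order_amount;

-- Change the primary key.
ALTER TABLE public.orders DROP CONSTRAINT orders_pkey, ADD PRIMARY KEY (order_id, created_at);

-- Convert an unlogged table to logged.
ALTER TABLE public.orders SET LOGGED;
```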
Staging
The staging catalog cannot be a foreign catalog.
Tables
- Databricks recommends ingesting 250 or fewer tables per pipeline. However, there is no limit on the number of rows or columns that are supported within these tables.
- Databricks cannot ingest two tables whose names differ only in case (for example, mytable and MyTable) using one pipeline. To support such cases, create two gateway-ingestion pipeline pairs that publish to different target schemas.
- The source_catalog, source_schema, and source_table names are case-sensitive for the database name but case-insensitive for the schema and table names (following PostgreSQL default behavior). For example, if the source database is named MyDatabase, you must specify it as MyDatabase in the ingestion_definition.
- Although you can ingest from multiple source databases or schemas in one pipeline, you cannot ingest two tables of the same name. For example, you cannot ingest both schema1.orders and schema2.orders in the same pipeline.
- You can include multiple table or schema specifications in the objects field of the ingestion_definition. However, the source table names in different source schemas cannot overlap. Overlapping names result in ingestion pipeline failure.
Data types
- User-defined types and third-party extension types are ingested as strings.
- The TIME and TIMETZ data types are ingested as strings.
- Any PostgreSQL built-in data type not listed in the Automatic data transformations table is ingested as a string.
- For the numeric data type, NaN is converted to null.
- For date and timestamp types, BC dates and dates after 9999 AD are not supported.
- Infinity is not supported for date, timestamp, or interval values. Example queries for identifying affected rows follow this list.
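As a sketch, queries like the following can help you find source rows that hit these limits before ingestion (table and column names are hypothetical):

```sql
-- Numeric NaN values are converted to null in the destination.
SELECT id FROM public.measurements WHERE reading = 'NaN'::numeric;

-- BC dates, dates after 9999 AD, and infinity are not supported.
SELECT id FROM public.events WHERE occurred_on < DATE '0001-01-01';
SELECT id FROM public.events WHERE occurred_on > DATE '9999-12-31';
SELECT id FROM public.events WHERE occurred_on IN ('infinity'::date, '-infinity'::date);
```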
Partitioned tables
- PostgreSQL partitioned tables are supported.
- Each partition is treated as a separate table for replication purposes.
- Adding or removing partitions requires a full refresh of the table.
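Because each partition is replicated as its own table, it can help to list the partitions of a partitioned source table before configuring ingestion. A minimal sketch, assuming a hypothetical partitioned table public.orders:

```sql
-- List the partitions of a partitioned table; each one is treated as a separate table for replication.
SELECT inhrelid::regclass AS partition_name
FROM pg_inherits
WHERE inhparent = 'public.orders'::regclass;
```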
Limitations for specific PostgreSQL variations
Amazon RDS and Aurora
- The rds.logical_replication parameter must be set to 1.
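After the DB parameter group change is applied, you can confirm from a SQL session that logical replication is in effect (a sketch):

```sql
-- With rds.logical_replication set to 1 in the DB parameter group,
-- the effective WAL level should be 'logical'.
SHOW wal_level;
```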
Azure Database for PostgreSQL
- Logical replication must be enabled in the server parameters.
- For Single Server deployments, logical replication is only available in the General Purpose and Memory Optimized tiers.
- For Flexible Server deployments, logical replication is supported on all tiers.
- The max_slot_wal_keep_size parameter is read-only, fixed at -1 (infinite), and cannot be configured.
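A sketch of how to confirm the effective settings from a SQL session:

```sql
-- Logical replication must be enabled in the server parameters.
SHOW wal_level;                 -- expected: logical

-- Azure fixes this parameter at -1 (infinite WAL retention for slots).
SHOW max_slot_wal_keep_size;    -- expected: -1
```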
Google Cloud SQL for PostgreSQL
- The cloudsql.logical_decoding flag must be enabled.
- Cloud SQL does not allow configuring max_slot_wal_keep_size; it is fixed at -1 (infinite) by default.
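A sketch of how to confirm the effective settings from a SQL session:

```sql
-- With the cloudsql.logical_decoding flag enabled, the effective WAL level should be 'logical'.
SHOW wal_level;

-- Cloud SQL fixes this parameter at -1 (infinite WAL retention for slots).
SHOW max_slot_wal_keep_size;
```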