
PostgreSQL connector limitations

Preview

The PostgreSQL connector for Lakeflow Connect is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.

This page lists the limitations and considerations for PostgreSQL ingestion using Databricks Lakeflow Connect.

General database connector limitations

The limitations in this section apply to all database connectors in Lakeflow Connect. Keep reading for connector-specific limitations.

  • When you run a scheduled pipeline, alerts don't trigger immediately. Instead, they trigger when the next update runs.
  • When a source table is deleted, the destination table is not automatically deleted. You must delete the destination table manually. This behavior differs from Lakeflow Spark Declarative Pipelines.
  • During source maintenance periods, Databricks might not be able to access your data.
  • If a source table name conflicts with an existing destination table name, the pipeline update fails.
  • Multi-destination pipeline support is API-only.
  • You can optionally rename a table that you ingest. If you rename a table in your pipeline, it becomes an API-only pipeline, and you can no longer edit the pipeline in the UI.
  • If you select a column after a pipeline has already started, the connector does not automatically backfill data for the new column. To ingest historical data, manually run a full refresh on the table.
  • Databricks can't ingest two or more tables with the same name in the same pipeline, even if they come from different source schemas.
  • The connector assumes that cursor columns in the source system are monotonically increasing.
  • Managed ingestion pipelines aren't supported for workspaces in AWS GovCloud regions (FedRAMP High).
  • Managed ingestion pipelines aren't supported for FedRAMP Moderate workspaces in the us-east-2 or us-west-1 regions.
  • The connector only supports replication from primary PostgreSQL instances.

Authentication

  • The connector only supports username and password authentication.
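
For illustration only, a minimal sketch of creating a PostgreSQL login role with a password that the connector could authenticate as. The role name, password, and schema are hypothetical; managed services such as RDS and Aurora grant replication privileges differently (for example, through the rds_replication role) rather than via the REPLICATION attribute.

    -- Illustrative only: a login role for the connector to authenticate as.
    -- Adjust the role name, password, and schema to your environment.
    CREATE ROLE lakeflow_ingest WITH LOGIN REPLICATION PASSWORD 'change-me';

    -- Allow the role to read the tables you plan to ingest.
    GRANT USAGE ON SCHEMA public TO lakeflow_ingest;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO lakeflow_ingest;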

Database variations

  • The connector supports PostgreSQL 13 or above.
  • The connector supports AWS RDS PostgreSQL, Aurora PostgreSQL, Amazon EC2, Azure Database for PostgreSQL, Azure virtual machines, and GCP Cloud SQL for PostgreSQL. The connector also supports on-premises PostgreSQL using Azure ExpressRoute, AWS Direct Connect, and VPN networking.

Pipelines

  • Each ingestion pipeline must be associated with exactly one ingestion gateway.
  • Although the ingestion pipeline runs on serverless compute, the ingestion gateway must run on classic compute.
  • The ingestion gateway must run in continuous mode to prevent Write-Ahead Log (WAL) bloat and replication slot accumulation.
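
Because a stopped or lagging gateway causes the source to retain WAL, it can help to monitor how much WAL each replication slot is holding. A minimal sketch using standard PostgreSQL catalog views:

    -- Show how much WAL each replication slot retains on the source.
    -- Large or steadily growing values suggest the gateway is stopped or falling behind.
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;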

Replication

  • For AWS RDS and Aurora, set the rds.logical_replication parameter to 1.
  • Each table that you replicate must have its replica identity set to FULL or DEFAULT. Databricks recommends using FULL replica identity for tables without primary keys or with TOAST-able columns, as shown in the example after this list.
  • Replication slots are not automatically removed when you delete pipelines. You must manually clean up replication slots to prevent WAL accumulation. See Clean up replication slots.
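
For example, setting the replica identity and dropping a leftover replication slot use standard PostgreSQL commands. The table and slot names below are illustrative:

    -- Set FULL replica identity on a table without a primary key or with TOAST-able columns.
    ALTER TABLE public.orders REPLICA IDENTITY FULL;

    -- Verify the setting: 'f' = FULL, 'd' = DEFAULT.
    SELECT relname, relreplident FROM pg_class WHERE relname = 'orders';

    -- After deleting a pipeline, drop its replication slot so WAL stops accumulating.
    -- The slot must be inactive before it can be dropped.
    SELECT pg_drop_replication_slot('lakeflow_slot_name');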

Schema evolution

The connector automatically handles new and deleted columns.

  • When a new column appears in the source, Databricks automatically ingests it on the next pipeline run.
  • When a column is deleted from the source, Databricks does not delete it automatically. Instead, the connector uses a table property to set the deleted column to inactive in the destination. If another column later appears that has a conflicting name with the inactive column, the pipeline fails. In this case, run a full refresh of the table or manually drop the inactive column.

The connector can handle the following DDL changes (for example, column renames), but they require a full refresh of the target tables. Illustrative examples follow the list.

DDL changes that require a full refresh

  • Changing the data type of a column
  • Renaming a column
  • Changing the primary key of a table
  • Converting a table from unlogged to logged or vice versa
  • Adding or removing partitions (for partitioned tables)
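
As a rough illustration, the following source-side statements are the kinds of DDL changes listed above. Table, column, and partition names are hypothetical.

    ALTER TABLE public.orders ALTER COLUMN amount TYPE numeric(12, 2);   -- change a column's data type
    ALTER TABLE public.orders RENAME COLUMN amount TO total_amount;      -- rename a column
    ALTER TABLE public.orders
      DROP CONSTRAINT orders_pkey,
      ADD PRIMARY KEY (order_id, created_at);                            -- change the primary key
    ALTER TABLE public.audit_log SET LOGGED;                             -- convert unlogged to logged
    ALTER TABLE public.events DETACH PARTITION events_2023;              -- remove a partition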

Staging

The staging catalog cannot be a foreign catalog.

Tables

  • Databricks recommends ingesting 250 or fewer tables per pipeline. However, there is no limit on the number of rows or columns that are supported within these tables.
  • Databricks cannot ingest two tables whose names differ only in case (for example, mytable and MyTable) using one pipeline. To support such cases, create two gateway-ingestion pipeline pairs that publish to different target schemas.
  • In the ingestion_definition, the source_catalog (database) name is case-sensitive, while the source_schema and source_table names are case-insensitive, following PostgreSQL default behavior. For example, if the source database is named MyDatabase, you must specify it as MyDatabase in the ingestion_definition.
  • Although you can ingest from multiple source databases or schemas in one pipeline, you cannot ingest two tables of the same name. For example, you cannot ingest both schema1.orders and schema2.orders in the same pipeline.
  • You can include multiple table or schema specifications in the objects field of the ingestion_definition. However, the source table names in different source schemas cannot overlap. Overlapping names result in ingestion pipeline failure.
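
Before defining a pipeline, you can check the source for table names that would conflict under these rules. A minimal sketch; the schema list is illustrative:

    -- Find table names that appear more than once across the schemas you plan to ingest,
    -- including names that differ only in case. Such tables cannot share one pipeline.
    SELECT lower(table_name) AS normalized_name,
           array_agg(table_schema || '.' || table_name) AS conflicting_tables
    FROM information_schema.tables
    WHERE table_schema IN ('schema1', 'schema2')
    GROUP BY lower(table_name)
    HAVING count(*) > 1;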

Data types

  • User-defined types and third-party extension types are ingested as strings.
  • The TIME and TIMETZ data types are ingested as strings.
  • Any PostgreSQL built-in data type not listed in the Automatic data transformations table is ingested as a string.
  • For the numeric data type, NaN values are converted to null.
  • For date and timestamp types, BC dates and dates after 9999 AD are not supported.
  • Infinity values are not supported for date, timestamp, or interval types.
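
To see which source columns fall into the string-mapped categories above, you can query the information schema. A minimal sketch; the schema name is illustrative:

    -- List columns with TIME/TIMETZ or user-defined types, which are ingested as strings.
    SELECT table_schema, table_name, column_name, data_type, udt_name
    FROM information_schema.columns
    WHERE table_schema = 'public'
      AND data_type IN ('time without time zone', 'time with time zone', 'USER-DEFINED');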

Partitioned tables

  • PostgreSQL partitioned tables are supported.
  • Each partition is treated as a separate table for replication purposes.
  • Adding or removing partitions requires a full refresh of the table.
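
Because each partition is replicated as its own table, it can help to list a table's partitions before ingesting it. A minimal sketch using the built-in pg_partition_tree function; the table name is illustrative:

    -- List all partitions of a partitioned table; each leaf is replicated separately.
    SELECT relid, parentrelid, isleaf, level
    FROM pg_partition_tree('public.events');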

Limitations for specific PostgreSQL variations

Amazon RDS and Aurora

  • The rds.logical_replication parameter must be set to 1.
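
The rds.logical_replication parameter is set through a DB parameter group rather than SQL, but you can verify the effective settings from a SQL session. For example:

    SHOW rds.logical_replication;  -- expect 'on' once the parameter group change is applied
    SHOW wal_level;                -- expect 'logical'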

Azure Database for PostgreSQL

  • Logical replication must be enabled in the server parameters.
  • For Single Server deployments, logical replication is only available in the General Purpose and Memory Optimized tiers.
  • For Flexible Server deployments, logical replication is supported on all tiers.
  • The max_slot_wal_keep_size parameter is read-only, fixed at -1 (infinite), and cannot be configured.

Google Cloud SQL for PostgreSQL

  • The cloudsql.logical_decoding flag must be enabled.
  • Cloud SQL does not allow configuring max_slot_wal_keep_size; it is fixed at -1 (infinite).