Query-based connectors

Preview

This feature is in Public Preview.

Query-based connectors in Lakeflow Connect ingest data from databases by querying the source directly, without requiring change data capture (CDC) configuration. Instead of relying on binlogs or CDC infrastructure, they use a cursor column—a monotonically increasing timestamp or integer column—to track which rows are new or updated since the last pipeline run.

Query-based connectors use Unity Catalog connections and Lakehouse Federation to connect to source databases, and they write results to streaming tables.

How it works

On each pipeline run, a query-based connector queries the source database and retrieves all rows with a cursor column value that is greater than the value recorded from the previous run. The connector stores the high-water mark of the cursor column after each successful run and uses it as the lower bound on the next run.

Because the connector queries the source directly, it doesn't require an ingestion gateway or a staging volume. The pipeline runs on a schedule you define, not continuously.
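The run-over-run mechanics can be sketched in plain Python against a toy SQLite source. The table layout and function below are illustrative only, not the connector's actual implementation:

```python
import sqlite3

def incremental_sync(src: sqlite3.Connection, dst: sqlite3.Connection,
                     table: str, cursor_column: str, last_mark: int) -> int:
    """Fetch rows with cursor values above last_mark and return the new high-water mark."""
    rows = src.execute(
        f"SELECT id, {cursor_column}, payload FROM {table} "
        f"WHERE {cursor_column} > ? ORDER BY {cursor_column}",
        (last_mark,),
    ).fetchall()
    if not rows:
        return last_mark  # nothing new; keep the previous mark
    dst.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    dst.commit()
    # The max cursor value becomes the lower bound for the next run.
    return max(r[1] for r in rows)
```

Each scheduled run re-invokes this logic with the mark recorded by the previous successful run, so only new or updated rows are read from the source.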

Query-based connectors compared to CDC database connectors

Query-based connectors differ from CDC database connectors in the following ways:

  • No ingestion gateway: CDC connectors require a gateway to capture binlog events. Query-based connectors don't use a gateway.
  • No staging volume: CDC connectors buffer extracted data in a staging volume. Query-based connectors write directly from the source query to the destination table.
  • Scheduled instead of continuous: Query-based connectors run on a schedule and capture only the latest state of each row that changed between runs, not every intermediate row state.
  • Broader source compatibility: Any database with a suitable cursor column is a valid source, even if it doesn't support CDC or binlog access.

The trade-off is that ingestion queries run directly against source tables, which can be slower and can place more load on the source database than CDC connectors that read the binlog. Soft-deletion tracking is supported using the deletion_condition parameter, and hard-deletion tracking is supported in Beta; both require API configuration.
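As a rough, hypothetical illustration of the soft-delete idea (deletion_condition itself is configured through the API; the names and data structures below are invented for this sketch), a predicate evaluated over changed source rows can tombstone matching rows in the destination instead of dropping them:

```python
from dataclasses import dataclass

@dataclass
class Row:
    id: int
    status: str

def apply_soft_deletes(dest: dict, changed: list, deletion_condition) -> None:
    """Merge changed rows; rows matching deletion_condition are marked deleted."""
    for row in changed:
        if deletion_condition(row):
            # Soft delete: keep the row in the destination but flag it deleted.
            dest[row.id] = {"status": row.status, "_deleted": True}
        else:
            dest[row.id] = {"status": row.status, "_deleted": False}
```

Because the connector sees only the latest row state, a soft-delete predicate like this works on whatever the source query returns; it cannot observe rows that were hard-deleted between runs, which is why hard-deletion tracking is handled separately.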

Supported ingestion approaches

Query-based connectors support multiple ingestion approaches. The approach you use determines which configuration parameters are required.

Approach: Foreign connection ingestion

  • How it connects: Uses a connection that stores authentication credentials for the source database. The connector uses the connection to query the source database directly.
  • Required parameters: connection_name, source_catalog, source_schema, source_table, cursor_column

Approach: Foreign catalog ingestion

  • How it connects: Uses a foreign catalog backed by a Lakehouse Federation data source. The connector uses the foreign catalog to read source data instead of connecting to the source database directly.
  • Required parameters: ingest_from_uc_foreign_catalog: true, cursor_columns, primary_keys (required unless you use APPEND_ONLY mode)
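A hypothetical sketch of the two parameter sets side by side. The dict layout and example values are invented for illustration; only the parameter names come from the approaches above:

```python
# Foreign connection ingestion: query the source database directly
# through a Unity Catalog connection (example values are made up).
FOREIGN_CONNECTION_SPEC = {
    "connection_name": "my_oracle_conn",
    "source_catalog": "SALES",
    "source_schema": "OPS",
    "source_table": "ORDERS",
    "cursor_column": "UPDATED_AT",
}

# Foreign catalog ingestion: read through a Lakehouse Federation
# foreign catalog instead of connecting to the source directly.
FOREIGN_CATALOG_SPEC = {
    "ingest_from_uc_foreign_catalog": True,
    "cursor_columns": ["updated_at"],
    "primary_keys": ["order_id"],  # required unless you use APPEND_ONLY mode
}

REQUIRED_FOREIGN_CONNECTION = {
    "connection_name", "source_catalog", "source_schema",
    "source_table", "cursor_column",
}

def missing_params(spec: dict) -> set:
    """Return the required foreign-connection parameters absent from a spec."""
    return REQUIRED_FOREIGN_CONNECTION - spec.keys()
```

A small validation helper like missing_params can catch incomplete specs before a pipeline run rather than at query time.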

Supported sources

The following database sources are supported.

Foreign connection ingestion sources:

  • Oracle
  • Teradata
  • SQL Server
  • MySQL
  • MariaDB
  • PostgreSQL

Foreign catalog ingestion sources:

All Lakehouse Federation data sources are supported using foreign catalog ingestion. For the full list, see Lakehouse Federation.

Supported interfaces

You can use the Databricks UI or Declarative Automation Bundles to create query-based pipelines.

Compute requirements

Query-based ingestion pipelines run on serverless compute by default. Classic compute is supported in Beta, but only using APIs. Databricks recommends using serverless compute.

To use query-based connectors with serverless compute, your compute environment must allow network connectivity to the source database. See Networking and Networking recommendations for Lakehouse Federation.

History tracking (SCD) modes

Query-based connectors support the following history tracking modes—also known as slowly changing dimension (SCD) modes—for destination tables:

  • SCD_TYPE_1: Overwrites the existing row in the destination table with the latest source row. No history is preserved.
  • SCD_TYPE_2: Preserves the full history of row changes by adding new rows with version metadata. See Enable history tracking (SCD type 2).
  • APPEND_ONLY: Appends every ingested row to the destination table without merging or overwriting.
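A minimal sketch of how the three modes treat an incoming row, using invented helper names and a plain Python list as a stand-in for the destination table:

```python
def apply_change(dest: list, change: dict, mode: str, key: str = "id") -> None:
    """Apply one source row to the destination under a given SCD mode (sketch)."""
    if mode == "APPEND_ONLY":
        dest.append(dict(change))  # every ingested row is appended as-is
        return
    # Find the current (open) version of this key, if any.
    current = next(
        (r for r in dest if r[key] == change[key] and not r.get("_ended")), None
    )
    if mode == "SCD_TYPE_1":
        if current:
            current.update(change)  # overwrite in place; no history kept
        else:
            dest.append(dict(change))
    elif mode == "SCD_TYPE_2":
        if current:
            current["_ended"] = True  # close out the previous version
        dest.append({**change, "_ended": False})  # add the new version row
```

In the real connector the SCD type 2 version metadata is richer than a single flag, but the shape is the same: type 1 keeps one row per key, type 2 keeps one row per version, and APPEND_ONLY keeps one row per ingested record.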

Schema evolution

Query-based connectors handle schema evolution the same way as other managed connectors in Lakeflow Connect. See How do managed connectors handle schema evolution?.