Query-based connectors

Preview

This feature is in Public Preview.

Query-based connectors in Lakeflow Connect ingest data from databases by querying the source directly, without requiring change data capture (CDC) configuration. Instead of relying on binlogs or CDC infrastructure, they use a cursor column—a monotonically increasing timestamp or integer column—to track which rows are new or updated since the last pipeline run.

Query-based connectors use Unity Catalog connections and Lakehouse Federation to connect to source databases, and they write results to streaming tables.

How it works

On each pipeline run, a query-based connector queries the source database and retrieves all rows with a cursor column value that is greater than the value recorded from the previous run. The connector stores the high-water mark of the cursor column after each successful run and uses it as the lower bound on the next run.

Because the connector queries the source directly, it doesn't require an ingestion gateway or a staging volume. The pipeline runs on a schedule you define, not continuously.
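The run-over-run mechanics can be sketched in plain Python against a toy SQLite source. The table layout and function below are illustrative only, not the connector's actual implementation:

```python
import sqlite3

def incremental_sync(src: sqlite3.Connection, dst: sqlite3.Connection,
                     table: str, cursor_column: str, last_mark: int) -> int:
    """Fetch rows with cursor values above last_mark and return the new high-water mark."""
    rows = src.execute(
        f"SELECT id, {cursor_column}, payload FROM {table} "
        f"WHERE {cursor_column} > ? ORDER BY {cursor_column}",
        (last_mark,),
    ).fetchall()
    if not rows:
        return last_mark  # nothing new; keep the previous mark
    dst.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    dst.commit()
    # The max cursor value becomes the lower bound for the next run.
    return max(r[1] for r in rows)
```

Each scheduled run re-invokes this logic with the mark recorded by the previous successful run, so only new or updated rows are read from the source.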

Query-based connectors compared to CDC database connectors

Query-based connectors differ from CDC database connectors in the following ways:

  • No ingestion gateway: CDC connectors require a gateway to capture binlog events. Query-based connectors don't use a gateway.
  • No staging volume: CDC connectors buffer extracted data in a staging volume. Query-based connectors write directly from the source query to the destination table.
  • Scheduled instead of continuous: Query-based connectors run on a schedule and capture only the latest state of each row that changed between runs, not every intermediate row state.
  • Broader source compatibility: Any database with a suitable cursor column is a valid source, even if it doesn't support CDC or binlog access.

The trade-off is that ingestion queries run directly against source tables, which can be slower and can place more load on the source database than CDC connectors that read the binlog. Soft-deletion tracking is supported using the deletion_condition parameter, and hard-deletion tracking is supported in Beta; both require API configuration.
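As a rough, hypothetical illustration of the soft-delete idea (deletion_condition itself is configured through the API; the names and data structures below are invented for this sketch), a predicate evaluated over changed source rows can tombstone matching rows in the destination instead of dropping them:

```python
from dataclasses import dataclass

@dataclass
class Row:
    id: int
    status: str

def apply_soft_deletes(dest: dict, changed: list, deletion_condition) -> None:
    """Merge changed rows; rows matching deletion_condition are marked deleted."""
    for row in changed:
        if deletion_condition(row):
            # Soft delete: keep the row in the destination but flag it deleted.
            dest[row.id] = {"status": row.status, "_deleted": True}
        else:
            dest[row.id] = {"status": row.status, "_deleted": False}
```

Because the connector sees only the latest row state, a soft-delete predicate like this works on whatever the source query returns; it cannot observe rows that were hard-deleted between runs, which is why hard-deletion tracking is handled separately.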

Supported ingestion approaches

Query-based connectors support multiple ingestion approaches. The approach you use determines which configuration parameters are required.

Approach: Foreign connection ingestion

  • How it connects: Uses a connection that stores authentication credentials for the source database. The connector uses the connection to query the source database directly.
  • Required parameters: connection_name, source_catalog, source_schema, source_table, cursor_column

Approach: Foreign catalog ingestion

  • How it connects: Uses a foreign catalog backed by a Lakehouse Federation data source. The connector uses the foreign catalog to read source data instead of connecting to the source database directly.
  • Required parameters: ingest_from_uc_foreign_catalog: true, cursor_columns, primary_keys (required unless you use APPEND_ONLY mode)
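A hypothetical sketch of the two parameter sets side by side. The dict layout and example values are invented for illustration; only the parameter names come from the approaches above:

```python
# Foreign connection ingestion: query the source database directly
# through a Unity Catalog connection (example values are made up).
FOREIGN_CONNECTION_SPEC = {
    "connection_name": "my_oracle_conn",
    "source_catalog": "SALES",
    "source_schema": "OPS",
    "source_table": "ORDERS",
    "cursor_column": "UPDATED_AT",
}

# Foreign catalog ingestion: read through a Lakehouse Federation
# foreign catalog instead of connecting to the source directly.
FOREIGN_CATALOG_SPEC = {
    "ingest_from_uc_foreign_catalog": True,
    "cursor_columns": ["updated_at"],
    "primary_keys": ["order_id"],  # required unless you use APPEND_ONLY mode
}

REQUIRED_FOREIGN_CONNECTION = {
    "connection_name", "source_catalog", "source_schema",
    "source_table", "cursor_column",
}

def missing_params(spec: dict) -> set:
    """Return the required foreign-connection parameters absent from a spec."""
    return REQUIRED_FOREIGN_CONNECTION - spec.keys()
```

A small validation helper like missing_params can catch incomplete specs before a pipeline run rather than at query time.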

Supported sources

The following database sources are supported.

Foreign connection ingestion sources:

  • Oracle
  • Teradata
  • SQL Server
  • MySQL
  • MariaDB
  • PostgreSQL

Foreign catalog ingestion sources:

All Lakehouse Federation data sources are supported using foreign catalog ingestion. For the full list, see Lakehouse Federation.

Supported interfaces

You can use the Databricks UI or Declarative Automation Bundles to create query-based pipelines.

Compute requirements

Query-based ingestion pipelines run on serverless compute by default. Classic compute is supported in Beta, but only using APIs. Databricks recommends using serverless compute.

To use query-based connectors with serverless compute, your compute environment must allow network connectivity to the source database. See Networking and Networking recommendations for Lakehouse Federation.

History tracking (SCD) modes

Query-based connectors support the following history tracking modes—also known as slowly changing dimension (SCD) modes—for destination tables:

  • SCD_TYPE_1: Overwrites the existing row in the destination table with the latest source row. No history is preserved.
  • SCD_TYPE_2: Preserves the full history of row changes by adding new rows with version metadata. See Enable history tracking (SCD type 2).
  • APPEND_ONLY: Appends every ingested row to the destination table without merging or overwriting.
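A minimal sketch of how the three modes treat an incoming row, using invented helper names and a plain Python list as a stand-in for the destination table:

```python
def apply_change(dest: list, change: dict, mode: str, key: str = "id") -> None:
    """Apply one source row to the destination under a given SCD mode (sketch)."""
    if mode == "APPEND_ONLY":
        dest.append(dict(change))  # every ingested row is appended as-is
        return
    # Find the current (open) version of this key, if any.
    current = next(
        (r for r in dest if r[key] == change[key] and not r.get("_ended")), None
    )
    if mode == "SCD_TYPE_1":
        if current:
            current.update(change)  # overwrite in place; no history kept
        else:
            dest.append(dict(change))
    elif mode == "SCD_TYPE_2":
        if current:
            current["_ended"] = True  # close out the previous version
        dest.append({**change, "_ended": False})  # add the new version row
```

In the real connector the SCD type 2 version metadata is richer than a single flag, but the shape is the same: type 1 keeps one row per key, type 2 keeps one row per version, and APPEND_ONLY keeps one row per ingested record.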

Schema evolution

Query-based connectors handle schema evolution the same way as other managed connectors in Lakeflow Connect. See How do managed connectors handle schema evolution?.