Configure SQL Server for ingestion into Databricks

Preview

LakeFlow Connect is in gated Public Preview. To participate in the preview, contact your Databricks account team.

This article provides an overview of the prerequisite source setup decisions and tasks required for ingestion from SQL Server into Databricks using LakeFlow Connect.

Change tracking vs. change data capture

Databricks requires either Microsoft change tracking or Microsoft change data capture (CDC) to extract data from SQL Server.

  • Change tracking captures the fact that rows in a table have changed, but doesn’t capture the actual operations.

  • Change data capture captures every operation on a table.

The following table provides considerations to help you choose between change tracking and CDC:

Change tracking

Change data capture

Captures the fact that rows in a table have changed, but doesn’t capture the actual changes or the number of changes that were made.

Captures the fact that rows in a table were changed and the actual changes.

Low CPU and storage overhead.

High CPU and storage overhead.

If a table has a primary key, Databricks recommends using CT for optimal performance.

When no primary key is present, CDC must be used.

TRUNCATE DDLs are not blocked.

Blocks TRUNCATE DDLs.

Definitive operations are not captured. For example, a delete > insert > update sequence might be recorded as an insert.

Captures each insert, update, and delete operation with the full image.

No native support for DDL. Our DDL support occasionally requires reinitialization of a table.

Limited native support for DDL, but will still require reinitialization of tables in some cases.

For more information about these options, see Track data changes (SQL Server) in the SQL Server documentation.

Overview of source setup tasks

You must complete the following tasks in SQL Server before ingesting data into Databricks:

  • Verify that you meet the SQL Server version requirements:

    • To use change tracking, you must have SQL Server 2012 or above.

    • To use CDC, you must have SQL Server 2012 or above. Versions earlier than SQL Server 2016 additionally require Enterprise edition.

  • Create a database user in SQL Server that is dedicated to Databricks ingestion and meets the privilege requirements.

  • Enable change tracking or built-in CDC: