
SQL Server connector FAQs

Preview

The Microsoft SQL Server connector is in Public Preview.

This page answers frequently asked questions about the SQL Server connector in Databricks Lakeflow Connect.

General managed connector FAQs

The answers in Managed connector FAQs apply to all managed connectors in Lakeflow Connect. Keep reading for SQL-Server-specific FAQs.

Connector-specific FAQs

The answers in this section are specific to the SQL Server connector.

How does Databricks connect to SQL Server?

Databricks connects to SQL Server using Transport Layer Security (TLS) and a JDBC connection. Credentials are stored securely in Unity Catalog and can only be retrieved if the user running the ingestion flow has the appropriate permissions. Databricks recommends creating a separate user in SQL Server for ingesting data. If there are databases or tables that you don't want to expose to this user, you can restrict access with built-in SQL Server permissions.
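As an illustration, the following sketch uses pyodbc to create a dedicated ingestion user and grant it only read and change tracking access. This is a minimal example under assumed names: the server, database, schema, login, and password are all placeholders, and the exact permissions your environment needs may differ.

```python
import pyodbc

# Placeholder connection details; adjust for your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-sql-server.example.com,1433;"
    "DATABASE=sales_db;"
    "UID=admin_user;PWD=admin_password;Encrypt=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Server-level login plus a database user dedicated to ingestion.
cur.execute("CREATE LOGIN lakeflow_ingest WITH PASSWORD = 'ChangeMe-Strong!123';")
cur.execute("CREATE USER lakeflow_ingest FOR LOGIN lakeflow_ingest;")

# Grant only what ingestion needs: read access to the tables you want to ingest
# and permission to read change tracking metadata. Objects that aren't granted
# stay invisible to this user.
cur.execute("GRANT SELECT ON SCHEMA::dbo TO lakeflow_ingest;")
cur.execute("GRANT VIEW CHANGE TRACKING ON SCHEMA::dbo TO lakeflow_ingest;")
```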

If the pipeline fails, does ingestion resume without data loss?

Yes. Databricks keeps track of what the connector has extracted from the source and applied in the destination. If a failure occurs, the connector can resume from that point as long as the change logs remain available on the source database. However, if the pipeline doesn't run again before the log retention period deletes those logs, a full refresh of the target tables is required.
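To reduce the chance of losing that resume point, you can lengthen the retention window on the source so change data survives longer between pipeline runs. The sketch below is a hedged example using pyodbc with placeholder names; it assumes change tracking or CDC is already enabled, and the right retention values depend on your schedule and storage budget.

```python
import pyodbc

# Placeholder connection to the source database being ingested.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-sql-server.example.com,1433;"
    "DATABASE=sales_db;UID=admin_user;PWD=admin_password;Encrypt=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Change tracking: keep change information for 7 days so a stalled pipeline
# has more time to catch up before a full refresh becomes necessary.
cur.execute(
    "ALTER DATABASE sales_db SET CHANGE_TRACKING "
    "(CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);"
)

# CDC: lengthen the cleanup job's retention to 7 days (10080 minutes).
cur.execute("EXEC sys.sp_cdc_change_job @job_type = N'cleanup', @retention = 10080;")
```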

Which SQL Server variations does the connector support?

The connector supports Azure SQL and Amazon RDS SQL databases. This includes SQL Server running on Azure virtual machines (VMs) and Amazon EC2. The connector also supports on-premises SQL Server over Azure ExpressRoute, AWS Direct Connect, or a VPN, provided sufficient bandwidth is available.

How does the connector incrementally pull data?

The connector uses Microsoft Change Tracking and Microsoft Change Data Capture (CDC), where available. If a table has a primary key, Databricks recommends using change tracking for optimal performance. If a table doesn't have a primary key, Databricks recommends using CDC. If change tracking and CDC are both enabled, the connector uses change tracking.
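Both mechanisms are enabled on the source with standard SQL Server statements. The following hedged sketch runs them through pyodbc; the database, schema, and table names are placeholders, and CDC additionally requires SQL Server Agent and an edition or service tier that supports it.

```python
import pyodbc

# Placeholder connection details for the source database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-sql-server.example.com,1433;"
    "DATABASE=sales_db;UID=admin_user;PWD=admin_password;Encrypt=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Change tracking: preferred for tables that have a primary key.
cur.execute(
    "ALTER DATABASE sales_db SET CHANGE_TRACKING = ON "
    "(CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);"
)
cur.execute("ALTER TABLE dbo.orders ENABLE CHANGE_TRACKING;")

# CDC: used for tables without a primary key.
cur.execute("EXEC sys.sp_cdc_enable_db;")
cur.execute(
    "EXEC sys.sp_cdc_enable_table "
    "@source_schema = N'dbo', @source_name = N'order_events', "
    "@role_name = NULL, @supports_net_changes = 0;"
)
```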

Does the connector capture timezones for date and time columns?

No. Date and time values are ingested in UTC.
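If you need local-time values downstream, you can derive them in Databricks after ingestion. A minimal PySpark sketch, assuming a Databricks notebook (where spark is predefined) and hypothetical table and column names:

```python
from pyspark.sql import functions as F

# Hypothetical destination table and timestamp column; adjust to your pipeline.
df = spark.table("main.ingest_sqlserver.orders")

# Values arrive in UTC; derive a local-time column for reporting if needed.
df_local = df.withColumn(
    "order_ts_pacific",
    F.from_utc_timestamp(F.col("order_ts"), "America/Los_Angeles"),
)
df_local.show()
```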

Can I customize the schedule of the ingestion gateway?

No. The ingestion gateway must run in continuous mode to avoid changes being dropped due to log retention. If changes have been dropped, a full refresh is required for all tables.

How does the connector handle a table without a primary key?

The connector treats all columns except large objects as a bundled primary key. If there are duplicate rows in the source table, these rows are ingested as a single row in the destination table.
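To see the effect, this hedged PySpark sketch mimics that behavior on a toy DataFrame: treating all non-LOB columns as the key amounts to deduplicating across those columns, so two identical source rows surface as a single destination row. It illustrates the semantics only, not the connector's implementation, and assumes a Databricks notebook where spark is predefined.

```python
from pyspark.sql import Row

# Toy source table with no primary key and one exact duplicate row.
source_rows = [
    Row(customer="acme", item="widget", qty=5),
    Row(customer="acme", item="widget", qty=5),   # duplicate of the first row
    Row(customer="globex", item="gadget", qty=2),
]
source_df = spark.createDataFrame(source_rows)

# Treating every (non-LOB) column as the key collapses exact duplicates,
# which is what the destination table ends up containing.
dest_df = source_df.dropDuplicates(source_df.columns)
dest_df.show()  # 2 rows: the duplicate pair appears once
```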

How often can I schedule the ingestion pipeline to run?

There is no limit on how often the ingestion pipeline can be scheduled to run. However, Databricks recommends at least 5 minutes between runs because it takes some time for the serverless compute to start up. Databricks doesn't support running the ingestion pipeline in continuous mode.
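One way to put the ingestion pipeline on a fixed schedule is a Databricks job with a pipeline task. The sketch below uses the Databricks SDK for Python with a placeholder pipeline ID and a 15-minute cron; it is an assumption-level example, so verify the job settings against the current SDK and Jobs API documentation for your workspace.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up workspace auth from the environment

# Placeholder pipeline ID for the SQL Server ingestion pipeline.
job = w.jobs.create(
    name="sqlserver-ingestion-schedule",
    tasks=[
        jobs.Task(
            task_key="run_ingestion_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<your-pipeline-id>"),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0/15 * * * ?",  # every 15 minutes
        timezone_id="UTC",
    ),
)
print(job.job_id)
```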

Why am I not seeing all of the rows from my database in the initial pipeline run?

The ingestion gateway extracts historical and CDC data as soon as it starts running. The ingestion pipeline might run before all of this data has been extracted, so only part of it is applied to the target tables. It can take a few runs of the ingestion pipeline before all of the data has been extracted and applied to the target tables.