Ingest data from SQL Server
Learn how to ingest data from SQL Server into Databricks using Lakeflow Connect.
The SQL Server connector supports Azure SQL Database, Azure SQL Managed Instance, and Amazon RDS for SQL Server. This includes SQL Server running on Azure virtual machines (VMs) and Amazon EC2. The connector also supports on-premises SQL Server using Azure ExpressRoute and AWS Direct Connect networking.
Requirements
To create an ingestion gateway and an ingestion pipeline, you must first meet the following requirements:

- Your workspace is enabled for Unity Catalog.
- Serverless compute is enabled for your workspace. See Serverless compute requirements.
- If you plan to create a connection: You have CREATE CONNECTION privileges on the metastore. See Manage privileges in Unity Catalog. If your connector supports UI-based pipeline authoring, you can create the connection and the pipeline at the same time by completing the steps on this page. However, if you use API-based pipeline authoring, you must create the connection in Catalog Explorer before you complete the steps on this page. See Connect to managed ingestion sources.
- If you plan to use an existing connection: You have USE CONNECTION privileges or ALL PRIVILEGES on the connection.
- You have USE CATALOG privileges on the target catalog.
- You have USE SCHEMA, CREATE TABLE, and CREATE VOLUME privileges on an existing schema, or CREATE SCHEMA privileges on the target catalog. For an example of granting these privileges, see the sketch following these requirements.
- You have access to a primary SQL Server instance. Change tracking and change data capture features are not supported on read replicas or secondary instances.
- Unrestricted permissions to create clusters, or a custom policy (API only). A custom policy for the gateway must meet the following requirements:
  - Family: Job Compute
  - Policy family overrides:
{
  "cluster_type": {
    "type": "fixed",
    "value": "dlt"
  },
  "num_workers": {
    "type": "unlimited",
    "defaultValue": 1,
    "isOptional": true
  },
  "runtime_engine": {
    "type": "fixed",
    "value": "STANDARD",
    "hidden": true
  }
}
Databricks recommends specifying the smallest possible worker nodes for ingestion gateways because worker node size does not impact gateway performance. The following compute policy enables Databricks to scale the ingestion gateway to meet the needs of your workload. The minimum requirement is 8 cores to enable efficient and performant data extraction from your source database.
{
  "driver_node_type_id": {
    "type": "fixed",
    "value": "r5n.16xlarge"
  },
  "node_type_id": {
    "type": "fixed",
    "value": "m5n.large"
  }
}
For more information about cluster policies, see Select a compute policy. For a sketch that creates a gateway compute policy with the Databricks SDK for Python, see the example following these requirements.
To ingest from SQL Server, you must first complete the steps in Configure Microsoft SQL Server for ingestion into Databricks.
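If you need to grant the privileges listed above, the owner of the securable (or a metastore admin) can do so in SQL. The following is a minimal sketch run from a Databricks notebook; the principal data-engineers and the catalog and schema names are placeholders for your own.

```python
# Minimal sketch: grant the Unity Catalog privileges required for ingestion.
# "data-engineers", "main", and "ingest_destination" are placeholders.
for stmt in [
    # Needed only if the user creates the connection themselves.
    "GRANT CREATE CONNECTION ON METASTORE TO `data-engineers`",
    # Needed to write to the target catalog and schema.
    "GRANT USE CATALOG ON CATALOG main TO `data-engineers`",
    "GRANT USE SCHEMA, CREATE TABLE, CREATE VOLUME ON SCHEMA main.ingest_destination TO `data-engineers`",
]:
    spark.sql(stmt)
```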
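If you create the gateway compute policy through the API, the following is a minimal sketch using the Databricks SDK for Python. The policy name is a placeholder, the job-cluster policy family ID for Job Compute is an assumption to verify in your workspace, and the overrides simply combine the example policies above.

```python
# Minimal sketch (assumptions: policy name, the "job-cluster" policy family ID for
# Job Compute, and node types copied from the example policies in the requirements).
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

overrides = {
    "cluster_type": {"type": "fixed", "value": "dlt"},
    "num_workers": {"type": "unlimited", "defaultValue": 1, "isOptional": True},
    "runtime_engine": {"type": "fixed", "value": "STANDARD", "hidden": True},
    "driver_node_type_id": {"type": "fixed", "value": "r5n.16xlarge"},
    "node_type_id": {"type": "fixed", "value": "m5n.large"},
}

policy = w.cluster_policies.create(
    name="lakeflow-sqlserver-gateway-policy",  # placeholder name
    policy_family_id="job-cluster",  # assumed Job Compute family ID; confirm in your workspace
    policy_family_definition_overrides=json.dumps(overrides),
)
print(policy.policy_id)
```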
Create a gateway and an ingestion pipeline
You can create the gateway and the ingestion pipeline using the Databricks UI, Databricks Asset Bundles, a Databricks notebook, or Terraform.

Databricks UI
1. In the sidebar of the Databricks workspace, click Data Ingestion.
2. On the Add data page, under Databricks connectors, click SQL Server.
3. On the Connection page of the ingestion wizard, select the connection that stores the SQL Server access credentials that you created in Configure Microsoft SQL Server for ingestion into Databricks. If you have the CREATE CONNECTION privilege on the metastore, you can click Create connection to create a new connection with the authentication details for SQL Server.
4. Click Next.
5. On the Ingestion setup page, enter a unique name for the ingestion pipeline. This pipeline moves data from the staging location to the destination.
6. Select a catalog and a schema to write event logs to. The event log contains audit logs, data quality checks, pipeline progress, and errors. If you have USE CATALOG and CREATE SCHEMA privileges on the catalog, you can click Create schema in the drop-down menu to create a new schema.
7. (Optional) Set Auto full refresh for all tables to On. When auto refresh is on, the pipeline automatically tries to fix issues like log cleanup events and certain types of schema evolution by fully refreshing the impacted table. If history tracking is enabled, a full refresh erases that history.
8. Enter a unique name for the ingestion gateway. The gateway is a pipeline that extracts changes from the source and stages them for the ingestion pipeline to load.
9. Select a catalog and a schema for the Staging location. A volume is created in this location to stage extracted data. If you have USE CATALOG and CREATE SCHEMA privileges on the catalog, you can click Create schema in the drop-down menu to create a new schema.
10. Click Create pipeline and continue.
11. On the Source page, select the tables to ingest. If you select specific tables, you can configure table settings:
    a. (Optional) On the Settings tab, specify a Destination name for each ingested table. This is useful to differentiate between destination tables when you ingest an object into the same schema multiple times. See Name a destination table.
    b. (Optional) Change the default History tracking setting. See Enable history tracking (SCD type 2).
12. Click Next, then click Save and continue.
13. On the Destination page, select a catalog and a schema to load data into. If you have USE CATALOG and CREATE SCHEMA privileges on the catalog, you can click Create schema in the drop-down menu to create a new schema.
14. Click Save and continue.
15. On the Database setup page, click Validate to confirm that your source is properly configured for Databricks ingestion. Any missing configurations are returned. For steps to resolve them, click Complete configuration. Then click Next. Alternatively, click Skip validation.
16. (Optional) On the Schedules and notifications page, click Create schedule. Set the frequency to refresh the destination tables.
17. (Optional) Click Add notification to set email notifications for pipeline operation success or failure, then click Save and run pipeline.
Databricks Asset Bundles

Before you ingest using Databricks Asset Bundles, you must have access to an existing connection. For instructions, see Connect to managed ingestion sources.
The staging catalog and schema can be the same as the destination catalog and schema. The staging catalog can't be a foreign catalog. Specify the staging location in the gateway_definition section of your bundle pipeline YAML file.
The ingestion gateway extracts snapshot and change data from the source database and stores it in the Unity Catalog staging volume. You must run the gateway as a continuous pipeline. This helps to accommodate any change log retention policies that you have on the source database.
The ingestion pipeline applies the snapshot and change data from the staging volume into destination streaming tables.
Bundles can contain YAML definitions of jobs and tasks, are managed using the Databricks CLI, and can be shared and run in different target workspaces (such as development, staging, and production). For more information, see What are Databricks Asset Bundles?
1. Create a new bundle using the Databricks CLI:

   databricks bundle init

2. Add two new resource files to the bundle:

   - A pipeline definition file (for example, resources/sqlserver_pipeline.yml). See pipeline.ingestion_definition and Examples.
   - A job definition file that controls the frequency of data ingestion (for example, resources/sqlserver_job.yml).

3. Deploy the pipeline using the Databricks CLI:

   databricks bundle deploy
Databricks notebook

Update the Configuration cell in the following notebook with the source connection, target catalog, target schema, and tables to ingest from the source.
Terraform

You can use Terraform to deploy and manage SQL Server ingestion pipelines. For a complete example framework, including Terraform configurations for creating gateways and ingestion pipelines, see the Lakeflow Connect Terraform examples repository on GitHub.
Verify successful data ingestion
The list view on the pipeline details page shows the number of records processed as data is ingested. These numbers refresh automatically.

The Upserted records and Deleted records columns are not shown by default. You can enable them by clicking on the columns configuration button and selecting them.
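You can also verify ingestion by querying the pipeline event log from the catalog and schema you selected for event logs during setup. The following is a minimal sketch; the table name main.ingest_logs.sqlserver_event_log is a placeholder for the event log location you configured.

```python
# Minimal sketch: summarize rows written per flow from the published pipeline event log.
# main.ingest_logs.sqlserver_event_log is a placeholder for the event log location you configured.
rows_written = spark.sql("""
    SELECT
      origin.flow_name AS flow_name,
      SUM(CAST(details:flow_progress.metrics.num_output_rows AS BIGINT)) AS rows_written
    FROM main.ingest_logs.sqlserver_event_log
    WHERE event_type = 'flow_progress'
    GROUP BY origin.flow_name
""")
rows_written.show()
```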
Examples
Use these examples to configure your pipeline.
Pipeline configuration
Databricks Asset Bundles

The following is an example pipeline definition file:
variables:
  # Common variables used in multiple places in the DAB definition.
  gateway_name:
    default: sqlserver-gateway
  dest_catalog:
    default: main
  dest_schema:
    default: ingest-destination-schema

resources:
  pipelines:
    gateway:
      name: ${var.gateway_name}
      gateway_definition:
        connection_name: <sqlserver-connection>
        gateway_storage_catalog: main
        gateway_storage_schema: ${var.dest_schema}
        gateway_storage_name: ${var.gateway_name}
      target: ${var.dest_schema}
      catalog: ${var.dest_catalog}

    pipeline_sqlserver:
      name: sqlserver-ingestion-pipeline
      ingestion_definition:
        ingestion_gateway_id: ${resources.pipelines.gateway.id}
        objects:
          # Modify this with your tables!
          - table:
              # Ingest the table test.ingestion_demo.lineitem to dest_catalog.dest_schema.lineitem.
              source_catalog: test
              source_schema: ingestion_demo
              source_table: lineitem
              destination_catalog: ${var.dest_catalog}
              destination_schema: ${var.dest_schema}
          - schema:
              # Ingest all tables in the test.ingestion_whole_schema schema to dest_catalog.dest_schema.
              # The destination table names will be the same as they are on the source.
              source_catalog: test
              source_schema: ingestion_whole_schema
              destination_catalog: ${var.dest_catalog}
              destination_schema: ${var.dest_schema}
      target: ${var.dest_schema}
      catalog: ${var.dest_catalog}
Databricks notebook

The following is an example Configuration cell of the ingestion notebook:
# The name of the UC connection with the credentials to access the source database
connection_name = "my_connection"
# The name of the UC catalog and schema to store the replicated tables
target_catalog_name = "main"
target_schema_name = "lakeflow_sqlserver_connector_cdc"
# The name of the UC catalog and schema to store the staging volume with intermediate
# CDC and snapshot data. Use the destination catalog/schema by default.
stg_catalog_name = target_catalog_name
stg_schema_name = target_schema_name
# The name of the Gateway pipeline to create
gateway_pipeline_name = "cdc_gateway"
# The name of the Ingestion pipeline to create
ingestion_pipeline_name = "cdc_ingestion"
# Construct the full list of tables to replicate.
# IMPORTANT: The letter case of catalog, schema, and table names must match exactly
# the case used in the source database system tables.
tables_to_replicate = replicate_full_db_schema("MY_DB", ["MY_DB_SCHEMA"])
# Append tables from additional schemas as needed:
# + replicate_tables_from_db_schema("MY_DB", "MY_SCHEMA_2", ["table3", "table4"])
Bundle job definition file
The following is an example job definition file for use with Databricks Asset Bundles. The job runs every day, one day after the previous run.
resources:
  jobs:
    sqlserver_dab_job:
      name: sqlserver_dab_job

      trigger:
        periodic:
          interval: 1
          unit: DAYS

      email_notifications:
        on_failure:
          - <email-address>

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.pipeline_sqlserver.id}
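Outside of the job schedule, you can also trigger an update of the ingestion pipeline on demand. The following is a minimal sketch using the Databricks SDK for Python; the pipeline ID and the table name lineitem are placeholders, and full_refresh_selection is shown only to illustrate fully refreshing specific tables (which erases history tracking for those tables).

```python
# Minimal sketch: trigger an on-demand update of the ingestion pipeline.
# "<ingestion-pipeline-id>" is a placeholder; look it up in the UI or with w.pipelines.list_pipelines().
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Incremental update of all tables in the pipeline.
w.pipelines.start_update(pipeline_id="<ingestion-pipeline-id>")

# Or fully refresh only selected destination tables (erases history tracking for them).
w.pipelines.start_update(
    pipeline_id="<ingestion-pipeline-id>",
    full_refresh_selection=["lineitem"],  # placeholder table name
)
```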
Common patterns
For advanced pipeline configurations, see Common patterns for managed ingestion pipelines.
Next steps
Start, schedule, and set alerts on your pipeline. See Common pipeline maintenance tasks.