Ingest data from SQL Server

Preview

The Microsoft SQL Server connector is in Public Preview.

This page describes how to ingest data from SQL Server and load it into Databricks using Lakeflow Connect. The SQL Server connector supports Azure SQL and Amazon RDS SQL databases. This includes SQL Server running on Azure virtual machines (VMs) and Amazon EC2. The connector also supports SQL Server on-premises using Azure ExpressRoute and AWS Direct Connect networking.

Overview of steps

  1. Configure your source database for ingestion.
  2. Create an ingestion gateway. The gateway connects to the SQL Server database, extracts snapshot and change data, and stores it in a Unity Catalog volume for staging.
  3. Create an ingestion pipeline. The pipeline applies snapshot and change data from the staging volume into destination streaming tables.
  4. Schedule the ingestion pipeline.

Before you begin

To create an ingestion pipeline, you must meet the following requirements:

  • Your workspace is enabled for Unity Catalog.

  • Serverless compute is enabled for notebooks, workflows, and DLT. See Enable serverless compute.

  • To create a connection: CREATE CONNECTION on the metastore.

    To use an existing connection: USE CONNECTION or ALL PRIVILEGES on the connection.

  • USE CATALOG on the target catalog.

  • USE SCHEMA, CREATE TABLE, and CREATE VOLUME on an existing schema, or CREATE SCHEMA on the target catalog. For a sketch of granting these privileges with the Databricks CLI, see the example after this list.

  • Unrestricted permissions to create clusters, or a custom policy. A custom policy must meet the following requirements:

    • Family: Job Compute

    • Policy family overrides:

      {
        "cluster_type": {
          "type": "fixed",
          "value": "dlt"
        },
        "num_workers": {
          "type": "unlimited",
          "defaultValue": 1,
          "isOptional": true
        },
        "runtime_engine": {
          "type": "fixed",
          "value": "STANDARD",
          "hidden": true
        }
      }
    • Databricks recommends specifying the smallest possible worker node type for ingestion gateways because worker nodes do not affect gateway performance. For example:

      "driver_node_type_id": {
      "type": "unlimited",
      "defaultValue": "r5.xlarge",
      "isOptional": true
      },
      "node_type_id": {
      "type": "unlimited",
      "defaultValue": "m4.large",
      "isOptional": true
      }

    For more information about cluster policies, see Select a cluster policy.
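
If you need to grant the privileges listed above to the user or service principal that will own the pipeline, the following is a minimal sketch that assumes the Databricks CLI grants command and the Unity Catalog permissions API. The principal, catalog, and schema names are placeholders, and the privilege identifiers use the API-style names; adapt them to your environment, or grant the equivalent privileges in Catalog Explorer or SQL instead.

Bash
# Hypothetical example: grant the catalog and schema privileges listed above to a pipeline owner.
# Replace <email-address>, <target-catalog>, and <target-schema> with your own values.
databricks grants update catalog <target-catalog> \
  --json '{"changes": [{"principal": "<email-address>", "add": ["USE_CATALOG"]}]}'

databricks grants update schema <target-catalog>.<target-schema> \
  --json '{"changes": [{"principal": "<email-address>", "add": ["USE_SCHEMA", "CREATE_TABLE", "CREATE_VOLUME"]}]}'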

Set up the source database for ingestion

See Configure Microsoft SQL Server for ingestion into Databricks.

Create a SQL Server connection

  1. In the Databricks workspace, click Catalog > External Data > Connections.
  2. Click Create connection. If you don't see this button, you don't have CREATE CONNECTION privileges.
  3. Enter a unique Connection name.
  4. For Connection type select SQL Server.
  5. For Host, specify the SQL Server domain name.
  6. For User and Password, enter your SQL Server login credentials.
  7. Click Create.
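
You can also create the connection programmatically. The following is a minimal sketch that assumes the Databricks CLI connections command and the Unity Catalog Connections API; the connection name, host, port, and credentials are placeholders, and for production use you would typically reference a secret rather than a plaintext password.

Bash
# Hypothetical example: create the SQL Server connection from the CLI instead of the UI.
databricks connections create --json '{
  "name": "sqlserver-connection",
  "connection_type": "SQLSERVER",
  "options": {
    "host": "<database-host>",
    "port": "1433",
    "user": "<username>",
    "password": "<password>"
  }
}'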

Create a staging catalog and schemas

The staging catalog and schema can be the same as the destination catalog and schema. The staging catalog can't be a foreign catalog.

  1. In the Databricks workspace, click Catalog.
  2. On the Catalog tab, click Create catalog. If you don't see this button, you don't have CREATE CATALOG privileges.
  3. Enter a unique name for the catalog, and then click Create.
  4. Select the catalog you created.
  5. Click Create schema. If you don't see this button, you don't have CREATE SCHEMA privileges.
  6. Enter a unique name for the schema, and then click Create.
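
If you prefer to script this step, the staging catalog and schema can also be created with the Databricks CLI. This sketch assumes the catalogs and schemas command groups; the names are placeholders and can match the destination catalog and schema you plan to use in the pipeline definition.

Bash
# Hypothetical example: create the staging catalog and schema from the CLI.
databricks catalogs create <staging-catalog>
databricks schemas create <staging-schema> <staging-catalog>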

Create the gateway and ingestion pipeline

The ingestion gateway extracts snapshot and change data from the source database and stores it in the Unity Catalog staging volume. You must run the gateway as a continuous pipeline. This helps to accommodate any change log retention policies that you have on the source database.

The ingestion pipeline applies the snapshot and change data from the staging volume into destination streaming tables.

note

Each ingestion pipeline must be associated with exactly one ingestion gateway.

The ingestion pipeline does not support more than one destination catalog and schema. If you need to write to multiple destination catalogs or schemas, create multiple gateway-pipeline pairs.

This section describes how to deploy an ingestion pipeline using Databricks Asset Bundles. Bundles can contain YAML definitions of jobs and tasks, are managed using the Databricks CLI, and can be shared and run in different target workspaces (such as development, staging, and production). For more information, see Databricks Asset Bundles.

  1. Create a new bundle using the Databricks CLI:

    Bash
    databricks bundle init
  2. Add two new resource files to the bundle:

    • A pipeline definition file (resources/sqlserver_pipeline.yml).
    • A workflow file that controls the frequency of data ingestion (resources/sqlserver_job.yml).

    The following is an example resources/sqlserver_pipeline.yml file:

    YAML
    variables:
      # Common variables used in multiple places in the DAB definition.
      gateway_name:
        default: sqlserver-gateway
      dest_catalog:
        default: main
      dest_schema:
        default: ingest-destination-schema

    resources:
      pipelines:
        gateway:
          name: ${var.gateway_name}
          gateway_definition:
            connection_name: <sqlserver-connection>
            gateway_storage_catalog: main
            gateway_storage_schema: ${var.dest_schema}
            gateway_storage_name: ${var.gateway_name}
          target: ${var.dest_schema}
          catalog: ${var.dest_catalog}
          channel: PREVIEW

        pipeline_sqlserver:
          name: sqlserver-ingestion-pipeline
          ingestion_definition:
            ingestion_gateway_id: ${resources.pipelines.gateway.id}
            objects:
              # Modify this with your tables!
              - table:
                  # Ingest the table test.ingestion_demo.lineitem to dest_catalog.dest_schema.lineitem.
                  source_catalog: test
                  source_schema: ingestion_demo
                  source_table: lineitem
                  destination_catalog: ${var.dest_catalog}
                  destination_schema: ${var.dest_schema}
              - schema:
                  # Ingest all tables in the test.ingestion_whole_schema schema to dest_catalog.dest_schema.
                  # The destination table names will be the same as the source table names.
                  source_catalog: test
                  source_schema: ingestion_whole_schema
                  destination_catalog: ${var.dest_catalog}
                  destination_schema: ${var.dest_schema}
          target: ${var.dest_schema}
          catalog: ${var.dest_catalog}
          channel: PREVIEW

    The following is an example resources/sqlserver_job.yml file:

    YAML
    resources:
      jobs:
        sqlserver_dab_job:
          name: sqlserver_dab_job

          trigger:
            # Run this job every day, exactly one day from the last run.
            # See https://docs.databricks.com/api/workspace/jobs/create#trigger
            periodic:
              interval: 1
              unit: DAYS

          email_notifications:
            on_failure:
              - <email-address>

          tasks:
            - task_key: refresh_pipeline
              pipeline_task:
                pipeline_id: ${resources.pipelines.pipeline_sqlserver.id}
  3. Deploy the pipeline using the Databricks CLI:

    Bash
    databricks bundle deploy
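
After the deploy succeeds, you can also validate and start the resources from the CLI. The following sketch assumes the resource keys from the example YAML above (gateway, pipeline_sqlserver, and sqlserver_dab_job); replace them if you used different keys.

Bash
# Optionally validate the bundle configuration.
databricks bundle validate

# Start the continuous ingestion gateway pipeline.
databricks bundle run gateway

# Trigger the ingestion pipeline once via the scheduled job.
databricks bundle run sqlserver_dab_job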

Start, schedule, and set alerts on your pipeline

You can create a schedule for the pipeline on the pipeline details page.

  1. After the pipeline has been created, revisit the Databricks workspace, and then click Pipelines.

    The new pipeline appears in the pipeline list.

  2. To view the pipeline details, click the pipeline name.

  3. On the pipeline details page, you can schedule the pipeline by clicking Schedule.

  4. To set notifications on the pipeline, click Settings, and then add a notification.

For each schedule that you add to a pipeline, Lakeflow Connect automatically creates a job for it. The ingestion pipeline is a task within the job. You can optionally add more tasks to the job.

Verify successful data ingestion

The list view on the pipeline details page shows the number of records processed as data is ingested. These numbers refresh automatically.
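
If you want to check progress outside the UI, the following is a minimal sketch that assumes the Databricks CLI pipelines command group; replace <pipeline-id> with the ID shown on the pipeline details page.

Bash
# Hypothetical example: inspect the ingestion pipeline's current state from the CLI.
databricks pipelines get <pipeline-id>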

Verify replication

The Upserted records and Deleted records columns are not shown by default. You can enable them by clicking the Columns configuration icon and selecting them.
