Maintain SQL Server ingestion pipelines

This page describes ongoing operations for maintaining SQL Server ingestion pipelines.

General pipeline maintenance

The pipeline maintenance tasks in this section apply to all managed connectors in Lakeflow Connect.

Fully refresh target tables

Fully refreshing the ingestion pipeline clears the data and state of the target tables, then reprocesses all records from the data source. You can fully refresh all tables in the pipeline or select tables to refresh.

  1. In the sidebar of the Databricks workspace, click Pipelines.
  2. Select the pipeline.
  3. On the pipeline details page, click Full refresh all or click Select tables for refresh, select the desired tables, then click Full refresh selection.
Important

The ingestion pipeline update might fail during the Initializing or Resetting tables phase. Lakeflow Connect automatically retries the pipeline several times. If the automatic retries are interrupted manually or ultimately fail, manually start a new pipeline update with the same table refresh selection. Failing to do so can leave the target tables in an inconsistent state with partial data. If manual retries also fail, create a support ticket.

Change the ingestion pipeline schedule

  1. In the sidebar of the Databricks workspace, click Pipelines.
  2. Select the pipeline, and then click Schedule.

Customize alerts and notifications

Lakeflow Connect automatically sets up notifications for all ingestion pipelines and scheduling jobs. You can customize notifications in the UI or using the Pipelines API.

  1. In the sidebar of the Databricks workspace, click Pipelines.
  2. Select your pipeline.
  3. Click Schedule.
  4. If you already have a schedule that you want to receive notifications for:
     a. Identify the schedule in the list.
     b. Click the kebab menu, and then click Edit.
     c. Click More options, and then add your notifications.
  5. If you need a new schedule:
     a. Click Add schedule.
     b. Configure your schedule.
     c. Click More options, and then add your notifications.
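When configuring notifications through the Pipelines API instead of the UI, the settings are supplied as JSON. The following is a minimal sketch of what a notifications block might look like; the email address is a placeholder, and the exact field names and alert values should be confirmed against the Pipelines API reference for your workspace.

```python
import json

# Hypothetical notifications block for a pipeline settings payload.
# "email_recipients" and "alerts" follow the Pipelines API naming as an
# assumption; the recipient address is a placeholder.
settings = {
    "notifications": [
        {
            "email_recipients": ["data-team@example.com"],
            "alerts": [
                "on-update-failure",
                "on-update-fatal-failure",
            ],
        }
    ]
}

# Serialize for use with the Pipelines API or the CLI's --json flag.
print(json.dumps(settings, indent=2))
```

The serialized JSON can then be passed to the pipeline update endpoint or saved to a file for the CLI.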

Specify tables to ingest

The Pipelines API provides two methods to specify tables to ingest in the objects field of the ingestion_definition:

  • Table specification: Ingests an individual table from the specified source catalog and schema to the specified destination catalog and schema.
  • Schema specification: Ingests all tables from the specified source catalog and schema into the specified destination catalog and schema.

If you choose to ingest an entire schema, you should review the limitations on the number of tables per pipeline for your connector.
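The two specification shapes might look like the following sketch of an objects field. The catalog, schema, and table names are placeholders, and the nested field names (source_catalog, source_schema, and so on) are an assumption based on the common shape of ingestion_definition payloads; verify them against the Pipelines API reference.

```python
import json

# Sketch of the "objects" field inside an ingestion_definition.
# All names below are placeholders, and the field names are assumed.
ingestion_definition = {
    "ingestion_definition": {
        "objects": [
            {
                # Table specification: ingest a single table.
                "table": {
                    "source_catalog": "sqlserver_catalog",
                    "source_schema": "dbo",
                    "source_table": "orders",
                    "destination_catalog": "main",
                    "destination_schema": "ingest",
                }
            },
            {
                # Schema specification: ingest every table in the schema.
                "schema": {
                    "source_catalog": "sqlserver_catalog",
                    "source_schema": "sales",
                    "destination_catalog": "main",
                    "destination_schema": "ingest",
                }
            },
        ]
    }
}

print(json.dumps(ingestion_definition, indent=2))
```

A payload like this could be saved to a file and passed to the CLI commands shown below via the --json flag.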

CLI commands

To edit the pipeline, run the following command:

databricks pipelines update --json "<pipeline_definition OR JSON file path>"

To get the pipeline definition, run the following command:

databricks pipelines get "<your_pipeline_id>"

To delete the pipeline, run the following command:

databricks pipelines delete "<your_pipeline_id>"

For more information, run the following commands:

databricks pipelines --help
databricks pipelines <create|update|get|delete|...> --help

Remove unused staging files

For ingestion pipelines created after January 6, 2025, volume staging data is automatically scheduled for deletion after 25 days and physically removed after 30 days. An ingestion pipeline that has not completed successfully for 25 days or longer might result in data gaps in the destination tables. To avoid gaps, you must trigger a full refresh of the target tables.

For ingestion pipelines created before January 6, 2025, contact Databricks Support to request manual enablement of automatic retention management for staging CDC data.

The following data is automatically cleaned up:

  • CDC data files
  • Snapshot files
  • Staging table data

Configure full refresh behavior

You can schedule when full refresh snapshots occur and enable automatic full refresh to recover from unsupported schema changes. See Configure full refresh behavior for database connectors.

Restart the ingestion gateway

To decrease the load on the source database, the ingestion gateway only checks periodically for new tables. It might take up to 6 hours for new tables to be discovered. If you want to speed up this process, restart the gateway.

Full refresh behavior

When you trigger a full refresh of a table, Databricks optimizes the process to reduce downtime and maintain data availability:

  1. Snapshot request: When you request a full refresh, the ingestion gateway immediately begins creating a new snapshot of the source table. The destination streaming table is excluded from refresh selection until the snapshot completes.
  2. Continued availability: During the snapshot process, the destination streaming table retains its existing data and remains available for queries. No updates, appends, or deletes are applied to the table while the snapshot is in progress.
  3. Atomic refresh: After the snapshot completes, Databricks automatically performs the full refresh in a single update. This update applies all snapshot data and any CDC records accumulated since the snapshot was requested.

For example, if your table has 50 records at the end of update 15, and you request a full refresh in update 16:

  1. The ingestion gateway begins creating a snapshot during update 16.
  2. The table continues to show the original 50 records until the snapshot completes.
  3. When the snapshot completes (in update 16 or later, depending on the source table size), the full refresh is automatically applied in one atomic operation.

This approach significantly reduces downtime during full refresh operations and helps prevent PENDING_RESET and timeout errors.