Managed connectors in Lakeflow Connect
Managed connectors in Lakeflow Connect are in various release states.
This article provides an overview of managed connectors in Databricks Lakeflow Connect for ingesting data from SaaS applications and databases. The resulting ingestion pipeline is governed by Unity Catalog and is powered by serverless compute and DLT. Managed connectors use efficient incremental reads and writes to make data ingestion faster, more scalable, and more cost-efficient, while keeping your data fresh for downstream consumption.
SaaS connector components
A SaaS connector has the following components:
Component | Description |
---|---|
Connection | A Unity Catalog securable object that stores authentication details for the application. |
Ingestion pipeline | A pipeline that copies the data from the application into the destination tables. The ingestion pipeline runs on serverless compute. |
Destination tables | The tables where the ingestion pipeline writes the data. These are streaming tables, which are Delta tables with extra support for incremental data processing. |
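To make these components concrete, here is a minimal sketch using the Databricks SDK for Python that creates a connection and a serverless ingestion pipeline that writes streaming tables. The connection type, option keys, catalog, schema, and object selection are illustrative assumptions rather than the exact spec for any particular connector; see your connector's API reference for the precise fields.

```python
# Minimal sketch of the three SaaS connector components, assuming the
# Databricks SDK for Python. Names, option keys, and spec fields are
# illustrative; check your connector's API reference for the exact shape.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import ConnectionType

w = WorkspaceClient()

# 1. Connection: a Unity Catalog securable object that stores the source's
#    authentication details. Assumes your SDK version exposes a connection
#    type for your SaaS application (Salesforce is used as an example).
connection = w.connections.create(
    name="my_salesforce_connection",            # hypothetical name
    connection_type=ConnectionType.SALESFORCE,  # assumption: member exists in your SDK version
    options={},                                 # auth options are source-specific
)

# 2. Ingestion pipeline: a serverless pipeline that reads from the connection
#    and copies the selected objects into the destination schema.
pipeline = w.pipelines.create(
    name="my_salesforce_ingestion",
    serverless=True,
    catalog="main",             # hypothetical destination catalog
    target="raw_salesforce",    # hypothetical destination schema
    # ingestion_definition=...  # the source objects to ingest; the field names
    #                           # are connector-specific, so they're omitted here
)

# 3. Destination tables: streaming tables that the pipeline creates and updates
#    in the catalog and schema above; there is no separate create call for them.
```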
Database connector components
A database connector has the following components:
Component | Description |
---|---|
Connection | A Unity Catalog securable object that stores authentication details for the database. |
Ingestion gateway | A pipeline that extracts snapshots, change logs, and metadata from the source database. The gateway runs on classic compute, and it runs continuously to capture changes before change logs can be truncated in the source. |
Staging storage | A Unity Catalog volume that temporarily stores extracted data before it's applied to the destination tables. This lets you run the ingestion pipeline on whatever schedule you'd like, even as the gateway continuously captures changes, and it helps with failure recovery. A staging volume is created automatically when you deploy the gateway, and you can customize the catalog and schema where it lives. Data is automatically purged from staging after 30 days. |
Ingestion pipeline | A pipeline that moves the data from staging storage into the destination tables. The pipeline runs on serverless compute. |
Destination tables | The tables where the ingestion pipeline writes the data. These are streaming tables, which are Delta tables with extra support for incremental data processing. |
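The sketch below, again using the Databricks SDK for Python, shows how these components relate: the gateway is a pipeline that runs continuously on classic compute, deploying it creates the staging volume, and a separate serverless pipeline applies the staged data to the destination tables. The names and commented-out spec fields are assumptions; the exact gateway and ingestion definitions are documented per connector.

```python
# Minimal sketch of the database connector components, assuming the Databricks
# SDK for Python. The commented-out spec fields are placeholders; see your
# connector's documentation for the exact gateway and ingestion definitions.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# 1. Connection: a Unity Catalog securable object that stores the database
#    credentials (create it in the UI or with w.connections.create).

# 2. Ingestion gateway: a pipeline on classic compute that runs continuously,
#    extracting snapshots and change logs. Deploying it also creates the
#    staging volume; the staging catalog and schema are customizable.
gateway = w.pipelines.create(
    name="sqlserver_gateway",    # hypothetical name
    # gateway_definition=...     # references the connection plus the staging
    #                            # catalog/schema/volume; connector-specific
)

# 3. Ingestion pipeline: a serverless pipeline that applies the staged changes
#    to the destination streaming tables on whatever schedule you choose.
ingestion = w.pipelines.create(
    name="sqlserver_ingestion",  # hypothetical name
    serverless=True,
    catalog="main",              # hypothetical destination catalog
    target="raw_sqlserver",      # hypothetical destination schema
    # ingestion_definition=...   # the selected tables plus a reference to the
    #                            # gateway; connector-specific
)
```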
Orchestration
You can run your ingestion pipeline on one or more custom schedules. For each schedule that you add to a pipeline, Lakeflow Connect automatically creates a job. The ingestion pipeline is a task within that job, and you can optionally add more tasks to the job.
For database connectors, the ingestion gateway runs in its own job as a continuous task.
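The job that Lakeflow Connect creates for a schedule is an ordinary Databricks job: a cron schedule plus a task that runs the ingestion pipeline. The sketch below, using the Databricks SDK for Python, shows the equivalent structure if you want to build such a job yourself or attach extra tasks; the pipeline ID, job name, and cron expression are placeholders.

```python
# Sketch of a job that runs an ingestion pipeline on a custom schedule, using
# the Databricks SDK for Python. The pipeline ID, job name, and cron expression
# are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="daily-salesforce-ingestion",
    tasks=[
        jobs.Task(
            task_key="ingest",
            pipeline_task=jobs.PipelineTask(pipeline_id="<ingestion-pipeline-id>"),
        ),
        # Optionally add more tasks here that depend on task_key="ingest",
        # for example a notebook task that transforms the ingested tables.
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # every day at 06:00
        timezone_id="UTC",
    ),
)
```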
Incremental ingestion
Lakeflow Connect uses incremental ingestion to improve pipeline efficiency. On the first run of your pipeline, it ingests all of the selected data from the source. In parallel, it tracks changes to the source data. On each subsequent run of the pipeline, it uses that change tracking to ingest only the data that's changed from the prior run, whenever possible.
The exact approach depends on what's available in your data source. For example, SQL Server supports both change tracking and change data capture (CDC). In contrast, the Salesforce connector selects a cursor column from a fixed list of options.
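As a conceptual illustration of cursor-based change tracking (not a Databricks API), the sketch below shows why subsequent runs can skip unchanged rows: the pipeline remembers the highest cursor value it has seen and only reads rows beyond it.

```python
# Conceptual illustration of cursor-based incremental ingestion; Lakeflow
# Connect manages this for you, and nothing here is a Databricks API.
from datetime import datetime, timezone

def incremental_read(source_rows, last_cursor):
    """Return the rows changed since last_cursor and the new cursor value."""
    changed = [row for row in source_rows if row["updated_at"] > last_cursor]
    new_cursor = max((row["updated_at"] for row in changed), default=last_cursor)
    return changed, new_cursor

rows = [
    {"id": 1, "updated_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2025, 1, 2, tzinfo=timezone.utc)},
]

# First run: the cursor starts at the epoch, so all selected rows are ingested.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
first_batch, cursor = incremental_read(rows, epoch)    # both rows
# Subsequent run: only rows whose cursor column advanced are ingested.
second_batch, cursor = incremental_read(rows, cursor)  # no rows
```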
Some sources or specific tables don't support incremental ingestion at this time. Databricks plans to expand coverage for incremental support.
Networking
There are several options for connecting to a SaaS application or database.
- Connectors for SaaS applications reach out to the source's APIs. They're also automatically compatible with serverless egress controls.
- Connectors for cloud databases can connect to the source via Private Link. Alternatively, if your workspace has a Virtual Network (VNet) or Virtual Private Cloud (VPC) that's peered with the VNet or VPC hosting your database, then you can deploy the ingestion gateway inside of it.
- Connectors for on-premises databases can connect using services like AWS Direct Connect and Azure ExpressRoute.
Deployment
You can deploy ingestion pipelines using Databricks Asset Bundles (DABs), which enable best practices like source control, code review, testing, and continuous integration and delivery (CI/CD). Bundles are managed using the Databricks CLI and can be deployed to different target workspaces, such as development, staging, and production.
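For example, a CI job can drive the standard bundle commands from a short script. The sketch below assumes the Databricks CLI is installed and authenticated, that your databricks.yml defines a prod target, and that the bundle contains a resource with the hypothetical key salesforce_ingestion.

```python
# Sketch of driving Databricks Asset Bundle commands from CI, assuming the
# Databricks CLI is installed and authenticated. The "prod" target and the
# "salesforce_ingestion" resource key are hypothetical.
import subprocess

def bundle(*args: str) -> None:
    """Run a `databricks bundle ...` command and fail the build on error."""
    subprocess.run(["databricks", "bundle", *args], check=True)

bundle("validate", "-t", "prod")                     # check the bundle configuration
bundle("deploy", "-t", "prod")                       # deploy pipelines and jobs to the target
bundle("run", "salesforce_ingestion", "-t", "prod")  # optionally trigger the ingestion
```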
Failure recovery
As a fully managed service, Lakeflow Connect aims to recover from issues automatically when possible. For example, when a connector fails, it automatically retries with exponential backoff.
However, it's possible that an error requires your intervention (for example, when credentials expire). In these cases, the connector tries to avoid missing data by storing the last position of the cursor. It can then pick back up from that position on the next run of the pipeline when possible.
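The sketch below is a conceptual illustration of that behavior (it is not Lakeflow Connect code): transient failures are retried with exponential backoff, an error that exhausts the retries surfaces for your intervention, and the saved cursor position lets the next run resume without missing data.

```python
# Conceptual illustration of retry with exponential backoff; not a Databricks API.
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # an error that needs your intervention surfaces here
            # Wait roughly 1s, 2s, 4s, ... plus jitter before the next attempt.
            time.sleep(base_delay * 2 ** attempt + random.random())

# Because the connector checkpoints its cursor only after a successful write,
# a run that fails midway can resume from the last saved position.
```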
Monitoring
Lakeflow Connect provides robust alerting and monitoring to help you maintain your pipelines. This includes event logs, cluster logs, pipeline health metrics, and data quality metrics.
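For example, you can query a pipeline's event log from a Databricks notebook (where spark and display are predefined) with the event_log() table function; the pipeline ID is a placeholder, and the columns selected below follow the event log schema.

```python
# Sketch of inspecting an ingestion pipeline's event log from a notebook.
# Assumes a Databricks notebook (spark and display are available) and that you
# have access to the pipeline; replace the placeholder ID with your own.
recent_issues = spark.sql("""
    SELECT timestamp, event_type, level, message
    FROM event_log('<ingestion-pipeline-id>')
    WHERE level IN ('WARN', 'ERROR')
    ORDER BY timestamp DESC
""")
display(recent_issues)
```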
History tracking
The history tracking setting, also known as the slowly changing dimensions (SCD) setting, determines how to handle changes in your data over time. Turn history tracking off (SCD type 1) to overwrite outdated records as they're updated and deleted in the source. Turn history tracking on (SCD type 2) to maintain a history of those changes. Deleting a table or column in the source does not delete that data from the destination, even when SCD type 1 is selected.
For example, suppose you ingest a table that records each user's favorite color. On January 2, Alice's favorite color changes to purple in the source.
If history tracking is off (SCD type 1), the next run of the ingestion pipeline updates that row in the destination table.
If history tracking is on (SCD type 2), the ingestion pipeline keeps the old row and adds the update as a new row. It marks the old row as inactive so that you know which row is up to date.
Not all connectors support history tracking (SCD type 2).
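If the destination table follows the usual DLT convention for SCD type 2, a pair of __START_AT and __END_AT validity columns (an assumption to verify on your connector's destination tables), you can query the current and historical versions from a notebook as sketched below; the table name is hypothetical.

```python
# Sketch of querying an SCD type 2 destination table from a notebook, assuming
# DLT-style __START_AT / __END_AT validity columns; verify the column names on
# your own destination tables. The table name is hypothetical.
current_rows = spark.sql("""
    SELECT *
    FROM main.raw_crm.users        -- hypothetical destination table
    WHERE __END_AT IS NULL         -- only the currently active version of each row
""")

full_history = spark.sql("""
    SELECT *
    FROM main.raw_crm.users
    ORDER BY __START_AT            -- Alice's old row and her new purple row both appear
""")
```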
Feature compatibility
The following table summarizes feature availability per connector. For additional features and limitations, see the documentation for your specific connector.
Feature | Google Analytics | Salesforce | Workday | SQL Server | ServiceNow |
---|---|---|---|---|---|
Status | Public Preview | Public Preview | Public Preview | Gated Public Preview. Reach out to your account team to learn more. | Gated Public Preview. Reach out to your account team to learn more. |
UI-based pipeline authoring | |||||
API-based pipeline authoring | |||||
DABs | |||||
Incremental ingestion | | With a temporary exception for formula fields | | | With exceptions when your table lacks a cursor field |
Unity Catalog governance | |||||
Orchestration using Databricks Workflows | |||||
SCD type 2 | |||||
API-based column selection and deselection | |||||
Automated schema evolution: New and deleted columns | |||||
Automated schema evolution: Data type changes | |||||
Automated schema evolution: Column renames | Treated as a new column (new name) and deleted column (old name). | Treated as a new column (new name) and deleted column (old name). | Treated as a new column (new name) and deleted column (old name). | When DDL objects are enabled, the connector can rename the column. When DDL objects are not enabled, the connector treats this as a new column (new name) and a deleted column (old name). In either case, it requires a full refresh. | Treated as a new column (new name) and deleted column (old name). |
Automated schema evolution: New tables | If you ingest the entire schema. See the limitations on the number of tables per pipeline. | If you ingest the entire schema. See the limitations on the number of tables per pipeline. | N/A | If you ingest the entire schema. See the limitations on the number of tables per pipeline. | If you ingest the entire schema. See the limitations on the number of tables per pipeline. |
Maximum number of tables per pipeline | 250 | 250 | 250 | 250 | 250 |
Dependence on external services
Databricks SaaS, database, and other fully managed connectors depend on the accessibility, compatibility, and stability of the application, database, or external service they connect to. Databricks does not control these external services and, therefore, has limited (if any) influence over their changes, updates, and maintenance.
If changes, disruptions, or circumstances related to an external service impede or render impractical the operation of a connector, Databricks may discontinue or cease maintaining that connector. Databricks will make reasonable efforts to notify customers of discontinuation or cessation of maintenance, including updates to the applicable documentation.