Track source data lineage for managed ingestion pipelines
Applies to:
- SaaS connectors
- Database connectors
When a managed ingestion pipeline runs, Lakeflow Connect automatically records lineage from the source tables in your SaaS application or source database to the destination Delta tables in Unity Catalog. This extends the lineage graph that Unity Catalog already captures for downstream queries, jobs, dashboards, and notebooks, so you can trace ingested data end to end. End-to-end source lineage supports data governance, discovery, and change management workflows for ingested data.
For each source table, Lakeflow Connect writes a Unity Catalog external metadata object (the upstream node in the lineage graph) and an external lineage relationship from that object to the destination table, with column-level mappings. For background on external lineage in Unity Catalog, see Bring your own data lineage.
Requirements
The identity that runs the pipeline must have the CREATE EXTERNAL METADATA privilege on the metastore. If the pipeline is configured to run as a service principal, grant the privilege to the service principal. See Configure the Run as identity for a pipeline.
There is no setting to enable on the pipeline. After a pipeline update completes, the pipeline populates source lineage automatically.
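If the privilege is missing, you can grant it from a notebook or SQL editor. The following sketch builds the GRANT statement string; the principal name `pipeline-sp-app-id` is a hypothetical placeholder for your service principal's application ID, and you would execute the result with `spark.sql()`.

```python
def build_grant_statement(principal: str) -> str:
    """Build the SQL GRANT for the CREATE EXTERNAL METADATA privilege
    on the metastore. Illustrative; execute the result with spark.sql()."""
    # Backticks protect principals with special characters, such as
    # email addresses or service principal application IDs.
    return f"GRANT CREATE EXTERNAL METADATA ON METASTORE TO `{principal}`"

# Hypothetical service principal application ID:
print(build_grant_statement("pipeline-sp-app-id"))
```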
How pipelines populate source lineage
After a pipeline update finishes processing a table, Lakeflow Connect does the following for each ingested source object:
- Creates or updates a Unity Catalog external metadata object that represents the source table. The object records the source connection name, source catalog, schema, and table, along with the source column names and the source system type (for example, `MicrosoftSQLServer`, `PostgreSQL`, `Salesforce`).
- Creates or updates an external lineage relationship from the external metadata object to the destination Delta table, with a 1:1 column-level mapping.
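The information recorded on the external metadata object can be pictured as a simple record. This is an illustrative model only; the field names below are hypothetical and do not reflect the actual Unity Catalog API schema.

```python
from dataclasses import dataclass, field

@dataclass
class SourceTableMetadata:
    """Illustrative model of what Lakeflow Connect records per source table.
    Field names are hypothetical, not the Unity Catalog API schema."""
    connection_name: str          # Unity Catalog connection used for ingestion
    source_catalog: str
    source_schema: str
    source_table: str
    columns: list[str] = field(default_factory=list)  # source column names
    system_type: str = "Other"    # e.g. MicrosoftSQLServer, PostgreSQL, Salesforce
```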
The external metadata name is `<connection-name>:<source-table-full-name>`, with each `.` replaced by `__`. For example, a SQL Server connection named `sql_prod` ingesting `sales.dbo.Customers` produces the external metadata name `sql_prod:sales__dbo__Customers`. Because the name is keyed on the connection, all pipelines that ingest the same source table through the same connection share the same external metadata object and the same upstream lineage edges.
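The naming convention described above can be sketched as a small helper, which is handy when you want to look up the external metadata object for a known source table:

```python
def external_metadata_name(connection_name: str, source_table_full_name: str) -> str:
    """Derive the external metadata object name used by Lakeflow Connect:
    <connection-name>:<source-table-full-name>, with each '.' replaced by '__'."""
    return f"{connection_name}:{source_table_full_name.replace('.', '__')}"

print(external_metadata_name("sql_prod", "sales.dbo.Customers"))
# → sql_prod:sales__dbo__Customers
```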
Lineage creation is best-effort. If writing lineage metadata fails (for example, because of a missing privilege), the pipeline logs the failure and continues. After you fix the underlying issue, the next pipeline update populates the missing lineage.
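The best-effort behavior follows a common pattern: a failure to write lineage metadata is logged but never propagated, so it cannot fail the ingestion update itself. A minimal sketch of that pattern (the function names here are illustrative, not Lakeflow Connect internals):

```python
import logging

def write_lineage_best_effort(write_fn, table: str) -> bool:
    """Illustrative best-effort wrapper: log and continue on failure so a
    lineage error never fails the pipeline update. Returns True on success."""
    try:
        write_fn(table)
        return True
    except Exception as exc:
        # The pipeline records the failure and moves on; the next update
        # retries once the underlying issue (e.g. a missing privilege) is fixed.
        logging.warning("Lineage write failed for %s: %s", table, exc)
        return False
```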
View source lineage
To view source lineage for an ingested table:
- In your Databricks workspace, click Catalog.
- Open the destination Delta table that the pipeline writes to.
- Click the Lineage tab.
The upstream node is the external metadata object that represents the source table. Click the node to see the source connection, source catalog, schema, and table, along with the column-level mappings to the destination table.
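Besides Catalog Explorer, lineage is also queryable through the `system.access.table_lineage` system table. The sketch below builds such a query as a string; column names are taken from the lineage system table schema as I understand it, so verify them against your workspace before relying on this.

```python
def lineage_query(destination_table: str) -> str:
    """Build a query against the system.access.table_lineage system table to
    list upstream sources of a destination table (illustrative sketch)."""
    # NOTE: interpolating the table name directly is fine for interactive use;
    # prefer parameterized queries in application code.
    return (
        "SELECT source_table_full_name, source_type, event_time "
        "FROM system.access.table_lineage "
        f"WHERE target_table_full_name = '{destination_table}'"
    )

print(lineage_query("main.sales.customers"))
```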
Limitations
- Each source table is represented by a single external metadata object per connection. That means:
  - Manual edits to the external metadata object don't persist. The next pipeline update overwrites them with values derived from the pipeline configuration.
  - Pipelines that share a connection share the same upstream lineage. If multiple pipelines use the same connection to ingest the same source table, each update overwrites the external metadata object. The overwrites are idempotent because Lakeflow Connect always writes the same content for a given source table on a given connection.
- Source system types that aren't recognized are recorded with the `Other` system type. Recognized types are SQL Server, PostgreSQL, MySQL, Oracle, Salesforce, ServiceNow, and Workday.
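The fallback behavior amounts to a lookup with a default. The sketch below illustrates it; the lookup keys are hypothetical normalized identifiers, not values Lakeflow Connect actually uses internally.

```python
# Hypothetical normalized keys mapped to the recognized system type labels.
RECOGNIZED_SYSTEM_TYPES = {
    "sqlserver": "MicrosoftSQLServer",
    "postgresql": "PostgreSQL",
    "mysql": "MySQL",
    "oracle": "Oracle",
    "salesforce": "Salesforce",
    "servicenow": "ServiceNow",
    "workday": "Workday",
}

def resolve_system_type(source: str) -> str:
    """Map a source system identifier to its recognized system type,
    falling back to 'Other' for anything unrecognized."""
    return RECOGNIZED_SYSTEM_TYPES.get(source.lower(), "Other")

print(resolve_system_type("PostgreSQL"))  # → PostgreSQL
print(resolve_system_type("netsuite"))    # → Other
```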