View data lineage using Unity Catalog

This page describes how to visualize data lineage using Catalog Explorer, the data lineage system tables, and the REST API.

Data lineage overview

Unity Catalog captures runtime data lineage across queries run on Databricks. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, jobs, and dashboards related to the query. Lineage can be visualized in Catalog Explorer in near real time and retrieved programmatically using the lineage system tables and the Databricks REST API.

Lineage can also include external assets and workflows that are run outside of Databricks. This external lineage metadata feature is in Public Preview. See Bring your own data lineage.

Lineage is aggregated across all workspaces attached to a Unity Catalog metastore. This means that lineage captured in one workspace is visible in any other workspace that shares that metastore. Specifically, tables and other data objects registered in the metastore are visible to users who have at least BROWSE permissions on those objects, across all workspaces attached to the metastore. However, detailed information about workspace-level objects like notebooks and dashboards in other workspaces is masked (see Lineage limitations and Lineage permissions).

Lineage data is retained for one year.

The following image is a sample lineage graph.

[Image: Lineage overview]

For a demo of viewing data lineage, see Unity Catalog - Data Lineage.

For information about tracking the lineage of a machine learning model, see Track the data lineage of a model in Unity Catalog.

Requirements

To capture data lineage using Unity Catalog:

Tables must be registered in a Unity Catalog metastore.
External assets (those not registered in the Unity Catalog metastore) must be added as external metadata objects in Unity Catalog, configured to have relationships with other securable objects registered in your Unity Catalog metastore. See Bring your own data lineage.
Queries must use the Spark DataFrame API (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces such as notebooks or the SQL query editor.

To view data lineage:

You must have at least the BROWSE privilege on the parent catalog of the table or view. The parent catalog must also be accessible from the workspace. See Limit catalog access to specific workspaces.
For notebooks, jobs, or dashboards, you must have permissions on these objects as defined by the access control settings in the workspace. For details, see Lineage permissions.
For a Unity Catalog-enabled pipeline, you must have CAN VIEW permission on the pipeline.

Compute requirements:

Lineage tracking of streaming between Delta tables requires Databricks Runtime 11.3 LTS or above.
Column lineage tracking for Lakeflow Declarative Pipelines workloads requires Databricks Runtime 13.3 LTS or above.

Networking requirements:

You might need to update your outbound firewall rules to allow for connectivity to the Amazon Kinesis endpoint in the Databricks control plane. Typically this applies if your Databricks workspace is deployed in your own VPC or you use AWS PrivateLink within your Databricks network environment. To get the Kinesis endpoint for your workspace region, see Kinesis addresses. See also Configure a customer-managed VPC and Enable private connectivity using AWS PrivateLink.

View data lineage using Catalog Explorer

To use Catalog Explorer to view table lineage:

In your Databricks workspace, click Catalog.
Search or browse for your table.
Select the Lineage tab. The lineage panel appears and displays related tables.
To view an interactive graph of the data lineage, click See Lineage Graph.

By default, one level is displayed in the graph. Click the icon on a node to reveal more connections if they are available.
Click an arrow that connects nodes in the lineage graph to open the Lineage connection panel.

The Lineage connection panel shows details about the connection, including source and target tables, notebooks, and jobs.
To show a notebook associated with a table, select the notebook in the Lineage connection panel or close the lineage graph and click Notebooks.

To open the notebook in a new tab, click the notebook name.
To view column-level lineage, click a column in the graph to show links to related columns. For example, clicking the full_menu column in the sample graph shows the upstream columns that it was derived from.

View job lineage

To view job lineage, go to a table's Lineage tab, select Jobs, and select Downstream. The job name appears under Job Name as a consumer of the table.

View dashboard lineage

To view dashboard lineage, go to a table's Lineage tab and click Dashboards. The dashboard appears under Dashboard Name as a consumer of the table.

Get table lineage using Databricks Assistant

Databricks Assistant provides detailed information about table lineages and insights.

To get lineage information using Assistant:

In the workspace sidebar, click Catalog.
Browse or search for the catalog, click the catalog name, and then click the Assistant icon in the upper-right corner.
At the Assistant prompt, type:
- /getTableLineages to view upstream and downstream dependencies.
- /getTableInsights to access metadata-driven insights, such as user activity and query patterns.

These queries enable Assistant to answer questions like “show me downstream lineages” or “who queries this table most often.”


Query lineage data using system tables

You can use the lineage system tables to programmatically query lineage data. For detailed instructions, see Monitor account activity with system tables and Lineage system tables reference.

If your workspace is in a region that doesn't support lineage system tables, you can instead use the Data Lineage REST API to retrieve lineage data programmatically.
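
As a rough illustration of a system-table query, the following sketch builds a SQL statement that lists the tables read to produce a given table within a recent time window. It assumes the table name system.access.table_lineage and the columns source_table_full_name, target_table_full_name, entity_type, and event_time; confirm these against the Lineage system tables reference. The helper name upstream_tables_query and the 30-day default are hypothetical.

```python
import re

# Assumed table name; confirm against the Lineage system tables reference.
LINEAGE_TABLE = "system.access.table_lineage"

# Accept only three-level names made of simple identifiers, so the value
# can be embedded in the SQL text without quoting problems.
_FULL_NAME = re.compile(r"^[A-Za-z0-9_]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def upstream_tables_query(target_table: str, days: int = 30) -> str:
    """Build a query listing tables that were read to produce target_table."""
    if not _FULL_NAME.match(target_table):
        raise ValueError(f"expected <catalog>.<schema>.<table>, got {target_table!r}")
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        "SELECT DISTINCT source_table_full_name, entity_type\n"
        f"FROM {LINEAGE_TABLE}\n"
        f"WHERE target_table_full_name = '{target_table}'\n"
        "  AND source_table_full_name IS NOT NULL\n"
        f"  AND event_time >= current_timestamp() - INTERVAL {int(days)} DAYS"
    )


query = upstream_tables_query("lineage_data.lineagedemo.dinner")
print(query)
# In a Databricks notebook, where `spark` is predefined, the query could be
# run with: spark.sql(query).show()
```

Because lineage is retained for one year, a window longer than 365 days would not be expected to return additional rows.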

Retrieve lineage using the Data Lineage REST API

The Data Lineage REST API allows you to retrieve table and column lineage. However, system tables are the better option for programmatic retrieval of lineage data, so if your workspace is in a region that supports the lineage system tables (most regions do), query the system tables instead of using the REST API.

Important: To access Databricks REST APIs, you must authenticate.

Retrieve table lineage

This example retrieves lineage data for the dinner table.

Request

Bash
curl --netrc -X GET \
-H 'Content-Type: application/json' \
https://<workspace-instance>/api/2.0/lineage-tracking/table-lineage \
-d '{"table_name": "lineage_data.lineagedemo.dinner", "include_entity_lineage": true}'

Replace <workspace-instance> with your Databricks workspace instance name.

This example uses a .netrc file.
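
The same request can be sketched in Python with only the standard library. This is an illustration, not an official client: the helper name build_table_lineage_request, the example host name, and the use of a bearer token in place of the .netrc file are all assumptions. The function only builds the request, so it can be inspected without contacting a workspace.

```python
import json
import urllib.request


def build_table_lineage_request(workspace_instance: str, token: str,
                                table_name: str) -> urllib.request.Request:
    """Build (but do not send) the GET request shown in the curl example."""
    url = f"https://{workspace_instance}/api/2.0/lineage-tracking/table-lineage"
    body = json.dumps({"table_name": table_name, "include_entity_lineage": True})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication instead of .netrc.
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )


# Hypothetical placeholder values; replace them before sending.
request = build_table_lineage_request(
    "example.cloud.databricks.com",
    "<personal-access-token>",
    "lineage_data.lineagedemo.dinner",
)
print(request.get_method(), request.full_url)
# To send the request and decode the reply:
#   with urllib.request.urlopen(request) as resp:
#       lineage = json.load(resp)
```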

Response

JSON
{
  "upstreams": [
    {
      "tableInfo": {
        "name": "menu",
        "catalog_name": "lineage_data",
        "schema_name": "lineagedemo",
        "table_type": "TABLE"
      },
      "notebookInfos": [
        {
          "workspace_id": 4169371664718798,
          "notebook_id": 1111169262439324
        }
      ]
    }
  ],
  "downstreams": [
    {
      "notebookInfos": [
        {
          "workspace_id": 4169371664718798,
          "notebook_id": 1111169262439324
        }
      ]
    },
    {
      "tableInfo": {
        "name": "dinner_price",
        "catalog_name": "lineage_data",
        "schema_name": "lineagedemo",
        "table_type": "TABLE"
      },
      "notebookInfos": [
        {
          "workspace_id": 4169371664718798,
          "notebook_id": 1111169262439324
        }
      ]
    }
  ]
}
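
A response like this one can be reduced to lists of fully qualified table names. The following sketch works on a copy of the sample response; the helper names table_names and notebook_ids are hypothetical, and the code assumes that entries without a tableInfo object (such as the first downstream entry) describe notebook-only consumers.

```python
import json

# Compact copy of the sample response above.
SAMPLE_RESPONSE = json.loads("""
{
  "upstreams": [
    {"tableInfo": {"name": "menu", "catalog_name": "lineage_data",
                   "schema_name": "lineagedemo", "table_type": "TABLE"},
     "notebookInfos": [{"workspace_id": 4169371664718798,
                        "notebook_id": 1111169262439324}]}
  ],
  "downstreams": [
    {"notebookInfos": [{"workspace_id": 4169371664718798,
                        "notebook_id": 1111169262439324}]},
    {"tableInfo": {"name": "dinner_price", "catalog_name": "lineage_data",
                   "schema_name": "lineagedemo", "table_type": "TABLE"},
     "notebookInfos": [{"workspace_id": 4169371664718798,
                        "notebook_id": 1111169262439324}]}
  ]
}
""")


def table_names(entries):
    """Return full names of the entries that carry a tableInfo object."""
    names = []
    for entry in entries:
        info = entry.get("tableInfo")
        if info is None:
            continue  # notebook-only entry, as in the first downstream
        names.append(f"{info['catalog_name']}.{info['schema_name']}.{info['name']}")
    return names


def notebook_ids(entries):
    """Return the sorted, de-duplicated notebook IDs across all entries."""
    return sorted({nb["notebook_id"]
                   for entry in entries
                   for nb in entry.get("notebookInfos", [])})


upstream = table_names(SAMPLE_RESPONSE.get("upstreams", []))
downstream = table_names(SAMPLE_RESPONSE.get("downstreams", []))
print(upstream)    # ['lineage_data.lineagedemo.menu']
print(downstream)  # ['lineage_data.lineagedemo.dinner_price']
```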

Retrieve column lineage

This example retrieves column lineage data for the dessert column of the dinner table.

Request

Bash
curl --netrc -X GET \
-H 'Content-Type: application/json' \
https://<workspace-instance>/api/2.0/lineage-tracking/column-lineage \
-d '{"table_name": "lineage_data.lineagedemo.dinner", "column_name": "dessert"}'

Replace <workspace-instance> with your Databricks workspace instance name.

This example uses a .netrc file.

Response

JSON
{
  "upstream_cols": [
    {
      "name": "dessert",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "menu",
      "table_type": "TABLE"
    },
    {
      "name": "main",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "menu",
      "table_type": "TABLE"
    },
    {
      "name": "app",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "menu",
      "table_type": "TABLE"
    }
  ],
  "downstream_cols": [
    {
      "name": "full_menu",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "dinner_price",
      "table_type": "TABLE"
    }
  ]
}
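
To make the column response easier to read, the entries can be grouped by their parent table. The following sketch uses the column entries from the sample response; the helper name columns_by_table is hypothetical.

```python
from collections import defaultdict

# Column entries copied from the sample response above.
SAMPLE_RESPONSE = {
    "upstream_cols": [
        {"name": "dessert", "catalog_name": "lineage_data",
         "schema_name": "lineagedemo", "table_name": "menu", "table_type": "TABLE"},
        {"name": "main", "catalog_name": "lineage_data",
         "schema_name": "lineagedemo", "table_name": "menu", "table_type": "TABLE"},
        {"name": "app", "catalog_name": "lineage_data",
         "schema_name": "lineagedemo", "table_name": "menu", "table_type": "TABLE"},
    ],
    "downstream_cols": [
        {"name": "full_menu", "catalog_name": "lineage_data",
         "schema_name": "lineagedemo", "table_name": "dinner_price",
         "table_type": "TABLE"},
    ],
}


def columns_by_table(cols):
    """Group column names under their <catalog>.<schema>.<table> name."""
    grouped = defaultdict(list)
    for col in cols:
        table = f"{col['catalog_name']}.{col['schema_name']}.{col['table_name']}"
        grouped[table].append(col["name"])
    return dict(grouped)


upstream = columns_by_table(SAMPLE_RESPONSE.get("upstream_cols", []))
downstream = columns_by_table(SAMPLE_RESPONSE.get("downstream_cols", []))
print(upstream)    # {'lineage_data.lineagedemo.menu': ['dessert', 'main', 'app']}
print(downstream)  # {'lineage_data.lineagedemo.dinner_price': ['full_menu']}
```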

Lineage permissions

Lineage graphs share the same permission model as Unity Catalog. Tables and other data objects registered in the Unity Catalog metastore are visible only to users who have at least BROWSE permissions on those objects. If a user does not have the BROWSE or SELECT privilege on a table, they cannot explore its lineage. Lineage graphs display Unity Catalog objects across all workspaces attached to the metastore, as long as the user has adequate object permissions.

For example, run the following commands for userA:

SQL
GRANT USE SCHEMA ON SCHEMA lineage_data.lineagedemo TO `userA@company.com`;
GRANT SELECT ON TABLE lineage_data.lineagedemo.menu TO `userA@company.com`;

When userA views the lineage graph for the lineage_data.lineagedemo.menu table, they will see the menu table. They will not be able to see information about associated tables, such as the downstream lineage_data.lineagedemo.dinner table. The dinner table is displayed as a masked node in the display to userA, and userA cannot expand the graph to reveal downstream tables from tables they do not have permission to access.
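
The masking behavior can be pictured with a toy model. The following sketch is not how Databricks implements lineage permissions; it only mimics the rule described above, using a hypothetical visible_lineage helper, short table names, and a plain set of tables the user is allowed to browse.

```python
MASKED = "<masked>"


def visible_lineage(edges, start, browsable):
    """Walk downstream from start, masking tables the user cannot browse.

    Traversal stops at a masked node, so nothing beyond it is revealed.
    """
    shown = []
    seen = {start}
    frontier = [start]
    while frontier:
        table = frontier.pop(0)
        for child in edges.get(table, []):
            if child in seen:
                continue
            seen.add(child)
            if child in browsable:
                shown.append((table, child))
                frontier.append(child)  # keep expanding from visible tables
            else:
                shown.append((table, MASKED))  # do not expand past this node
    return shown


# Downstream edges from the sample data: menu -> dinner -> dinner_price.
edges = {"menu": ["dinner"], "dinner": ["dinner_price"]}

print(visible_lineage(edges, "menu", {"menu"}))
# [('menu', '<masked>')]
print(visible_lineage(edges, "menu", {"menu", "dinner", "dinner_price"}))
# [('menu', 'dinner'), ('dinner', 'dinner_price')]
```

In the first call, which corresponds to userA, the dinner table appears only as a masked node and dinner_price is never reached.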

If you run the following command to grant the BROWSE permission to userB, that user can view the lineage graph for any table in the lineage_data catalog:

SQL
GRANT BROWSE ON CATALOG lineage_data TO `userB@company.com`;

Likewise, lineage users must have specific permissions to view workspace objects like notebooks, jobs, and dashboards. In addition, they can see detailed information about workspace objects only when they are logged in to the workspace in which those objects were created. Detailed information about workspace-level objects in other workspaces is masked in the lineage graph.

For more information about managing access to securable objects in Unity Catalog, see Manage privileges in Unity Catalog. For more information about managing access to workspace objects like notebooks, jobs, and dashboards, see Access control lists.

Lineage limitations

Data lineage has the following limitations. These limitations also apply to lineage system tables:

Although lineage is aggregated for all workspaces that are attached to the same Unity Catalog metastore, details for workspace objects like notebooks and dashboards are visible only in the workspace in which they were created.
Because lineage is computed on a one-year rolling window, lineage collected more than one year ago is not displayed. For example, if a job or query reads data from table A and writes to table B, the link between table A and table B is displayed for one year only. You can filter lineage data by time frame within the one-year window.
Jobs that use the Jobs API runs submit request or the spark submit task type are unavailable in lineage views. Table and column level lineage is still captured for these workflows, but the link to the job run is not captured.
If a table or view is renamed, lineage is not captured for the renamed table or view.
If a schema or catalog is renamed, lineage is not captured for tables and views under the renamed catalog or schema.
If you use Spark SQL dataset checkpointing, lineage is not captured.
Unity Catalog captures lineage from Lakeflow Declarative Pipelines in most cases. However, in some instances, complete lineage coverage cannot be guaranteed, such as when pipelines use PRIVATE tables.
Lineage does not capture Stack functions.
Global temp views are not captured in lineage.
Tables under system.information_schema are not captured in lineage.
Unity Catalog captures lineage to the column level as much as possible. However, there are some cases where column-level lineage cannot be captured. These include:
- Column lineage cannot be captured if the source or the target is referenced by path (for example, select * from delta."s3://<bucket>/<path>"). Column lineage is supported only when both the source and target are referenced by table name (for example, select * from <catalog>.<schema>.<table>).
- Use of common table expressions (CTEs), column renaming, user-defined functions (UDFs), or Resilient Distributed Datasets (RDDs), all of which can obscure the mapping between source and target columns.
- Complete column-level lineage is not captured by default for MERGE operations.
  
  You can turn on lineage capture for MERGE operations by setting the Spark property spark.databricks.dataLineage.mergeIntoV2Enabled to true. Enabling this flag can slow down query performance, particularly in workloads that involve very wide tables.
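
As a minimal sketch, the property could be set from a notebook as follows. This assumes the property can be set at the session level through spark.conf.set; if it cannot, it would need to go in the Spark configuration of the compute resource instead. The helper name enable_merge_lineage is hypothetical.

```python
MERGE_LINEAGE_PROPERTY = "spark.databricks.dataLineage.mergeIntoV2Enabled"


def enable_merge_lineage(spark_session):
    """Set the MERGE lineage property and return the value now in effect."""
    # Assumption: the property is honored when set on the session config.
    spark_session.conf.set(MERGE_LINEAGE_PROPERTY, "true")
    return spark_session.conf.get(MERGE_LINEAGE_PROPERTY)


# In a Databricks notebook, where `spark` is predefined:
#   enable_merge_lineage(spark)
```

Given the performance cost noted above, it may be preferable to enable the property only in the jobs or notebooks that run MERGE operations whose column lineage you need.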
