Capture and view data lineage with Unity Catalog

Preview

This feature is in Public Preview.

You can use Unity Catalog to capture runtime data lineage across queries run on a Databricks cluster or SQL warehouse. Lineage is supported for all languages and is captured down to the column level. Lingeage data includes notebooks, workflows, and dashboards related to the query. Lineage can be visualized in Data Explorer in near real-time and retrieved with the Databricks REST API.

Lineage is aggregated across all workspaces attached to a Unity Catalog metastore. This means that lineage captured in one workspace is visible in any other workspace sharing that metastore. Users must have the correct permissions to view the lineage data. Lineage data is retained for 30 days.

This article describes visualizing lineage using Data Explorer and the REST API.

Requirements

The following are required to capture data lineage with Unity Catalog:

  • The workspace must have Unity Catalog enabled and be launched in the Premium tier.

  • Tables must be registered in a Unity Catalog metastore to be eligible for lineage capture.

  • Clusters and SQL warehouses must include the spark.databricks.dataLineage.enabled true Spark configuration. For more information about enabling data lineage on a cluster, see Data lineage. For more information about enabling data lineage on a SQL warehouse, see Configure a SQL warehouse to capture data lineage.

  • Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces. For examples of Databricks SQL and PySpark queries, see Examples.

  • To view the lineage of a table or view, users must have the SELECT privilege on the table or view.

  • To view lineage information for notebooks, workflows, or dashboards, users must have permissions on these objects as defined by the access control settings in the workspace. See Lineage permissions.

  • To capture lineage, you must create and modify data using tables. Lineage is not captured for data written directly to files.

Limitations

  • Streaming between Delta tables is supported only in Databricks Runtime 11.2 or higher.

  • Lineage is not captured when data is written directly to files in cloud storage, even if a table is defined at the cloud storage location. For example, spark.write.save(“s3://mybucket/mytable/”) will not produce lineage.

  • Because lineage is computed on a 30-day rolling window, lineage is not displayed for tables that have not been modified within the last 30 days.

  • Workflows that use the Jobs API runs submit request are unavailable when viewing lineage. Table and column level lineage is still captured when using the runs submit request, but the link to the run is not captured.

  • Lineage is not captured for Delta Live Tables pipelines.

Examples

Note

  • The following examples use the catalog name lineage_data and the schema name lineagedemo. To use a different catalog and schema, change the names used in the examples.

  • To complete this example, you must have CREATE and USAGE privileges on a schema. A metastore admin, catalog owner, or schema owner can grant these privileges. For example, to give all users in the group ‘data_engineers’ permission to create tables in the lineagedemo schema in the lineage_data catalog, a metastore admin can run the following queries:

    CREATE SCHEMA lineage_data.lineagedemo;
    GRANT USAGE, CREATE on SCHEMA lineage_data.lineagedemo to `data_engineers`;
    
  • You must use a cluster enabled for data lineage.

Capture and explore lineage

To capture lineage data, use the following steps:

  1. Go to your Databricks landing page, click Create Icon Create in the sidebar, and select Notebook from the menu.

  2. Enter a name for the notebook and select SQL in Default Language.

  3. In Cluster, select a lineage-enabled cluster.

  4. Click Create.

  5. In the first notebook cell, enter the following queries:

    CREATE TABLE IF NOT EXISTS
      lineage_data.lineagedemo.menu (
        recipe_id INT,
        app string,
        main string,
        dessert string
      );
    
    INSERT INTO lineage_data.lineagedemo.menu
        (recipe_id, app, main, dessert)
    VALUES
        (1,"Ceviche", "Tacos", "Flan"),
        (2,"Tomato Soup", "Souffle", "Creme Brulee"),
        (3,"Chips","Grilled Cheese","Cheesecake");
    
    CREATE TABLE
      lineage_data.lineagedemo.dinner
    AS SELECT
      recipe_id, concat(app," + ", main," + ",dessert)
    AS
      full_menu
    FROM
      lineage_data.lineagedemo.menu
    
  6. To run the queries, click in the cell and press shift+enter or click Run Menu and select Run Cell.

To use Data Explorer to view the lineage generated by these queries, use the following steps:

  1. Click Data in the sidebar.

  2. Click on the catalog name, click lineagedemo, and select the dinner table. You can also use the Search tables text box in the top bar to search for the dinner table.

  3. Select the Lineage tab. The Lineage panel appears and displays the menu table.

    Lineage panel
  4. To view an interactive graph of the data lineage, click See Lineage Graph. By default, one level is displayed in the graph. You can click on nodes to reveal more connections if they are available.

  5. Click on an arrow connecting nodes in the lineage graph to open the Lineage connection panel. The Lineage connection panel shows details about the connection, including source and target tables, notebooks, and workflows.

    Lineage graph
  6. To show the notebook associated with the dinner table, select the notebook in the Lineage connection panel or close the lineage graph and click Notebooks. To open the notebook in a new tab, click on the notebook name.

  7. To view the column-level lineage, click the Schema tab, select the dinner table, and click the full_menu column. The Column lineage panel displays the three columns from the menu table concatenated to derive the full_menu column.

    Full menu column lineage

To demonstrate creating and viewing lineage with a different language, for example, Python, use the following steps:

  1. Open the notebook you created previously, create a new cell, and enter the following Python code:

    %python
    from pyspark.sql.functions import rand, round
    df = spark.range(3).withColumn("price", round(10*rand(seed=42),2)).withColumnRenamed("id","recipe_id")
    
    df.write.mode("overwrite").saveAsTable("lineage_data.lineagedemo.price")
    
    dinner = spark.read.table("lineage_data.lineagedemo.dinner")
    price = spark.read.table("lineage_data.lineagedemo.price")
    
    dinner_price = dinner.join(price, on="recipe_id")
    dinner_price.write.mode("overwrite").saveAsTable("lineage_data.lineagedemo.dinner_price")
    
  2. Run the cell by clicking in the cell and pressing shift+enter or clicking Run Menu and selecting Run Cell.

  3. Open Data Explorer, click on the catalog name, click lineagedemo, and select the price table.

  4. Select the Lineage tab and click See Lineage Graph. Click on the lineage graph nodes to explore the data lineage generated by the SQL and Python queries.

    Expanded lineage graph
  5. Click on an arrow connecting nodes in the lineage graph to open the Lineage connection panel. The Lineage connection panel shows details about the connection, including source and target tables, notebooks, and workflows.

Capture and view workflow lineage

To capture lineage, your workflow must run on a cluster that includes the spark.databricks.dataLineage.enabled true Spark configuration. For more information about enabling data lineage on a cluster, see Data lineage.

To demonstrate viewing lineage for a Databricks workflow, use the following steps:

  1. Go to your Databricks landing page and switch to the Data Science & Engineering persona.

  2. Click Create Icon Create in the sidebar and select Notebook from the menu.

  3. Enter a name for the notebook and select SQL in Default Language.

  4. Click Create.

  5. In the first notebook cell, enter the following query:

    SELECT * FROM lineage_data.lineagedemo.menu
    
  6. Click Schedule in the top bar. In the schedule dialog, select Manual, select a lineage-enabled cluster, and click Create.

  7. Click Run now.

  8. Click Data in the sidebar to open the Data Explorer.

  9. Click on the catalog name, click lineagedemo, and select the menu table. You can also use the Search tables text box in the top bar to search for the menu table.

  10. Select the Lineage tab, click Workflows, and select the Downstream tab. The job name appears under Job Name as a consumer of the menu table.

Capture and view dashboard lineage

To demonstrate viewing lineage for a SQL dashboard, use the following steps:

Note

You must use a SQL warehouse enabled for data lineage.

  1. Go to your Databricks landing page and open the Data Explorer by clicking Data in the sidebar.

  2. Click on the catalog name, click lineagedemo, and select the menu table. You can also use the Search tables text box in the top bar to search for the menu table.

  3. Click Actions > Create a quick dashboard.

  4. Select columns to add to the dashboard and click Create.

  5. Go to the Data explorer, click on the catalog name, click lineagedemo, and select the menu table.

  6. Select the Lineage tab and click Dashboards. The dashboard name appears under Dashboard Name as a consumer of the menu table.

Lineage permissions

Lineage graphs share the same permission model as Unity Catalog. If a user does not have the SELECT privilege on a table, they will not be able to explore the lineage. Additionally, users can only see notebooks, workflows, and dashboards they have permission to view. For example, if you run the following commands for a non-admin user userA:

GRANT USAGE on lineage_data.lineagedemo to `userA@company.com`;
GRANT SELECT on lineage_data.lineagedemo.menu to `userA@company.com`;

When userA views the lineage graph for the lineage_data.lineagedemo.menu table, they will see the menu table, but will not be able to see information about associated tables, for example, the downstream lineage_data.lineagedemo.dinner table. The dinner table is displayed as a masked node in the display to userA, and userA cannot expand the graph to reveal downstream tables from tables they do not have permission to access.

For more information about managing access to workspace objects, see Workspace object access control.

Delete lineage data

Warning

The following instructions delete all objects stored in Unity Catalog. Use these instructions only if necessary. For example, to meet compliance requirements.

To delete lineage data, you must delete the metastore managing the Unity Catalog objects. For more information about deleting the metastore, see Delete a metastore. Data will be deleted within 30 days.

Data lineage API

The data lineage API allows you to retrieve table and column lineage.

Important

To access Databricks REST APIs, you must authenticate.

Retrieve table lineage

This example retrieves lineage data for the dinner table.

Request

curl --netrc -X GET \
-H 'Content-Type: application/json' \
https://<databricks-instance/api/2.0/lineage-tracking/table-lineage \
-d '{"table_name": "lineage_data.lineagedemo.dinner", "include_entity_lineage": true}}'

Replace <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response

{
  "upstreams": [
    {
      "tableInfo": {
        "name": "menu",
        "catalog_name": "lineage_data",
        "schema_name": "lineagedemo",
        "table_type": "TABLE"
      },
      "notebookInfos": [
        {
          "workspace_id": 4169371664718798,
          "notebook_id": 1111169262439324
        }
      ]
    }
  ],
  "downstreams": [
    {
      "notebookInfos": [
        {
          "workspace_id": 4169371664718798,
          "notebook_id": 1111169262439324
        }
      ]
    },
    {
      "tableInfo": {
        "name": "dinner_price",
        "catalog_name": "lineage_data",
        "schema_name": "lineagedemo",
        "table_type": "TABLE"
      },
      "notebookInfos": [
        {
          "workspace_id": 4169371664718798,
          "notebook_id": 1111169262439324
        }
      ]
    }
  ]
}

Retrieve column lineage

This example retrieves column data for the dinner table.

Request

curl --netrc -X GET \
-H 'Content-Type: application/json' \
https://<databricks-instance/api/2.0/lineage-tracking/column-lineage \
-d '{"table_name": "lineage_data.lineagedemo.dinner", "include_entity_lineage": true}}'

Replace <databricks-instance> with the Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

This example uses a .netrc file.

Response

{
  "upstream_cols": [
    {
      "name": "dessert",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "menu",
      "table_type": "TABLE"
    },
    {
      "name": "main",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "menu",
      "table_type": "TABLE"
    },
    {
      "name": "app",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "menu",
      "table_type": "TABLE"
    }
  ],
  "downstream_cols": [
    {
      "name": "full_menu",
      "catalog_name": "lineage_data",
      "schema_name": "lineagedemo",
      "table_name": "dinner_price",
      "table_type": "TABLE"
    }
  ]
}