Skip to main content

Capture and view data lineage using Unity Catalog

This article describes how to capture and visualize data lineage using Catalog Explorer.

You can use Unity Catalog to capture runtime data lineage across queries run on SAP Databricks. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks related to the query. Lineage can be visualized in Catalog Explorer in near real time.

Lineage is aggregated across all workspaces attached to a Unity Catalog metastore. This means that lineage captured in one workspace is visible in any other workspace that shares that metastore. Specifically, tables and other data objects registered in the metastore are visible to users who have at least BROWSE permissions on those objects, across all workspaces attached to the metastore. However, detailed information about workspace-level objects like notebooks in other workspaces is masked.

Lineage data is retained for one year.

Requirements

The following are required to capture data lineage using Unity Catalog:

  • The workspace must have Unity Catalog enabled.
  • Tables must be registered in a Unity Catalog metastore.
  • Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces.
  • To view the lineage of a table or view, users must have at least the BROWSE privilege on the parent catalog of the table or view. The parent catalog must also be accessible from the workspace.
  • To view lineage information for notebooks, users must have permissions on these objects as defined by the access control settings in the workspace.

Examples

note
  • The following examples use the catalog name lineage_data and the schema name lineagedemo. To use a different catalog and schema, change the names used in the examples.

  • To complete this example, you must have CREATE and USE SCHEMA privileges on a schema. A metastore admin, catalog owner, schema owner, or user with the MANAGE privilege on the schema can grant these privileges. For example, to give all users in the group 'data_engineers' permission to create tables in the lineagedemo schema in the lineage_data catalog, a user with one of the above privileges or roles can run the following queries:

    SQL
    CREATE SCHEMA lineage_data.lineagedemo;
    GRANT USE SCHEMA, CREATE on SCHEMA lineage_data.lineagedemo to `data_engineers`;

Capture and explore lineage

To capture lineage data:

  1. Go to your SAP Databricks landing page, click New Icon New in the sidebar, and select Notebook from the menu.

  2. Enter a name for the notebook and select SQL in Default Language.

  3. Click Create.

  4. In the first notebook cell, enter the following queries:

    SQL
    CREATE TABLE IF NOT EXISTS
    lineage_data.lineagedemo.menu (
    recipe_id INT,
    app string,
    main string,
    dessert string
    );

    INSERT INTO lineage_data.lineagedemo.menu
    (recipe_id, app, main, dessert)
    VALUES
    (1,"Ceviche", "Tacos", "Flan"),
    (2,"Tomato Soup", "Souffle", "Creme Brulee"),
    (3,"Chips","Grilled Cheese","Cheesecake");

    CREATE TABLE
    lineage_data.lineagedemo.dinner
    AS SELECT
    recipe_id, concat(app," + ", main," + ",dessert)
    AS
    full_menu
    FROM
    lineage_data.lineagedemo.menu
  5. To run the queries, click in the cell and press shift+enter or click Run Menu and select Run Cell.

To use Catalog Explorer to view the lineage generated by these queries:

  1. In the Search box in the top bar of the SAP Databricks workspace, search for the lineage_data.lineagedemo.dinner table and select it.

  2. Select the Lineage tab. The lineage panel appears and displays related tables (for this example it's the menu table).

  3. To view an interactive graph of the data lineage, click See Lineage Graph. By default, one level is displayed in the graph. Click the Plus Sign Icon icon on a node to reveal more connections if they are available.

  4. Click an arrow that connects nodes in the lineage graph to open the Lineage connection panel. The Lineage connection panel shows details about the connection, including source and target tables, notebooks, and jobs.

    Lineage graph.

  5. To show the notebook associated with the dinner table, select the notebook in the Lineage connection panel or close the lineage graph and click Notebooks. To open the notebook in a new tab, click the notebook name.

  6. To view the column-level lineage, click a column in the graph to show links to related columns. For example, clicking on the 'full_menu' column shows the upstream columns the column was derived from:

    Full menu column lineage.

To view lineage using a different language, for example, Python:

  1. Open the notebook you created previously, create a new cell, and enter the following Python code:

    Python
    %python
    from pyspark.sql.functions import rand, round
    df = spark.range(3).withColumn("price", round(10*rand(seed=42),2)).withColumnRenamed("id","recipe_id")

    df.write.mode("overwrite").saveAsTable("lineage_data.lineagedemo.price")

    dinner = spark.read.table("lineage_data.lineagedemo.dinner")
    price = spark.read.table("lineage_data.lineagedemo.price")

    dinner_price = dinner.join(price, on="recipe_id")
    dinner_price.write.mode("overwrite").saveAsTable("lineage_data.lineagedemo.dinner_price")
  2. Run the cell by clicking in the cell and pressing shift+enter or clicking Run Menu and selecting Run Cell.

  3. In the Search box in the top bar of the SAP Databricks workspace, search for the lineage_data.lineagedemo.price table and select it.

  4. Go to the Lineage tab and click See Lineage Graph. Click the Plus Sign Icon icons to explore the data lineage generated by the queries.

    Expanded lineage graph.

  5. Click an arrow that connects nodes in the lineage graph to open the Lineage connection panel. The Lineage connection panel shows details about the connection, including source and target tables, notebooks, and jobs.

Lineage permissions

Lineage graphs share the same permission model as Unity Catalog. Tables and other data objects registered in the Unity Catalog metastore are visible only to users who have at least BROWSE permissions on those objects. If a user does not have the BROWSE or SELECT privilege on a table, they cannot explore its lineage. Lineage graphs display Unity Catalog objects across all workspaces attached to the metastore, as long as the user has adequate object permissions.

For example, run the following commands for userA:

SQL
GRANT USE SCHEMA on lineage_data.lineagedemo to `userA@company.com`;
GRANT SELECT on lineage_data.lineagedemo.menu to `userA@company.com`;

When userA views the lineage graph for the lineage_data.lineagedemo.menu table, they will see the menu table. They will not be able to see information about associated tables, such as the downstream lineage_data.lineagedemo.dinner table. The dinner table is displayed as a masked node in the display to userA, and userA cannot expand the graph to reveal downstream tables from tables they do not have permission to access.

If you run the following command to grant the BROWSE permission to userB, that user can view the lineage graph for any table in the lineage_data schema.:

SQL
GRANT BROWSE on lineage_data to `userB@company.com`;

Likewise, lineage users must have specific permissions to view workspace objects like notebooks. In addition, they can only see detailed information about workspace objects when they are logged into the workspace in which those objects were created. Detailed information about workspace-level objects in other workspaces is masked in the lineage graph.

Delete lineage data

warning

The following instructions delete all objects stored in Unity Catalog. Use these instructions only if necessary. For example, to meet compliance requirements.

To delete lineage data, you must delete the metastore managing the Unity Catalog objects. Data will be deleted within 90 days.

Get table lineage using Databricks Assistant

Databricks Assistant provides detailed information on table lineages and insights.

To get lineage information using Assistant:

  1. Go to your SAP Databricks landing page and open Catalog Explorer by clicking Catalog icon Catalog in the sidebar.
  2. Click the catalog name, and then click the In-product help assistant icon - color Assistant icon in the upper-right corner.
  3. At the Assistant prompt, type:
    • /getTableLineages to view upstream and downstream dependencies.
    • /getTableInsights to access metadata-driven insights, such as user activity and query patterns.

These queries enable Assistant to answer questions like “show me downstream lineages” or “who queries this table most often.”

Databricks Assistant provides table lineage and insights.