Use Unity Catalog with pipelines

Databricks recommends configuring Lakeflow pipelines with Unity Catalog. Using Unity Catalog is the default for newly created pipelines.

Pipelines configured with Unity Catalog publish all defined materialized views and streaming tables to the specified catalog and schema. Unity Catalog pipelines can read from other Unity Catalog tables and volumes.

To manage permissions on the tables created by a Unity Catalog pipeline, use GRANT and REVOKE.

Note

This article discusses functionality for the current default publishing mode for pipelines. Pipelines created before February 5, 2025, might use the legacy publishing mode and LIVE virtual schema. See LIVE schema (legacy).

Requirements

To create streaming tables and materialized views in a target schema in Unity Catalog, you must have the following permissions on the schema and parent catalog:

USE CATALOG privileges on the target catalog.
CREATE MATERIALIZED VIEW and USE SCHEMA privileges on the target schema if your pipeline creates materialized views.
CREATE TABLE and USE SCHEMA privileges on the target schema if your pipeline creates streaming tables.
If your pipeline creates new schemas, you must have USE CATALOG and CREATE SCHEMA privileges on the target catalog.
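
As a rough illustration of how the privileges above combine, the following sketch returns the privileges needed for a given pipeline. The function name and its boolean parameters are hypothetical and are not part of any Databricks API; only the privilege names come from the list above.

```python
def required_privileges(creates_materialized_views: bool,
                        creates_streaming_tables: bool,
                        creates_schemas: bool) -> dict:
    """Group the privileges listed above by the securable they apply to."""
    catalog = {"USE CATALOG"}  # always needed on the target catalog
    schema = set()
    if creates_materialized_views:
        schema |= {"CREATE MATERIALIZED VIEW", "USE SCHEMA"}
    if creates_streaming_tables:
        schema |= {"CREATE TABLE", "USE SCHEMA"}
    if creates_schemas:
        catalog.add("CREATE SCHEMA")
    return {"catalog": sorted(catalog), "schema": sorted(schema)}


# Hypothetical pipeline that defines materialized views and creates a new schema.
print(required_privileges(True, False, True))
```

The resulting privilege names can then be granted with GRANT statements on the catalog and schema.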

Compute requirements to run a Unity Catalog-enabled pipeline:

Your compute resource must be configured with standard access mode. Dedicated compute is not supported. See Access modes.

To query tables that are created by pipelines using Unity Catalog (including streaming tables and materialized views), use any of the following compute resources:

SQL warehouses.
Standard access mode compute on Databricks Runtime 13.3 LTS or above.
Dedicated access mode compute, if fine-grained access control is enabled on the dedicated compute (that is, it is running on Databricks Runtime 15.4 or above and serverless compute is enabled for the workspace). For more information, see Fine-grained access control on dedicated compute.
Dedicated access mode compute on Databricks Runtime 13.3 LTS through 15.3, only if the table owner runs the query.

Additional compute limitations apply. See the section that follows.

Limitations

The following are limitations when using Unity Catalog with pipelines:

By default, only the pipeline owner and workspace admins can view the driver logs from the compute that runs a Unity Catalog-enabled pipeline. To allow other users to access the driver logs, see Allow non-admin users to view the driver logs from a Unity Catalog-enabled pipeline.
Existing pipelines that use the Hive metastore cannot be upgraded to use Unity Catalog. To migrate an existing pipeline that writes to Hive metastore, you must create a new pipeline and re-ingest data from the data source(s). See Create a Unity Catalog pipeline by cloning a Hive metastore pipeline.
You cannot create a Unity Catalog-enabled pipeline in a workspace attached to a metastore that was created during the Unity Catalog Public Preview. See Upgrade to privilege inheritance.
JARs are not supported. Only third-party Python libraries are supported. See Manage Python dependencies for pipelines.
Data manipulation language (DML) queries that modify the schema of a streaming table are not supported.
A materialized view created in a pipeline cannot be used as a streaming source outside of that pipeline, for example, in another pipeline or a downstream notebook.
Data for materialized views and streaming tables is stored in the storage location for the containing schema. If a schema storage location is not specified, tables are stored in the catalog storage location. If neither a schema nor a catalog storage location is specified, tables are stored in the root storage location of the metastore.
The Catalog Explorer History tab does not show history for materialized views.
The LOCATION property is not supported when defining a table.
Unity Catalog-enabled pipelines cannot publish to the Hive metastore.
Global init scripts are not supported. Databricks recommends using the pipeline Environment settings to install dependencies. On classic compute, you can use cluster-scoped init scripts, but Databricks recommends the Environment settings instead. Serverless pipelines do not support init scripts. See Manage Python dependencies for pipelines.

Python UDF support is in Public Preview.
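
The storage location fallback described in the limitations above (schema, then catalog, then metastore root) can be pictured with a small sketch. The function name and the example paths are hypothetical; this only models the order of precedence and does not call any Databricks API.

```python
from typing import Optional


def resolve_storage_location(schema_location: Optional[str],
                             catalog_location: Optional[str],
                             metastore_root: str) -> str:
    """Pick the first location that is set: schema, then catalog, then metastore root."""
    return schema_location or catalog_location or metastore_root


# Hypothetical paths, for illustration only.
print(resolve_storage_location(None, "s3://example-bucket/catalog", "s3://example-bucket/root"))
```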

Note

The underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that do not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views.

Because the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers.

For example, suppose a materialized view definition includes a COUNT(DISTINCT field_a) clause. Even though the materialized view definition only includes the aggregate COUNT DISTINCT clause, the underlying files contain a list of the actual values of field_a.

Use Hive metastore and Unity Catalog pipelines together

Your workspace can contain pipelines that use Unity Catalog and pipelines that use the legacy Hive metastore. However, a single pipeline cannot write to both the Hive metastore and Unity Catalog. Existing pipelines that write to the Hive metastore cannot be upgraded to use Unity Catalog. To migrate an existing pipeline that writes to Hive metastore, you must create a new pipeline and re-ingest data from the data source(s). See Create a Unity Catalog pipeline by cloning a Hive metastore pipeline.

Existing pipelines not using Unity Catalog are unaffected by creating new pipelines configured with Unity Catalog. These pipelines continue to persist data to the Hive metastore using the configured storage location.

Unless specified otherwise in this document, all existing data sources and pipeline functionality are supported with pipelines that use Unity Catalog. Both the Python and SQL interfaces are supported with pipelines that use Unity Catalog.

Inactive tables

When a pipeline is configured to persist data to Unity Catalog, the pipeline manages the lifecycle and permissions of the table.

Tables can become inactive if their definition is removed from a pipeline. The next pipeline update marks the corresponding materialized view or streaming table entry as inactive.

If you change the pipeline’s default catalog or schema and do not use fully qualified table names in the pipeline source code, the next pipeline run creates the materialized view or streaming table in the new catalog or schema, and the previous materialized view or streaming table in the old location is marked as inactive.
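
To see why unqualified names follow the pipeline defaults while fully qualified names do not, consider this simplified model of name resolution. It is a sketch under stated assumptions: the function is hypothetical, the catalog and schema names are placeholders, and it ignores backtick-quoted identifiers that contain dots.

```python
def resolve_table_name(name: str, default_catalog: str, default_schema: str) -> str:
    """Expand a one-, two-, or three-part name against the pipeline defaults."""
    parts = name.split(".")
    if len(parts) == 3:      # catalog.schema.table: defaults are not used
        return name
    if len(parts) == 2:      # schema.table: only the default catalog applies
        return f"{default_catalog}.{name}"
    if len(parts) == 1:      # table: both defaults apply
        return f"{default_catalog}.{default_schema}.{name}"
    raise ValueError(f"Unexpected table name: {name!r}")


# Changing the default catalog moves the unqualified table ...
print(resolve_table_name("orders", "dev", "sales"))
print(resolve_table_name("orders", "prod", "sales"))
# ... but leaves the fully qualified one where it is.
print(resolve_table_name("dev.sales.orders", "prod", "sales"))
```

In this model, the table previously published under the old defaults is the one that becomes inactive.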

You can still query inactive tables, but the pipeline no longer updates them. To clean up materialized views or streaming tables, explicitly DROP the table. Inactive tables are deleted when the pipeline is deleted.

You can recover dropped tables within 7 days by using the UNDROP command.
To retain the legacy behavior where the materialized view or streaming table entry is removed from Unity Catalog on the next pipeline update, set the pipeline configuration "pipelines.dropInactiveTables": "true". The actual data is retained for a period so that it can be recovered if deleted by mistake. The data can be recovered within 7 days by adding the materialized view or streaming table back into the pipeline definition.
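
A minimal sketch of where this configuration key might sit, assuming the pipeline settings are edited as JSON. The pipeline name, catalog, and schema values are placeholders, and the surrounding field names are assumptions about the settings layout; only the pipelines.dropInactiveTables key and its value come from this article.

```python
import json

# Hypothetical pipeline settings; only the "configuration" entry is taken from the text above.
pipeline_settings = {
    "name": "example_pipeline",
    "catalog": "my_catalog",
    "schema": "my_schema",
    "configuration": {
        # Values in the configuration map are strings, so "true" is quoted.
        "pipelines.dropInactiveTables": "true",
    },
}

print(json.dumps(pipeline_settings, indent=2))
```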

Deleting the pipeline entirely (as opposed to removing a table definition from the pipeline source) also deletes all tables defined in that pipeline. The UI prompts you to confirm the deletion of a pipeline.

Delete a pipeline

When you delete a Unity Catalog pipeline, the associated materialized views, streaming tables, and views are also deleted.

To delete a pipeline and retain its tables, set the cascade parameter to false in the API request. The retained tables are inactive but can still be queried. You can move inactive tables to a new pipeline, where they are reactivated once they are attached to a flow. See Move tables between pipelines.

DELETE /api/2.0/pipelines/{pipeline_id}?cascade=false

See Delete a pipeline in the Databricks REST API documentation.
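
The request above could be assembled in Python roughly as follows. This is a sketch, not a definitive client: the helper name, workspace URL, pipeline ID, and token are placeholders, and bearer-token authentication is assumed. The request is built but not sent, so running the sketch has no side effects.

```python
import urllib.request


def build_delete_request(host: str, pipeline_id: str, token: str,
                         cascade: bool = False) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for a pipeline."""
    cascade_value = "true" if cascade else "false"
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}?cascade={cascade_value}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder values; replace them with your workspace URL, pipeline ID, and token.
req = build_delete_request("https://example.cloud.databricks.com", "1234-abcd", "<token>")
print(req.get_method(), req.full_url)

# To actually delete the pipeline while retaining its tables, send the request:
# urllib.request.urlopen(req)
```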

Write tables to Unity Catalog from a pipeline

To write your tables to Unity Catalog, configure the pipeline to use Unity Catalog when you create it in your workspace. When you create a pipeline, select Unity Catalog under Storage options, select a catalog in the Catalog drop-down menu, and select an existing schema or enter the name for a new schema in the Target schema drop-down menu. To learn about Unity Catalog catalogs, see What are catalogs in Databricks?. To learn about schemas in Unity Catalog, see What are schemas in Databricks?.

Note

When a pipeline publishes to Unity Catalog, Databricks stores some backing data in the reserved __databricks_internal catalog. This is expected. See The __databricks_internal catalog.

Ingest data into a Unity Catalog pipeline

Your pipeline configured to use Unity Catalog can read data from:

Unity Catalog managed and external tables, views, materialized views and streaming tables.
Hive metastore tables and views.
Auto Loader using the read_files() function to read from Unity Catalog external locations.
Apache Kafka and Amazon Kinesis.

The following are examples of reading from Unity Catalog and Hive metastore tables.

Batch ingestion from a Unity Catalog table

SQL
CREATE OR REFRESH MATERIALIZED VIEW
  table_name
AS SELECT
  *
FROM
  my_catalog.my_schema.table1;

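Python
from pyspark import pipelines as dp

@dp.materialized_view
def table_name():
  # Batch read of the Unity Catalog table by its three-level name.
  return spark.read.table("my_catalog.my_schema.table1")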

Stream changes from a Unity Catalog table

SQL
CREATE OR REFRESH STREAMING TABLE
  table_name
AS SELECT
  *
FROM
  STREAM(my_catalog.my_schema.table1);

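Python
from pyspark import pipelines as dp

@dp.table
def table_name():
  # Streaming read: new rows in the source table are processed incrementally.
  return spark.readStream.table("my_catalog.my_schema.table1")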

Ingest data from Hive metastore

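A pipeline that uses Unity Catalog can read data from Hive metastore tables through the catalog that exposes them, such as the hive_metastore catalog. In the following examples, replace <hms_federation_catalog> with the name of that catalog: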

SQL
CREATE OR REFRESH MATERIALIZED VIEW
  table_name
AS SELECT
  *
FROM
  <hms_federation_catalog>.some_schema.table;

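Python
from pyspark import pipelines as dp

@dp.materialized_view
def table_name():
  # Batch read of a Hive metastore table through its catalog.
  return spark.read.table("<hms_federation_catalog>.some_schema.table")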

Ingest data from Auto Loader

SQL
CREATE OR REFRESH STREAMING TABLE table_name
AS SELECT *
FROM STREAM read_files(
  "/path/to/uc/external/location",
  format => "json"
);

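Python
from pyspark import pipelines as dp

# Placeholder: a path inside a Unity Catalog external location.
path_to_uc_external_location = "/path/to/uc/external/location"

@dp.table(table_properties={"quality": "bronze"})
def table_name():
  return (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(path_to_uc_external_location)
  )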

Share materialized views

By default, only the pipeline owner has permission to query datasets created by the pipeline. You can give other users the ability to query a table by using GRANT statements, and you can revoke query access by using REVOKE statements. For more information about privileges in Unity Catalog, see Manage privileges in Unity Catalog.

Grant select on a table

SQL
GRANT SELECT ON TABLE
  my_catalog.my_schema.table_name
TO
  `user@databricks.com`

Revoke select on a table

SQL
REVOKE SELECT ON TABLE
  my_catalog.my_schema.table_name
FROM
  `user@databricks.com`

Grant create table or create materialized view privileges

SQL
GRANT CREATE { MATERIALIZED VIEW | TABLE } ON SCHEMA
  my_catalog.my_schema
TO
  { principal | user }

View lineage for a pipeline

Lineage for tables defined in pipelines is visible in Catalog Explorer. The Catalog Explorer lineage UI shows the upstream and downstream tables for materialized views or streaming tables in a Unity Catalog-enabled pipeline. To learn more about Unity Catalog lineage, see Lineage in Unity Catalog.

For a materialized view or streaming table in a Unity Catalog-enabled pipeline, the Catalog Explorer lineage UI also links to the pipeline that produced the materialized view or streaming table if the pipeline is accessible from the current workspace.

Add, change, or delete data in a streaming table

You can use data manipulation language (DML) statements, including insert, update, delete, and merge statements, to modify streaming tables published to Unity Catalog. Support for DML queries against streaming tables enables use cases such as updating tables for compliance with the General Data Protection Regulation (GDPR).

Note

DML statements that modify the table schema of a streaming table are not supported. Ensure that your DML statements do not attempt to evolve the table schema.
DML statements that update a streaming table can be run only on a SQL warehouse or on standard access mode (formerly shared) Unity Catalog compute running Databricks Runtime 13.3 LTS or above.
Because streaming requires append-only data sources, if your processing requires streaming from a source streaming table with changes (for example, by DML statements), set the skipChangeCommits flag when reading the source streaming table. When skipChangeCommits is set, transactions that delete or modify records on the source table are ignored. If your processing does not require a streaming table, you can use a materialized view (which does not have the append-only restriction) as the target table.
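
A minimal sketch of setting the skipChangeCommits flag when streaming from a source table that has been modified by DML. The function name and the table name are hypothetical. In pipeline source code, the returned reader would be the body of a function decorated with @dp.table, as in the ingestion examples earlier in this article; the decorator is omitted here so that the sketch has no dependencies.

```python
def read_ignoring_change_commits(spark, source_table: str):
    """Stream from source_table, ignoring commits that update or delete rows."""
    return (
        spark.readStream
        .option("skipChangeCommits", "true")  # skip transactions that modify or delete records
        .table(source_table)
    )


# Example call, assuming a SparkSession named `spark` is available:
# df = read_ignoring_change_commits(spark, "my_catalog.my_schema.source_streaming_table")
```

Appends to the source table are still processed; only commits that update or delete existing records are skipped.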

The following are examples of DML statements to modify records in a streaming table.

Delete records with a specific ID:

SQL
DELETE FROM my_streaming_table WHERE id = 123;

Update records with a specific ID:

SQL
UPDATE my_streaming_table SET name = 'Jane Doe' WHERE id = 123;

Publish tables with row filters and column masks

Row filters let you specify a function that applies as a filter whenever a table scan fetches rows. These filters ensure that subsequent queries only return rows for which the filter predicate evaluates to true.

Column masks let you mask a column's values whenever a table scan fetches rows. Future queries for that column return the evaluated function's result instead of the column's original value. For more information on using row filters and column masks, see Row filters and column masks.

Manage row filters and column masks

Row filters and column masks on materialized views and streaming tables should be added, updated, or dropped through the CREATE OR REFRESH statement.

For detailed syntax on defining tables with row filters and column masks, see Pipeline SQL language reference and Lakeflow pipelines Python language reference.

Row filter and column mask behavior

The following are important details when using row filters or column masks in a pipeline:

Refresh as owner: When a pipeline update refreshes a materialized view or streaming table, row filter and column mask functions run with the pipeline owner's rights. This means the table refresh uses the security context of the user who created the pipeline. Functions that check user context (such as CURRENT_USER and IS_MEMBER) are evaluated using the pipeline owner's user context.
Query: When querying a materialized view or streaming table, functions that check user context (such as CURRENT_USER and IS_MEMBER) are evaluated using the invoker's user context. This approach enforces user-specific data security and access controls based on the current user's context.
When creating materialized views over source tables that contain row filters and column masks, the refresh of the materialized view is always a full refresh. A full refresh reprocesses all data available in the source with the latest definitions. This process checks that security policies on the source tables are evaluated and applied with the most up-to-date data and definitions.

Audit row filters and column masks

Use DESCRIBE EXTENDED, INFORMATION_SCHEMA, or the Catalog Explorer to examine the existing row filters and column masks that apply to a given materialized view or streaming table. This functionality allows users to audit and review data access and protection measures on materialized views and streaming tables.
