Using Unity Catalog with Structured Streaming

You can use Structured Streaming with Unity Catalog to manage data governance for your incremental and streaming workloads on Databricks. This document outlines supported functionality and limitations, and provides recommended best practices for using Unity Catalog and Structured Streaming together.

What Structured Streaming functionality does Unity Catalog support?

Unity Catalog does not add any explicit limits for the Structured Streaming sources and sinks available on Databricks. The Unity Catalog data governance model allows you to stream data from managed and external tables in Unity Catalog. You can also use external locations managed by Unity Catalog to interact with data using object storage URIs. You can write to external tables using either table names or file paths; you can interact with managed tables in Unity Catalog only by table name.
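
For example, the following sketch (Python) reads a managed table by name and an external table by path. The table name main.default.events and the bucket path are illustrative placeholders, not names from this document, and spark refers to the SparkSession that Databricks notebooks provide.

# Managed tables must be read by table name.
managed_stream = spark.readStream.table("main.default.events")

# External tables can also be read by path through a Unity Catalog external location.
external_stream = (
    spark.readStream
    .format("delta")
    .load("s3://my-bucket/table1/")
)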

You can use external locations managed by Unity Catalog when specifying paths for Structured Streaming checkpoints. To learn more about securely connecting storage with Unity Catalog, see Manage external locations and storage credentials.
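
For example, a streaming write to a managed table might keep its checkpoint under a path governed by an external location. This is a minimal sketch; the table names and checkpoint URI below are placeholders.

(spark.readStream
    .table("main.default.raw_events")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/checkpoints/clean_events/")  # path under an external location
    .trigger(availableNow=True)
    .toTable("main.default.clean_events")
)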

For both interactive notebooks and scheduled jobs, you must use single user clusters for Structured Streaming on Unity Catalog. Python and Scala are supported.

To walk through an end-to-end demo using Structured Streaming on Unity Catalog, see Run your first end-to-end analytics pipeline in the Databricks Lakehouse.

What Structured Streaming functionality is disabled on Unity Catalog?

Unity Catalog does not support some Structured Streaming features, including:

  • Continuous streaming mode

  • Asynchronous checkpointing

  • Using display() with Structured Streaming queries

How to structure your checkpoint and table files for Unity Catalog

Unity Catalog does not allow you to nest checkpoint files under the table directory. Databricks recommends storing Structured Streaming checkpoints in Unity Catalog external locations. Because Unity Catalog always checks permissions to access external tables using the cloud URI, you can safely store checkpoint data adjacent to table data in external locations. For example:

s3://my-bucket/table1/
s3://my-bucket/table1.checkpoint/
s3://my-bucket/table2/
s3://my-bucket/table2.checkpoint/
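
A streaming write that follows this layout might look like the following sketch. The source table name is a placeholder; the target and checkpoint paths reuse the example URIs above.

(spark.readStream
    .table("main.default.raw_events")  # placeholder source table
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/table1.checkpoint/")  # adjacent to the table data
    .start("s3://my-bucket/table1/")  # external table path
)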

Caveats and recommendations for long-running streams

Unity Catalog introduces new models for managing temporary tokens to access data sources and sinks. Because all tokens are ephemeral, long-running Structured Streaming workloads will fail after a set period of time.

Streams run on all-purpose or job clusters will fail after 30 days of continuous execution. For long-running workloads, Databricks recommends running Structured Streaming queries in jobs configured to retry automatically on failure. For more details, see Configure Structured Streaming jobs to restart streaming queries on failure.
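
As a hedged sketch of that recommendation, a job with automatic retries could be created with the Databricks SDK for Python. The notebook path, cluster ID, and retry values below are illustrative placeholders, not settings from this document.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="long-running-stream",
    max_concurrent_runs=1,  # avoid concurrent runs competing for the same checkpoint
    tasks=[
        jobs.Task(
            task_key="run_stream",
            notebook_task=jobs.NotebookTask(notebook_path="/path/to/streaming_notebook"),  # placeholder
            existing_cluster_id="<single-user-cluster-id>",  # placeholder
            max_retries=-1,  # retry indefinitely on failure
            min_retry_interval_millis=60000,
            retry_on_timeout=True,
        )
    ],
)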