Enable external data access to streaming tables and materialized views
This feature is in Public Preview.
If you have enabled external data access to Unity Catalog, then you can additionally add external data access to your pipeline datasets. This enables external Delta and Iceberg clients to access your datasets through the Unity Catalog and Iceberg catalog REST APIs, without requiring a full data copy.
External data access for pipeline datasets works for Lakeflow Spark Declarative Pipelines.
Capabilities
Using external data access for pipeline datasets exposes the same data available in Databricks, without creating a duplicate of the data. This gives the following characteristics for performance and functionality:
- No data copy required: External access is enabled without duplicating the full dataset.
- External access via APIs: Read materialized views and streaming tables using Delta Lake or Iceberg APIs.
- Read-after-write consistency: External readers see up-to-date data as soon as a refresh of the dataset completes. Updates are available immediately upon refresh.
- Single table object: Datasets appear externally as managed tables with the same name as the source dataset within Unity Catalog APIs.
- Low cost: Because the full dataset is not copied, the overhead for providing external access is low.
Requirements
The requirements for your datasets are:
- External access must be enabled on the schema: Your workspace must be enrolled in the External data access for pipeline datasets public preview, and external data access must be enabled on the schema that contains your datasets. See Enable external data access to Unity Catalog.
- Unity Catalog: Your streaming tables and materialized views must be using Unity Catalog.
- Databricks Runtime version: You must be using Databricks Runtime 17.3 or above.
The requirements for your clients are:
- Delta API version: The client must support Delta Lake APIs 4.0.0 or above, including deletion vectors, and must use the Unity Catalog catalog APIs for access.
- Iceberg API version: Alternatively, the client can access using Iceberg catalog APIs that support the Iceberg v3 specification.
- Unity Catalog privileges: The principal reading the datasets externally must have the EXTERNAL USE SCHEMA privilege on the schema and the SELECT privilege on the table.
If your client does not meet these requirements, you can use compatibility mode instead. Compatibility mode supports all Delta and Iceberg clients, but requires creating a full copy of the dataset.
How to enable access for a dataset
There are three steps to enable external access for a dataset.
-
In your dataset definition, add the following TBLPROPERTIES. This is only required for Iceberg v3 readers. If you only have Delta readers, you can skip this step.
Property | Use
'delta.columnMapping.mode' = 'name' | Column mapping is required for Iceberg.
'delta.universalFormat.enabledFormats' = 'iceberg' | Enable UniForm for Iceberg.
'delta.enableIcebergCompatV3' = 'true' | Use Iceberg V3 for UniForm.
'delta.enableChangeDataFeed' = 'false' | Change data feed is not compatible with external access, so this must be false.
For example, you can update the definition of a materialized view in Lakeflow Spark Declarative Pipelines by adding the following TBLPROPERTIES to your query:
SQL
CREATE OR REFRESH MATERIALIZED VIEW view_name
TBLPROPERTIES(
...
'delta.columnMapping.mode' = 'name',
'delta.enableIcebergCompatV3' = 'true',
'delta.universalFormat.enabledFormats' = 'iceberg',
'delta.enableChangeDataFeed' = 'false')
...
To see the properties of your dataset, you can use the DESCRIBE EXTENDED SQL statement.
-
Apply the Iceberg properties to the pipeline. This is only required for Iceberg v3 readers. If you only have Delta readers, you can skip this step.
- Triggered pipelines: Run the pipeline once.
- Continuous pipelines: Stop and restart the pipeline.
-
In your pipeline configuration, set pipelines.externalMetadata.enabled to true.
Pipeline settings UI:
- Open your pipeline and click Settings.
- Under Configuration, add a key-value pair: Key pipelines.externalMetadata.enabled, Value true.
- Click Save.
Pipeline configuration JSON:
In the configuration section of your pipeline JSON, add:
JSON
{
  "configuration": {
    "pipelines.externalMetadata.enabled": "true"
  }
}
After saving the configuration, run or restart the pipeline to apply the changes:
- Triggered pipelines: Run the pipeline once.
- Continuous pipelines: Stop and restart the pipeline.
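The steps above can be sketched as a small pre-flight helper. This is a minimal sketch and not part of any Databricks API: the function names iceberg_property_problems and enable_external_metadata are hypothetical, and the sketch assumes you have already collected the dataset's table properties (for example, from the output of DESCRIBE EXTENDED) into a dictionary and loaded your pipeline settings JSON into another. The property names and expected values come from the table in step 1, and the configuration key comes from step 3.

```python
import copy
import json

# Required TBLPROPERTIES for Iceberg v3 readers, taken from step 1 above.
REQUIRED_ICEBERG_PROPERTIES = {
    "delta.columnMapping.mode": "name",
    "delta.universalFormat.enabledFormats": "iceberg",
    "delta.enableIcebergCompatV3": "true",
    "delta.enableChangeDataFeed": "false",
}

# Pipeline configuration key from step 3 above.
EXTERNAL_METADATA_KEY = "pipelines.externalMetadata.enabled"


def iceberg_property_problems(table_properties):
    """Return one message per required property that is missing or has the wrong value.

    Hypothetical helper: `table_properties` is a plain dict of property name to
    string value that you collected yourself.
    """
    problems = []
    for key, expected in REQUIRED_ICEBERG_PROPERTIES.items():
        actual = table_properties.get(key)
        if actual is None:
            problems.append(f"{key} is not set (expected '{expected}')")
        elif str(actual).lower() != expected:
            problems.append(f"{key} is '{actual}' (expected '{expected}')")
    return problems


def enable_external_metadata(pipeline_settings):
    """Return a copy of the pipeline settings dict with external metadata enabled."""
    updated = copy.deepcopy(pipeline_settings)
    # Create the "configuration" section if the pipeline does not have one yet.
    updated.setdefault("configuration", {})[EXTERNAL_METADATA_KEY] = "true"
    return updated


if __name__ == "__main__":
    # Illustrative input only: change data feed is still enabled here.
    props = {"delta.columnMapping.mode": "name", "delta.enableChangeDataFeed": "true"}
    for problem in iceberg_property_problems(props):
        print(problem)
    print(json.dumps(enable_external_metadata({"name": "my_pipeline"}), indent=2))
```

The property check only matters for Iceberg v3 readers. If you only have Delta readers, the configuration change alone applies.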
Reading data from external clients
The following sections describe how to read your dataset from different clients and environments.
Use Unity REST API with the Spark Delta Reader
Use Apache Spark™ version 4.0 or later. You can download it from https://spark.apache.org/downloads.html.
-
Based on your cloud provider, run the following command to start a Spark SQL shell with Delta 4.0 and Unity Catalog.
AWS:
Shell
bin/spark-sql \
  --packages org.apache.spark:spark-hadoop-cloud_2.13:4.0.0,io.unitycatalog:unitycatalog-spark_2.13:0.3.1 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.sql.catalog.<uc-catalog-name>=io.unitycatalog.spark.UCSingleCatalog \
  --conf spark.sql.catalog.<uc-catalog-name>.uri=<workspace_url> \
  --conf spark.sql.catalog.<uc-catalog-name>.token=<PAT> \
  --conf spark.sql.defaultCatalog=<uc-catalog-name>
Azure:
Shell
bin/spark-sql \
  --packages org.apache.hadoop:hadoop-azure:3.3.6,io.unitycatalog:unitycatalog-spark_2.13:0.3.1 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog \
  --conf spark.sql.catalog.<uc-catalog-name>=io.unitycatalog.spark.UCSingleCatalog \
  --conf spark.sql.catalog.<uc-catalog-name>.uri=<workspace_url> \
  --conf spark.sql.catalog.<uc-catalog-name>.token=<PAT> \
  --conf spark.sql.defaultCatalog=<uc-catalog-name>
GCP:
Shell
bin/spark-sql \
  --packages io.unitycatalog:unitycatalog-spark_2.13:0.3.1 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.sql.catalog.<uc-catalog-name>=io.unitycatalog.spark.UCSingleCatalog \
  --conf spark.sql.catalog.<uc-catalog-name>.uri=<workspace_url> \
  --conf spark.sql.catalog.<uc-catalog-name>.token=<PAT> \
  --conf spark.sql.defaultCatalog=<uc-catalog-name>
-
From the SQL shell, you can now access your dataset with Spark SQL. For example:
Shell
spark-sql ()> SELECT * FROM <uc-catalog>.<uc-schema>.<uc-table-name>;
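The per-cloud commands above share the same Unity Catalog settings and differ only in their packages and file system options. As a minimal sketch, the following Python helper assembles the shared --conf arguments so that the catalog name, workspace URL, and token are substituted consistently. The function name unity_catalog_confs and the DATABRICKS_TOKEN environment variable are assumptions made for illustration. The configuration keys and values are copied from the commands above.

```python
import os


def unity_catalog_confs(catalog, workspace_url, token):
    """Build the --conf arguments shared by the AWS, Azure, and GCP commands above.

    Hypothetical helper: cloud-specific --packages and file system options
    are not included and must be added by the caller.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    confs = {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "io.unitycatalog.spark.UCSingleCatalog",
        prefix: "io.unitycatalog.spark.UCSingleCatalog",
        f"{prefix}.uri": workspace_url,
        f"{prefix}.token": token,
        "spark.sql.defaultCatalog": catalog,
    }
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Reading the token from the environment avoids hard-coding it in the script.
    # DATABRICKS_TOKEN is an assumed variable name, not one the commands above require.
    token = os.environ.get("DATABRICKS_TOKEN", "<PAT>")
    args = unity_catalog_confs("<uc-catalog-name>", "<workspace_url>", token)
    # Print one "--conf key=value" pair per line for inspection.
    for i in range(0, len(args), 2):
        print(args[i], args[i + 1])
```

The returned list can be appended to bin/spark-sql together with the --packages and file system options for your cloud provider.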
Use the Snowflake Iceberg Reader
Within Snowflake, you can use the Iceberg Reader. This requires Iceberg v3 support, which is currently in private preview in Snowflake.
-
Set up the Iceberg REST catalog in Apache Spark.
Shell
bin/spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0,org.apache.iceberg:iceberg-aws-bundle:1.8.0 \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf spark.sql.catalog.<uc-catalog-name>=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.<uc-catalog-name>.type=rest \
  --conf spark.sql.catalog.<uc-catalog-name>.uri=<workspace-url>/api/2.1/unity-catalog/iceberg-rest \
  --conf spark.sql.catalog.<uc-catalog-name>.token=<PAT> \
  --conf spark.sql.catalog.<uc-catalog-name>.warehouse=<uc-catalog-name>
-
Set up the Iceberg REST catalog in Snowflake.
SQL
CREATE OR REPLACE CATALOG INTEGRATION my_uc_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = '<uc-schema-name>'
  REST_CONFIG = (
    CATALOG_URI = '<workspace-url>/api/2.1/unity-catalog/iceberg-rest'
    CATALOG_NAME = '<uc-catalog-name>'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<PAT>'
  )
  ENABLED = TRUE;
CREATE OR REPLACE ICEBERG TABLE my_table
  CATALOG = 'my_uc_int'
  CATALOG_TABLE_NAME = '<uc-table-name>';
-
Access your dataset from Spark SQL.
Shell
spark-sql ()> SELECT * FROM <uc-catalog>.<uc-schema>.<uc-table-name>;
Use the Iceberg REST catalog with Spark Iceberg reader
Use Apache Spark™ version 4.0 or later. You can download it from https://spark.apache.org/downloads.html.
-
In AWS, run the following command to start a Spark SQL shell with Iceberg v3.
Shell
bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.iceberg:iceberg-aws-bundle:1.10.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.<uc-catalog-name>=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.<uc-catalog-name>.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.<uc-catalog-name>.type=rest \
  --conf spark.sql.catalog.<uc-catalog-name>.uri=<workspace_url>/api/2.1/unity-catalog/iceberg-rest \
  --conf spark.sql.catalog.<uc-catalog-name>.token='<PAT>' \
  --conf spark.sql.catalog.<uc-catalog-name>.warehouse=<uc-catalog-name> \
  --conf spark.sql.iceberg.vectorization.enabled=false
-
Access your dataset from Spark SQL.
Shell
spark-sql ()> SELECT * FROM <uc-catalog>.<uc-schema>.<uc-table-name>;
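The Iceberg examples above all point the client at the same endpoint path, /api/2.1/unity-catalog/iceberg-rest, appended to the workspace URL. The following is a rough sketch, under stated assumptions, of deriving the Iceberg REST catalog settings from a workspace URL in Python. The function name iceberg_rest_confs is hypothetical, and the AWS-specific io-impl setting and the vectorization flag from the command above are deliberately left out.

```python
ICEBERG_REST_PATH = "/api/2.1/unity-catalog/iceberg-rest"


def iceberg_rest_confs(catalog, workspace_url, token):
    """Return the Spark catalog settings used by the Iceberg reader command above.

    Hypothetical helper: only the cloud-independent settings are returned.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        # Strip a trailing slash so the joined URI has no double slash.
        f"{prefix}.uri": workspace_url.rstrip("/") + ICEBERG_REST_PATH,
        f"{prefix}.token": token,
        # The warehouse is the Unity Catalog catalog name, as in the command above.
        f"{prefix}.warehouse": catalog,
    }


if __name__ == "__main__":
    confs = iceberg_rest_confs("<uc-catalog-name>", "<workspace_url>", "<PAT>")
    for key, value in confs.items():
        print(f"--conf {key}={value}")
```

Each key-value pair maps to one --conf argument of the spark-sql command.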
Migrate from compatibility mode
If you are currently sharing a dataset using compatibility mode, you can migrate to using external data access.
- Enable this feature following the steps in How to enable access for a dataset.
- Disable compatibility mode. See Disable Compatibility Mode.
Limitations
The following are known limitations with external data access for streaming tables and materialized views.
- External Writes: External writes to pipeline datasets are not supported.
- Path-Based Access: External readers that require path-based access (reading directly through a storage location instead of the UC API interface) are not supported. If you need path-based access, use compatibility mode, which supports it but requires a full copy of the dataset.
- Security Features: Row-level security and column-level masking are not supported for external reads.
- Time Travel or CDF: Time travel and change data feed (CDF) are not supported through this feature. CDF must be disabled when UniForm Iceberg is enabled.
- Catalog commits (beta): Catalog commits are not compatible with external data access. To use external data access on a streaming table, you must first disable catalog commits. Catalog commits are not available for materialized views.
- Ingestion pipelines: Streaming tables created with Lakeflow Connect do not support enabling Iceberg table properties, and are only available with Delta readers.
- Fabric: Reading from Microsoft Fabric is not supported.
- Snowflake Iceberg reader: You must be using the Iceberg v3 reader in Snowflake (Private Preview) to read pipeline datasets.
- Standalone MVs and STs: This feature is only supported for materialized views and streaming tables managed by a pipeline. Standalone materialized views and streaming tables are not supported. Contact your Databricks account team if you need external access for standalone materialized views and streaming tables.