Read data shared using Databricks-to-Databricks Delta Sharing
This article describes how to read data that has been shared with you using the Databricks-to-Databricks Delta Sharing protocol, in which Databricks manages a secure connection for data sharing. Unlike the Delta Sharing open sharing protocol, the Databricks-to-Databricks protocol does not require a credential file (token-based security).
Note
If data has been shared with you using the Delta Sharing open sharing protocol, see Read data shared using Delta Sharing open sharing.
How do I make shared data available to my team?
To read data that has been shared with you using the Databricks-to-Databricks protocol, you must be a user on a Databricks workspace that is enabled for Unity Catalog. A member of your team provides the data provider with a unique identifier for your Databricks workspace, and the data provider uses that identifier to create a secure sharing connection with your organization. The shared data then becomes available for read access in your workspace, and any updates that the data provider makes to the shared tables and partitions are reflected in your workspace in near real time.
Note
Updates to shared data tables appear in your workspace in near real time. However, column changes (adding, renaming, deleting) may not display in Data Explorer for up to one minute. Likewise, new shares and updates to shares (such as adding new tables to a share) are cached for one minute before they are available for you to view and query.
To read data that has been shared with you:
You or another user on your team finds the share (the container for the tables that have been shared with you) and uses that share to create a catalog (the top-level container for all data in Databricks Unity Catalog).
You or another user on your team grants or denies access to the catalog and the objects inside the catalog (schemas and tables) to other members of your team.
You read the data in the tables that you have been granted access to just like any other table in Databricks that you have read-only (SELECT) access to.
View providers and shares
To start reading the data that has been shared with you by a data provider, you need to know the names of the provider and share objects that are stored in your Unity Catalog metastore once the provider has shared data with you.
The provider object represents the Unity Catalog metastore, cloud platform, and region of the organization that shared the data with you.
The share object represents the tables that the provider has shared with you.
View all providers who have shared data with you
To view a list of available data providers, you can use Data Explorer, the Databricks Unity Catalog CLI, or the SHOW PROVIDERS SQL command in a Databricks notebook or the Databricks SQL query editor.
For details, see View providers.
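For example, run the following in a notebook or the Databricks SQL query editor:

SHOW PROVIDERS;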
View provider details
To view details about a provider, you can use Data Explorer, the Databricks Unity Catalog CLI, or the DESCRIBE PROVIDER SQL command in a Databricks notebook or the Databricks SQL query editor.
For details, see View provider details.
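For example, replacing <provider-name> with a provider name returned by SHOW PROVIDERS:

DESCRIBE PROVIDER <provider-name>;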
View shares
To view the shares that a provider has shared with you, you can use Data Explorer, the Databricks Unity Catalog CLI, or the SHOW SHARES IN PROVIDER SQL command in a Databricks notebook or the Databricks SQL query editor.
For details, see View shares that a provider has shared with you.
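For example:

SHOW SHARES IN PROVIDER <provider-name>;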
Access data in a shared table
To read data in a shared table:
A privileged user must create a catalog from the share that contains the table. This can be a metastore admin or a user who has both the CREATE_CATALOG privilege for your Unity Catalog metastore and ownership of the provider object.
That user, or a user with the same privileges, must grant you access to the shared table.
You can access the table just as you would any other table registered in your Unity Catalog metastore.
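Putting these steps together, the following is a minimal SQL sketch; the provider, share, and principal names are hypothetical, and the table name matches the examples later in this article:

-- Step 1: a metastore admin or privileged user creates a catalog from the share.
CREATE CATALOG IF NOT EXISTS vaccine USING SHARE acme_provider.vaccine_share;

-- Step 2: that user grants you read access to the catalog.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG vaccine TO `you@example.com`;

-- Step 3: you query the shared table like any other Unity Catalog table.
SELECT * FROM vaccine.vaccine_us.vaccine_us_distribution;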
Create a catalog from a share
To access the data in a share, you must create a catalog from the share. To create a catalog from a share, you can use Data Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor.
Permissions required: Metastore admin or a user who has both the CREATE_CATALOG privilege for your Unity Catalog metastore and ownership of the provider object.
In your Databricks workspace, click Data.
In the left pane, expand the Delta Sharing menu and select Shared with me.
On the Providers tab, select the provider.
On the Shares tab, find the share and click Create catalog on the share row.
Enter a name for the catalog and optional comment.
Click Create.
Run the following command in a notebook or the Databricks SQL query editor.
CREATE CATALOG [IF NOT EXISTS] <catalog-name>
USING SHARE <provider_name>.<share_name>;
databricks unity-catalog catalogs create --name <catalog_name> \
  --provider <provider_name> \
  --share <share_name>
The catalog created from a share has a catalog type of Delta Sharing. You can view the type on the catalog details page in Data Explorer or by running the DESCRIBE CATALOG SQL command in a notebook or Databricks SQL query.
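For example, assuming a catalog named vaccine:

DESCRIBE CATALOG vaccine;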
A Delta Sharing catalog can be managed in the same way as regular catalogs on a Unity Catalog metastore. You can view, update, and delete a Delta Sharing catalog using Data Explorer, the Databricks CLI, or the SHOW CATALOGS, DESCRIBE CATALOG, ALTER CATALOG, and DROP CATALOG SQL commands.
The 3-level namespace structure under a Delta Sharing catalog created from a share is the same as the one under a regular catalog on Unity Catalog: catalog.schema.table.
Table data under a shared catalog is read-only, which means you can perform read operations like DESCRIBE, SHOW, and SELECT.
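For example, assuming a shared catalog named vaccine (the hypothetical catalog used in the examples below), the following read operations succeed, while INSERT, UPDATE, and DELETE statements fail:

DESCRIBE TABLE vaccine.vaccine_us.vaccine_us_distribution;
SELECT * FROM vaccine.vaccine_us.vaccine_us_distribution LIMIT 10;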
Manage permissions for the schemas and tables in a Delta Sharing catalog
By default, the catalog creator is the owner of all data objects under a Delta Sharing catalog and can manage permissions for any of them.
Privileges are inherited downward, although some workspaces may still be on the legacy security model that did not provide inheritance. See Inheritance model. Any user granted the SELECT privilege on the catalog will have the SELECT privilege on all of the schemas and tables in the catalog unless that privilege is revoked. You cannot grant privileges that give write or update access to a Delta Sharing catalog or objects in a Delta Sharing catalog.
The catalog owner can delegate the ownership of data objects to other users or groups, thereby granting those users the ability to manage the object permissions and life cycles.
For detailed information about managing privileges on data objects using Unity Catalog, see Manage privileges in Unity Catalog.
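For example, a sketch of typical grants, assuming a Delta Sharing catalog named vaccine and a group named analysts (both hypothetical):

-- Grant read access to everything in the catalog; SELECT is inherited downward.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG vaccine TO `analysts`;

-- Delegate ownership of a schema so the group can manage its permissions.
ALTER SCHEMA vaccine.vaccine_us OWNER TO `analysts`;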
Read data in a shared table
You can read data in a shared table using any of the tools available to you as a Databricks user: Data Explorer, notebooks, SQL queries, the Databricks CLI, and Databricks REST APIs.
Query a table’s history data
If history is shared along with the table, you can query the table data as of a version or timestamp. Requires Databricks Runtime 12.1 or above.
For example:
SELECT * FROM vaccine.vaccine_us.vaccine_us_distribution VERSION AS OF 3;
SELECT * FROM vaccine.vaccine_us.vaccine_us_distribution TIMESTAMP AS OF "2023-01-01 00:00:00";
In addition, if the change data feed (CDF) is enabled on the table, you can query the CDF. Both version and timestamp are supported:
SELECT * FROM table_changes('vaccine.vaccine_us.vaccine_us_distribution', 0, 3);
SELECT * FROM table_changes('vaccine.vaccine_us.vaccine_us_distribution', "2023-01-01 00:00:00", "2023-02-01 00:00:00");
For more information about change data feed, see Use Delta Lake change data feed on Databricks.
Query a table using Apache Spark Structured Streaming
If a table is shared with history, you can use it as the source for Spark Structured Streaming. Requires Databricks Runtime 12.1 or above.
Supported options:
ignoreDeletes: Ignore transactions that delete data.
ignoreChanges: Re-process updates if files were rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows can still be emitted. Therefore, your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore, if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.
startingVersion: The shared table version to start from. All table changes starting from this version (inclusive) will be read by the streaming source.
startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Example: "2023-01-01 00:00:00.0"
maxFilesPerTrigger: The number of new files to be considered in every micro-batch.
maxBytesPerTrigger: The amount of data that gets processed in each micro-batch. This option sets a "soft max", meaning that a batch processes approximately this amount of data and might process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit.
readChangeFeed: Stream read the change data feed of the shared table.
spark.readStream.format("deltaSharing")
.option("startingVersion", 0)
.option("ignoreChanges", true)
.option("maxFilesPerTrigger", 10)
.table("vaccine.vaccine_us.vaccine_us_distribution")
spark.readStream.format("deltaSharing")\
.option("startingVersion", 0)\
.option("ignoreDeletes", true)\
.option("maxBytesPerTrigger", 10000)\
.table("vaccine.vaccine_us.vaccine_us_distribution")
If change data feed (CDF) is enabled on the table, you can stream read the CDF.
spark.readStream.format("deltaSharing")
.option("readChangeFeed", "true")
.table("vaccine.vaccine_us.vaccine_us_distribution")