Read data shared using Delta Sharing open sharing

This article describes how to read data that has been shared with you using the Delta Sharing open sharing protocol. In open sharing, you use a credential file that was shared with a member of your team by the data provider to gain secure read access to shared data. Access persists until the provider stops sharing the data with you. Updates to the data are available to you in near real time. You can read and make copies of the shared data, but you can’t modify the source data.

Note

If data has been shared with you using Databricks-to-Databricks Delta Sharing, you don’t need a credential file to access data. For instructions, see Read data shared using Databricks-to-Databricks Delta Sharing.

The sections that follow describe how to access and read shared data using the credential file in Databricks, Apache Spark, pandas, and Power BI. For a full list of Delta Sharing connectors and information about how to use them, see the Delta Sharing open source documentation. If you run into trouble accessing the shared data, contact the data provider.

Note

Partner integrations are, unless otherwise noted, provided by the third parties and you must have an account with the appropriate provider for the use of their products and services. While Databricks does its best to keep this content up to date, we make no representation regarding the integrations or the accuracy of the content on the partner integration pages. Reach out to the appropriate providers regarding the integrations.

Requirements

A member of your team must download the credential file shared by the data provider. See Get access in the open sharing model. They should use a secure channel to share that file or file location with you.

Databricks: Read shared data using Unity Catalog

If your workspace is enabled for Unity Catalog, you can use the Databricks Unity Catalog CLI to create a provider object in your Unity Catalog metastore. You can then manage and access the shared data using the same procedures as a recipient whose shares were made available to them using Databricks-to-Databricks sharing.

Note

If your workspace is enabled for Unity Catalog and the data provider shared data with you without providing a credential file, that provider is likely using Databricks-to-Databricks sharing, and you should go instead to the instructions in Read data shared using Databricks-to-Databricks Delta Sharing.

Permissions required: Metastore admin or user with the CREATE_PROVIDER privilege for the metastore.

To create the provider, you must have access to the downloaded credential file. See Get access in the open sharing model.

Run the following command using the Databricks CLI, replacing <provider_name> with the name you want to give to the provider and config.share with the path to your downloaded credential file.

databricks unity-catalog providers create --name <provider_name>  \
                  --recipient-profile-json-file config.share

You can now view the provider, view the shares the provider has shared with you, and access data in those shares using Data Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor, without having to reference a credentials file directly. For detailed instructions, see the following:

Databricks: Read shared data using open sharing connectors

This section describes how to use an open sharing connector to access shared data using a notebook in your Databricks workspace. You or another member of your team store the credential file in DBFS, then you use it to authenticate to the data provider’s Databricks account and read the data that the data provider shared with you.

Note

There are two other options for reading shared data using Databricks:

In this example, you create a notebook with multiple cells that you can run independently. You could instead add the notebook commands to the same cell and run them in sequence.

Step 1: Store the credential file in DBFS (Python instructions)

In this step, you use a Python notebook in Databricks to store the credential file so that users on your team can access shared data.

Skip to the next step if you or someone on your team has already stored the credential file in DBFS.

  1. In a text editor, open the credential file.

  2. In your Databricks workspace, click New > Notebook.

    • Enter a name.

    • Set the default language for the notebook to Python.

    • Select a cluster to attach to the notebook. Select a cluster that runs Databricks Runtime 8.4 or above or a cluster with the Apache Spark connector library installed. For more information about installing cluster libraries, see Libraries.

    • Click Create.

    The notebook opens in the notebook editor.

  3. To use Python or pandas to access the shared data, install the delta-sharing Python connector. In the notebook editor, paste the following command:

    %sh pip install delta-sharing
    
  4. Run the cell.

    The delta-sharing Python library gets installed in the cluster if it isn’t already installed.

  5. In a new cell, paste the following command, which uploads the contents of the credential file to a folder in DBFS. Replace the variables as follows:

    • <dbfs-path>: the path to the folder where you want to save the credential file

    • <credential-file-contents>: the contents of the credential file. This is not a path to the file, but the copied contents of the file.

      The credential file contains JSON that defines three fields: shareCredentialsVersion, endpoint, and bearerToken.

      %scala
      dbutils.fs.put("<dbfs-path>/config.share","""
      <credential-file-contents>
      """)
      
  6. Run the cell.

    After the credential file is uploaded, you can delete this cell. All workspace users can read the credential file from DBFS, and the credential file is available in DBFS on all clusters and SQL warehouses in your workspace. To delete the cell, click x in the cell actions menu Cell actions at the far right.

Step 2: Use a notebook to list and read shared tables

In this step, you list the tables in the share, or set of shared tables and partitions, and you query a table.

  1. Using Python, list the tables in the share.

    In a new cell, paste the following command. Replace <dbfs-path> with the path that was created in Step 1: Store the credential file in DBFS (Python instructions).

    When the code runs, Python reads the credential file from DBFS on the cluster. Access data stored in DBFS at the path /dbfs/.

    import delta_sharing
    
    client = delta_sharing.SharingClient(f"/dbfs/<dbfs-path>/config.share")
    
    client.list_all_tables()
    
  2. Run the cell.

    The result is an array of tables, along with metadata for each table. The following output shows two tables:

    Out[10]: [Table(name='example_table', share='example_share_0', schema='default'), Table(name='other_example_table', share='example_share_0', schema='default')]
    

    If the output is empty or doesn’t contain the tables you expect, contact the data provider.

  3. Query a shared table.

    • Using Scala:

      In a new cell, paste the following command. When the code runs, the credential file is read from DBFS through the JVM.

      Replace the variables as follows:

      • <profile_path>: the DBFS path of the credential file. For example, /<dbfs-path>/config.share.

      • <share_name>: the value of share= for the table.

      • <schema_name>: the value of schema= for the table.

      • <table_name>: the value of name= for the table.

      %scala
          spark.read.format("deltaSharing")
          .load("<profile_path>#<share_name>.<schema_name>.<table_name>").limit(10);
      

      Run the cell. Each time you load the shared table, you see fresh data from the source.

    • Using SQL:

      To query the data using SQL, you create a local table in the workspace from the shared table, then query the local table. The shared data is not stored or cached in the local table. Each time you query the local table, you see the current state of the shared data.

      In a new cell, paste the following command.

      Replace the variables as follows:

      • <local_table_name>: the name of the local table.

      • <profile_path>: the location of the credential file.

      • <share_name>: the value of share= for the table.

      • <schema_name>: the value of schema= for the table.

      • <table_name>: the value of name= for the table.

      %sql
      DROP TABLE IF EXISTS table_name;
      
      CREATE TABLE <local_table_name> USING deltaSharing LOCATION "<profile_path>#<share_name>.<schema_name>.<table_name>";
      
      SELECT * FROM <local_table_name> LIMIT 10;
      

      When you run the command, the shared data is queried directly. As a test, the table is queried and the first 10 results are returned.

    If the output is empty or doesn’t contain the data you expect, contact the data provider.

Apache Spark: Read shared data

Follow these steps to access shared data using Spark 3.x or above.

These instructions assume that you have access to the credential file that was shared by the data provider. See Get access in the open sharing model.

Install the Delta Sharing Python and Spark connectors

To access metadata related to the shared data, such as the list of tables shared with you, do the following. This example uses Python.

  1. Install the delta-sharing Python connector:

    pip install delta-sharing
    
  2. Install the Apache Spark connector.

List shared tables using Spark

List the tables in the share. In the following example, replace <profile_path> with the location of the credential file.

import delta_sharing

client = delta_sharing.SharingClient(f"<profile_path>/config.share")

client.list_all_tables()

The result is an array of tables, along with metadata for each table. The following output shows two tables:

Out[10]: [Table(name='example_table', share='example_share_0', schema='default'), Table(name='other_example_table', share='example_share_0', schema='default')]

If the output is empty or doesn’t contain the tables you expect, contact the data provider.

Access shared data using Spark

Run the following, replacing these variables:

  • <profile_path>: the location of the credential file.

  • <share_name>: the value of share= for the table.

  • <schema_name>: the value of schema= for the table.

  • <table_name>: the value of name= for the table.

  • <version_as_of>: optional. The version of the table to load the data. Only works if the data provider shares the CDF of the table.

delta_sharing.load_as_spark(f"<profile_path>#<share_name>.<schema_name>.<table_name>", version=<version_as_of>)

spark.read.format("deltaSharing")\
.option("versionAsOf", <version_as_of>)\
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")\
.limit(10))

Run the following, replacing these variables:

  • <profile_path>: the location of the credential file.

  • <share_name>: the value of share= for the table.

  • <schema_name>: the value of schema= for the table.

  • <table_name>: the value of name= for the table.

  • <version_as_of>: optional. The version of the table to load the data. Only works if the data provider shares the CDF of the table.

spark.read.format("deltaSharing")
.option("versionAsOf", <version_as_of>)
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")
.limit(10)

Access a shared change data feed using Spark

To access the change data feed for a shared table, if the change data feed has been shared with you, run the following, replacing these variables.

  • <profile_path>: the location of the credential file.

  • <share_name>: the value of share= for the table.

  • <schema_name>: the value of schema= for the table.

  • <table_name>: the value of name= for the table.

  • <starting_version>: required. The starting version of the query, inclusive. Specify as a Long.

  • <ending_version>: optional. The ending version of the query, inclusive. If the ending version is not provided, the API uses the latest table version.

  • <starting_timestamp>: required. The starting timestamp of the query, this is converted to a version created greater or equal to this timestamp. Specify as a string in the format yyyy-mm-dd hh:mm:ss[.fffffffff].

  • <ending_timestamp>: optional. The ending timestamp of the query, this is converted to a version created earlier or equal to this timestamp. Specify as a string in the format yyyy-mm-dd hh:mm:ss[.fffffffff]

delta_sharing.load_table_changes_as_spark(f"<profile_path>#<share_name>.<schema_name>.<table_name>",
  starting_version=<starting_version>,
  ending_version=<ending_version>)

delta_sharing.load_table_changes_as_spark(f"<profile_path>#<share_name>.<schema_name>.<table_name>",
  starting_timestamp=<starting_timestamp>,
  ending_timestamp=<ending_timestamp>)

spark.read.format("deltaSharing").option("readChangeFeed", "true")\
.option("statingVersion", <starting_version>)\
.option("endingVersion", <ending_version>)\
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")

spark.read.format("deltaSharing").option("readChangeFeed", "true")\
.option("startingTimestamp", <starting_timestamp>)\
.option("endingTimestamp", <ending_timestamp>)\
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")
spark.read.format("deltaSharing").option("readChangeFeed", "true")
.option("statingVersion", <starting_version>)
.option("endingVersion", <ending_version>)
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")

spark.read.format("deltaSharing").option("readChangeFeed", "true")
.option("startingTimestamp", <starting_timestamp>)
.option("endingTimestamp", <ending_timestamp>)
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")

If the output is empty or doesn’t contain the data you expect, contact the data provider.

Pandas: Read shared data

Follow these steps to access shared data in pandas 0.25.3 or above.

These instructions assume that you have access to the credential file that was shared by the data provider. See Get access in the open sharing model.

Install the Delta Sharing Python connector

To access metadata related to the shared data, such as the list of tables shared with you, you must install the delta-sharing Python connector.

pip install delta-sharing

List shared tables using pandas

To list the tables in the share, run the following, replacing <profile_path>/config.share with the location of the credential file.

import delta_sharing

client = delta_sharing.SharingClient(f"<profile_path>/config.share")

client.list_all_tables()

If the output is empty or doesn’t contain the tables you expect, contact the data provider.

Access shared data using pandas

To access shared data in pandas using Python, run the following, replacing the variables as follows:

  • <profile_path>: the location of the credential file.

  • <share_name>: the value of share= for the table.

  • <schema_name>: the value of schema= for the table.

  • <table_name>: the value of name= for the table.

import delta_sharing
delta_sharing.load_as_pandas(f"<profile_path>#<share_name>.<schema_name>.<table_name>")

Access a shared change data feed using pandas

To access the change data feed for a shared table in pandas using Python run the following, replacing the variables as follows. A change data feed may not be available, depending on whether or not the data provider shared the change data feed for the table.

  • <starting_version>: optional. The starting version of the query, inclusive.

  • <ending_version>: optional. The ending version of the query, inclusive.

  • <starting_timestamp>: optional. The starting timestamp of the query. This is converted to a version created greater or equal to this timestamp.

  • <ending_timestamp>: optional. The ending timestamp of the query. This is converted to a version created earlier or equal to this timestamp.

import delta_sharing
delta_sharing.load_table_changes_as_pandas(
  f"<profile_path>#<share_name>.<schema_name>.<table_name>",
  starting_version=<starting_version>,
  ending_version=<starting_version>)

delta_sharing.load_table_changes_as_pandas(
  f"<profile_path>#<share_name>.<schema_name>.<table_name>",
  starting_timestamp=<starting_timestamp>,
  ending_timestamp=<ending_timestamp>)

If the output is empty or doesn’t contain the data you expect, contact the data provider.

Power BI: Read shared data

The Power BI Delta Sharing connector allows you to discover, analyze, and visualize datasets shared with you through the Delta Sharing open protocol.

Requirements

Connect to Databricks

To connect to Databricks using the Delta Sharing connector, do the following:

  1. Open the shared credential file with a text editor to retrieve the endpoint URL and the token.

  2. Open Power BI Desktop.

  3. On the Get Data menu, search for Delta Sharing.

  4. Select the connector and click Connect.

  5. Enter the endpoint URL that you copied from the credentials file into the Delta Sharing Server URL field.

  6. Optionally, in the Advanced Options tab, set a Row Limit for the maximum number of rows that you can download. This is set to 1 million rows by default.

  7. Click OK.

  8. For Authentication, copy the token that you retrieved from the credentials file into Bearer Token.

  9. Click Connect.

Limitations of the Power BI Delta Sharing connector

The Power BI Delta Sharing Connector has the following limitations:

  • It is not yet available in the Power BI service.

  • The data that the connector loads must fit into the memory of your machine. To ensure this, the connector limits the number of imported rows to the Row Limit that you set under the Advanced Options tab in Power BI Desktop.