Access data shared with you using Delta Sharing

Preview

Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group in the Databricks Account Console. See Enable the External Data Sharing feature group for your account.

Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of those terms.

This article shows data recipients how to access data shared from Delta Sharing using Databricks, Apache Spark, and pandas.

Delta Sharing and data recipients

Delta Sharing is an open standard for secure data sharing. A Databricks user, called a “data provider”, can use Delta Sharing to share data with a person or group outside of their organization, called a “data recipient”. For a full list of connectors and information about how to use them, see the Delta Sharing documentation.

The shared data is not provided by Databricks directly but by data providers running on Databricks.

Note

By accessing a data provider’s shared data as a data recipient, data recipient represents that it has been authorized to access the data share(s) provided to it by the data provider and acknowledges that (1) Databricks has no liability for such data or data recipient’s use of such shared data, and (2) Databricks may collect information about data recipient’s use of and access to the shared data (including identifying any individual or company who accesses the data using the credential file in connection with such information) and may share it with the applicable data provider.

Download the credential file

To grant you access to shared data, the data provider sends you an activation URL, where you can download a credential file. This article shows how you can use the credential file to access the shared data.

Important

You can download a credential file only one time. If you visit the activation link again after you have already downloaded the credential file, the Download Credential File button is disabled.

  • Don’t further share the activation link or share the credential file you have received with anyone outside of your organization.

  • If you lose the activation link before using it, contact the data provider.

  1. Click the activation link shared with you by the data provider. The activation page opens in your browser.

  2. Click Download Credential File.

Store the credential file in a secure location. If you need to share it with someone in your organization, Databricks recommends using a password manager.

Read the shared data

After you download the credential file, your code uses the credential file to authenticate to the data provider’s Databricks account and read the data the data provider shared with you. The access will persist until the provider stops sharing the data with you. Updates to the data are available to you in near real time. You can read and make copies of the shared data, but you can’t modify the source data.

The following sections show how to access shared data using Databricks, Apache Spark, and pandas. If you run into trouble accessing the shared data, contact the data provider.

Note

Partner integrations are, unless otherwise noted, provided by the third parties and you must have an account with the appropriate provider for the use of their products and services. While Databricks does its best to keep this content up to date, we make no representation regarding the integrations or the accuracy of the content on the partner integration pages. Reach out to the appropriate providers regarding the integrations.

Use Databricks

Follow these steps to access shared data in a Databricks workspace using notebook commands. You store the credential file in DBFS, then use it to authenticate to the data provider’s Databricks account and read the data the data provider shared with you.

In this example, you create a notebook with multiple cells you can run independently. You could instead add the notebook commands to the same cell and run them in a sequence.

  1. In a text editor, open the credential file you downloaded.

  2. In the sidebar, click Workspace.

  3. Right-click a folder, then click Create > Notebook.

    • Enter a name.

    • Set the notebook’s default language to Python (the default).

    • Select a cluster to attach to the notebook. The cluster must run Databricks Runtime 8.4 or above, or have the Apache Spark connector library installed. For more information about installing cluster libraries, see Libraries.

    • Click Create.

    The notebook opens in the notebook editor.

  4. To use Python or pandas to access the shared data, install the delta-sharing Python connector. In the notebook editor, paste the following command:

    %sh pip install delta-sharing
    
  5. In the cell actions menu at the far right, click the run icon and select Run Cell, or press Shift+Enter.

    The delta-sharing Python library is installed in the cluster if it isn’t already installed.

  6. In the notebook editor, click the down caret and select Add Cell Below. In the new cell, paste the following command, which uploads the contents of the credential file to a folder in DBFS. Replace the variables as follows:

    • <dbfs-path>: the path to the folder where you want to save the credential file

    • <credential-file-contents>: the contents of the credential file.

      The credential file contains JSON that defines three fields: shareCredentialsVersion, endpoint, and bearerToken. An illustrative example of the file’s contents follows the command below.

      %scala
      dbutils.fs.put("<dbfs-path>/config.share","""
      <credential-file-contents>
      """)
      
  7. In the cell actions menu at the far right, click the run icon and select Run Cell, or press Shift+Enter.

    After the credential file is uploaded, you can delete this cell. All workspace users can read the credential file from DBFS, and the credential file is available in DBFS on all clusters and SQL warehouses. To delete the cell, click x in the cell actions menu at the far right.

  8. Using Python, list the tables in the share. In the notebook editor, click the down caret and select Add Cell Below.

  9. In the new cell, paste the following command. Replace <dbfs-path> with the path from the previous command.

    When the code runs, Python reads the credential file from DBFS on the cluster. DBFS is mounted using FUSE at /dbfs/.

    import delta_sharing
    
    client = delta_sharing.SharingClient(f"/dbfs/<dbfs-path>/config.share")
    
    client.list_all_tables()
    
  10. In the cell actions menu at the far right, click the run icon and select Run Cell, or press Shift+Enter.

    The result is an array of tables, along with metadata for each table. The following output shows two tables:

    Out[10]: [Table(name='example_table', share='example_share_0', schema='default'), Table(name='other_example_table', share='example_share_0', schema='default')]
    

    If the output is empty or doesn’t contain the tables you expect, contact the data provider.

  11. Using Scala, query a shared table. In the notebook editor, click the down caret and select Add Cell Below. In the new cell, paste the following command. When the code runs, the credential file is read from DBFS through the JVM.

    Replace the variables:

    • <profile_path>: the DBFS path of the credential file. For example, /<dbfs-path>/config.share.

    • <share_name>: the value of share= for the table.

    • <schema_name>: the value of schema= for the table.

    • <table_name>: the value of name= for the table.

    %scala
    val df = spark.read.format("deltaSharing")
      .load("<profile_path>#<share_name>.<schema_name>.<table_name>")
    display(df.limit(10))
    
  12. In the cell actions menu at the far right, click the run icon and select Run Cell, or press Shift+Enter.

    Each time you load the shared table, you see fresh data from the source.

  13. To query the shared data using SQL, you must first create a local table in the workspace from the shared table, then query the local table.

    The shared data is not stored or cached in the local table. Each time you query the local table, you see the current state of the shared data.

    In the notebook editor, click the down caret and select Add Cell Below. In the new cell, paste the following command.

    Replace the variables:

    • <local_table_name>: the name of the local table.

    • <profile_path>: the location of the credential file.

    • <share_name>: the value of share= for the table.

    • <schema_name>: the value of schema= for the table.

    • <table_name>: the value of name= for the table.

    %sql
    DROP TABLE IF EXISTS <local_table_name>;
    
    CREATE TABLE <local_table_name> USING deltaSharing LOCATION "<profile_path>#<share_name>.<schema_name>.<table_name>";
    
    SELECT * FROM <local_table_name> LIMIT 10;
    
  14. In the cell actions menu at the far right, click the run icon and select Run Cell, or press Shift+Enter.

    When you run the command, the local table is created and the shared data is queried directly through it; as a test, the first 10 rows are returned.

If the output is empty or doesn’t contain the data you expect, contact the data provider.

Audit access and activity for Delta Sharing resources

After you configure audit logging, Delta Sharing saves audit logs for activities such as when someone creates, modifies, updates, or deletes a share or a recipient, when a recipient accesses an activation link and downloads the credential, or when a recipient’s credential is rotated or expires. Delta Sharing activity is logged at the account level.

  1. Enable audit logs for your account.

    Important

    Delta Sharing activity is logged at the account level. Do not enter a value into workspace_ids_filter.

    Audit logs are delivered for each workspace in your account, as well as for account-level activities. Logs are delivered to the S3 bucket that you configure.

  2. Events for Delta Sharing are logged with serviceName set to unityCatalog. The requestParams section of each event includes the following fields, which you can share with the data provider to help them troubleshoot issues.

    • recipient_name: The name of the recipient in the data provider’s system.

    • metastore_id: The ID of the metastore in the data provider’s system.

    • sourceIPAddress: The IP address where the request originated.

    For example, the following audit event shows that a recipient successfully listed the shares that were available to them. In this example, redacted values are replaced with <redacted>.

    {
      "version":"2.0",
      "auditLevel":"ACCOUNT_LEVEL",
      "timestamp":1635235341950,
      "orgId":"0",
      "shardName":"<redacted>",
      "accountId":"<redacted>",
      "sourceIPAddress":"<redacted>",
      "userAgent":null,
      "sessionId":null,
      "userIdentity":null,
      "serviceName":"unityCatalog",
      "actionName":"deltaSharingListShares",
      "requestId":"ServiceMain-cddd3114b1b40003",
      "requestParams":{
        "metastore_id":"<redacted>",
        "options":"{}",
        "recipient_name":"<redacted>"
      },
      "response":{
        "statusCode":200,
        "errorMessage":null,
        "result":null
      },
      "MAX_LOG_MESSAGE_LENGTH":16384
    }
    

The following table lists audited events for Delta Sharing, from the point of view of the data recipient.

| action | requestParams |
| --- | --- |
| deltaSharingListShares | options: The pagination options provided with this request. |
| deltaSharingListSchemas | share: The name of the share. options: The pagination options provided with this request. |
| deltaSharingListTables | share: The name of the share. options: The pagination options provided with this request. |
| deltaSharingListAllTables | share: The name of the share. |
| deltaSharingGetTableVersion | share: The name of the share. schema: The name of the schema. name: The name of the table. |
| deltaSharingGetTableMetadata | share: The name of the share. schema: The name of the schema. name: The name of the table. predicateHints: The predicates included in the query. limitHints: The maximum number of rows to return. |
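
If you deliver audit logs as JSON files to your S3 bucket (as configured above), you can explore Delta Sharing activity with a short Spark query. The following is a minimal sketch that assumes a Databricks notebook, where spark is predefined; the storage path is a placeholder, and the column names follow the example event shown earlier.

    # Read the delivered audit logs and filter to Delta Sharing events.
    df = spark.read.json("s3://<audit-log-bucket>/<delivery-path>/")

    (df.filter(df.serviceName == "unityCatalog")
       .filter(df.actionName.startswith("deltaSharing"))
       .select("timestamp", "actionName", "sourceIPAddress", "requestParams")
       .show(truncate=False))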

The following Delta Sharing errors are logged, from the point of view of the data recipient. Items between < and > characters represent placeholder text.

  • The user attempted to access a share they do not have permission to access.

    DatabricksServiceException: PERMISSION_DENIED:
    User does not have SELECT on Share <share_name>
    
  • The user attempted to access a share that does not exist.

    DatabricksServiceException: SHARE_DOES_NOT_EXIST: Share <share_name> does not exist.
    
  • The user attempted to access a table that does not exist in the share.

    DatabricksServiceException: TABLE_DOES_NOT_EXIST: <table_name> does not exist.
    

For auditable events and errors for data providers, see Audit access and activity for Delta Sharing resources.

Access shared data outside Databricks

If you don’t use Databricks, follow these instructions to access shared data.

Use Apache Spark

Follow these steps to access shared data in Apache Spark 3.x or above.

  1. To access metadata related to the shared data, such as the list of tables shared with you, install the delta-sharing Python connector:

    pip install delta-sharing
    
  2. Install the Apache Spark connector.
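
    For example, you can attach the connector when you launch Spark. This is a sketch: io.delta:delta-sharing-spark_2.12 is the connector’s published Maven coordinate, but replace <version> with a release that matches your Spark environment (see the Delta Sharing documentation).

    pyspark --packages io.delta:delta-sharing-spark_2.12:<version>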

  3. List the tables in the share. In the following example, replace <profile_path> with the location of the credential file.

     import delta_sharing
    
     client = delta_sharing.SharingClient(f"<profile_path>/config.share")
    
     client.list_all_tables()
    

    The result is an array of tables, along with metadata for each table. The following output shows two tables:

     Out[10]: [Table(name='example_table', share='example_share_0', schema='default'), Table(name='other_example_table', share='example_share_0', schema='default')]
    

    If the output is empty or doesn’t contain the tables you expect, contact the data provider.

  4. Access shared data in Spark using Python:

    # Option 1: load the shared table with the delta-sharing connector.
    df = delta_sharing.load_as_spark(f"<profile_path>#<share_name>.<schema_name>.<table_name>")

    # Option 2: use the Spark DataFrame reader directly.
    df = (spark.read.format("deltaSharing")
        .load("<profile_path>#<share_name>.<schema_name>.<table_name>")
        .limit(10))
    

    Replace the variables as follows:

    • <profile_path>: the location of the credential file.

    • <share_name>: the value of share= for the table.

    • <schema_name>: the value of schema= for the table.

    • <table_name>: the value of name= for the table.

  5. Access shared data in Apache Spark using Scala:

    spark.read.format("deltaSharing")
    .load("<profile_path>#<share_name>.<schema_name>.<table_name>")
    .limit(10)
    

    Replace the variables as follows:

    • <profile_path>: the location of the credential file.

    • <share_name>: the value of share= for the table.

    • <schema_name>: the value of schema= for the table.

    • <table_name>: the value of name= for the table.

If the output is empty or doesn’t contain the data you expect, contact the data provider.

Use pandas

Follow these steps to access shared data in pandas 0.25.3 or above.

  1. To access metadata related to the shared data, such as the list of tables shared with you, install the delta-sharing Python connector:

    pip install delta-sharing
    
  2. List the tables in the share. In the following example, replace <profile_path> with the location of the credential file.

     import delta_sharing
    
     client = delta_sharing.SharingClient(f"<profile_path>/config.share")
    
     client.list_all_tables()
    

    If the output is empty or doesn’t contain the tables you expect, contact the data provider.

  3. Access shared data in pandas using Python. In the following example, replace the variables as follows:

    • <profile_path>: the location of the credential file.

    • <share_name>: the value of share= for the table.

    • <schema_name>: the value of schema= for the table.

    • <table_name>: the value of name= for the table.

    import delta_sharing
    delta_sharing.load_as_pandas(f"<profile_path>#<share_name>.<schema_name>.<table_name>")
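
    load_as_pandas returns a standard pandas DataFrame, so the usual pandas operations apply to the result. A minimal sketch, assuming the placeholders above have been replaced:

    # Load the shared table, then preview the first 10 rows.
    df = delta_sharing.load_as_pandas(f"<profile_path>#<share_name>.<schema_name>.<table_name>")
    print(df.head(10))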
    

If the output is empty or doesn’t contain the data you expect, contact the data provider.