Connect to a Google Cloud Storage bucket

note

Only SAP Databricks accounts deployed on GCP can connect to Google Cloud Storage.

This page describes how to connect a Google Cloud Storage (GCS) bucket to your SAP Databricks account.

To connect to external cloud storage in SAP Databricks, you need:

  • Storage credential: Represents an authentication and authorization mechanism for accessing data stored on your cloud tenant, for example, a service account for GCS buckets. Admins can assign privileges to control which users and groups can use the credential to define external locations. Grant this privilege only to users who need to create external location objects.

  • External location: This combination of a cloud storage path and a storage credential authorizes access to the cloud storage path. Privileges granted on the external location govern who can access the cloud storage path defined by the external location.

warning

To prevent data loss, SAP Databricks requires that external locations be read-only.

Before you begin

Prerequisites:

  • The Google Cloud Storage bucket you reference in the external location must exist before you create the external location object in SAP Databricks. To avoid egress charges, the bucket should be in the same region as the workspace from which you want to access the data.
  • You must have permission to modify the access policy for that bucket.

Databricks permissions requirements:

  • CREATE STORAGE CREDENTIAL privilege on the metastore attached to the workspace. Account admins and metastore admins have this privilege by default.
  • CREATE EXTERNAL LOCATION privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have CREATE EXTERNAL LOCATION on the metastore by default.
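
If you manage grants programmatically, a metastore admin can assign both privileges with the Databricks SDK for Python. The following is a minimal sketch, not the only way to do this; the group name data-engineers is a hypothetical placeholder:

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service.catalog import (
      PermissionsChange,
      Privilege,
      SecurableType,
  )

  w = WorkspaceClient()

  # ID of the metastore attached to the current workspace.
  metastore_id = w.metastores.current().metastore_id

  # Grant both prerequisite privileges to a hypothetical group.
  w.grants.update(
      securable_type=SecurableType.METASTORE,
      full_name=metastore_id,
      changes=[
          PermissionsChange(
              principal="data-engineers",
              add=[
                  Privilege.CREATE_STORAGE_CREDENTIAL,
                  Privilege.CREATE_EXTERNAL_LOCATION,
              ],
          )
      ],
  )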

Create a storage credential

To create a storage credential, use the SAP Databricks Catalog Explorer, which generates a Google Cloud service account for you (a scripted alternative follows these steps):

  1. Log in to your Unity Catalog-enabled SAP Databricks workspace as a user who has the CREATE STORAGE CREDENTIAL privilege on the metastore.

  2. In the sidebar, click Catalog.

  3. On the Quick access page, click the External data > button, go to the Credentials tab, and select Create credential.

  4. Select a Credential Type of GCP Service Account.

  5. Enter a Storage credential name and an optional comment.

  6. Select Read only so that the external locations that use this storage credential will be read-only.

  7. Click Create.

    SAP Databricks creates the storage credential and generates a Google Cloud service account.

  8. On the Storage credential created dialog, make a note of the service account ID, which is in the form of an email address, and click Done.

  9. (Optional) Bind the storage credential to specific workspaces.

    By default, any privileged user can use the storage credential on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See (Optional) Assign an external location to specific workspaces.
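
If you prefer to script these steps, the sketch below uses the Databricks SDK for Python to create a read-only GCP service account credential; the credential name and comment are hypothetical placeholders, and field names may need adjustment for your SDK version:

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service.catalog import DatabricksGcpServiceAccountRequest

  w = WorkspaceClient()

  # SAP Databricks generates the Google Cloud service account when the
  # credential is created.
  cred = w.storage_credentials.create(
      name="gcs-readonly-cred",
      databricks_gcp_service_account=DatabricksGcpServiceAccountRequest(),
      read_only=True,
      comment="Read-only credential for GCS",
  )

  # Note the generated service account email; you grant it bucket access
  # in the next section.
  print(cred.databricks_gcp_service_account.email)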

Configure permissions for the service account

  1. Go to the Google Cloud console and open the GCS bucket that you want to access from your SAP Databricks workspace.

    To avoid egress charges, the bucket should be in the same region as the workspace you want to access the data from.

  2. On the Permissions tab, click + Grant access and assign the newly created service account the following roles:

    • Storage Legacy Bucket Reader
    • Storage Object Admin

    Use the service account’s email address as the principal identifier.
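
The same grant can be scripted with the google-cloud-storage client library. A minimal sketch, assuming a bucket named mybucket and the service account email you noted earlier:

  from google.cloud import storage

  BUCKET = "mybucket"  # your GCS bucket
  SA_EMAIL = "your-sa@your-project.iam.gserviceaccount.com"  # from the credential dialog

  client = storage.Client()
  bucket = client.bucket(BUCKET)

  # Fetch the current IAM policy and append both role bindings for the
  # service account.
  policy = bucket.get_iam_policy(requested_policy_version=3)
  for role in ("roles/storage.legacyBucketReader", "roles/storage.objectAdmin"):
      policy.bindings.append(
          {"role": role, "members": {f"serviceAccount:{SA_EMAIL}"}}
      )
  bucket.set_iam_policy(policy)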

(Optional) Grant SAP Databricks access to configure file events

note

This step is optional but highly recommended. If you do not grant SAP Databricks access to configure file events on your behalf, you must configure file events manually for each location. Without file events, you will have limited access to critical features that Databricks may release in the future.

The steps below allow Databricks to set up a complete notification pipeline to publish event notification messages from your GCS buckets to Google Cloud Pub/Sub. They assume that you have a GCP project with a GCS bucket and have enabled the Pub/Sub API. A scripted sketch of the custom role follows these steps.

  1. Create a custom IAM role for file events.

    1. In the Google Cloud console for the project containing your GCS bucket, navigate to IAM & Admin > Roles.

    2. If you already have a custom IAM role, select it and click Edit Role. Otherwise, create a new role by clicking + Create Role from the Roles page.

    3. On the Create Role or Edit Role screen, add the following permissions to your custom IAM role and save the changes. For detailed instructions, see the GCP documentation.

      pubsub.subscriptions.consume
      pubsub.subscriptions.create
      pubsub.subscriptions.delete
      pubsub.subscriptions.get
      pubsub.subscriptions.list
      pubsub.subscriptions.update
      pubsub.topics.attachSubscription
      pubsub.topics.create
      pubsub.topics.delete
      pubsub.topics.get
      pubsub.topics.list
      pubsub.topics.update
      storage.buckets.update
  2. Grant access to the role.

    1. Navigate to IAM & Admin > IAM.
    2. Click Grant Access.
    3. Enter your service account as the principal.
    4. Select your custom IAM role.
    5. Click Save.
  3. Grant permissions to the Cloud Storage Service Agent.

    1. Find the service agent account email by following these steps in the Google Cloud documentation.
    2. In the Google Cloud console, navigate to IAM & Admin > IAM > Grant Access.
    3. Enter the service agent account email and assign the Pub/Sub Publisher role.
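
If you script this setup, the custom role from step 1 can be created with the google-cloud-iam client library. A sketch under the assumption that your project ID is my-project; the role ID databricksFileEvents is a placeholder of your choosing, and the grants in steps 2 and 3 are still made as described above:

  from google.cloud import iam_admin_v1

  PROJECT = "my-project"            # project that contains the bucket
  ROLE_ID = "databricksFileEvents"  # hypothetical custom role ID

  client = iam_admin_v1.IAMClient()

  # Custom role carrying the Pub/Sub and bucket-update permissions
  # listed in step 1.
  role = iam_admin_v1.Role(
      title="Databricks file events",
      included_permissions=[
          "pubsub.subscriptions.consume",
          "pubsub.subscriptions.create",
          "pubsub.subscriptions.delete",
          "pubsub.subscriptions.get",
          "pubsub.subscriptions.list",
          "pubsub.subscriptions.update",
          "pubsub.topics.attachSubscription",
          "pubsub.topics.create",
          "pubsub.topics.delete",
          "pubsub.topics.get",
          "pubsub.topics.list",
          "pubsub.topics.update",
          "storage.buckets.update",
      ],
  )

  client.create_role(
      request=iam_admin_v1.CreateRoleRequest(
          parent=f"projects/{PROJECT}",
          role_id=ROLE_ID,
          role=role,
      )
  )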

You can now create an external location that references this storage credential.

Create an external location

To create the external location using Catalog Explorer (a scripted alternative follows these steps):

  1. Log in to a workspace that is attached to the metastore.

  2. In the sidebar, click Catalog.

  3. On the Quick access page, click the External data > button, go to the External Locations tab, and click Create location.

  4. Enter an External location name.

  5. Under URL, enter or select the path to the external location. For example, gs://mybucket/<path>.

  6. Select the storage credential that grants access to the external location.

    If you don't have a storage credential, you can create one:

    1. In the Storage credential drop-down list, select + Create new storage credential.
    2. In the Credential type drop-down list, select GCP Service Account.
    3. A GCP service account is created for you automatically when you save the external location.
  7. Ensure the external location is read-only: click Advanced Options and verify that Read only is selected.

  8. (Optional) To enable the ability to subscribe to change notifications on the external location, click Advanced Options and select Enable file events.

  9. Click Create.

  10. (Optional) Bind the external location to specific workspaces.

    By default, any privileged user can use the external location on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See Bind an external location to one or more workspaces.

  11. Go to the Permissions tab to grant permission to use the external location.

    Before anyone can use the external location, you must grant permissions:

    • To use the external location to add a managed storage location to a metastore, catalog, or schema, grant the CREATE MANAGED LOCATION privilege.

    • To create external tables or volumes, grant CREATE EXTERNAL TABLE or CREATE EXTERNAL VOLUME.

    1. Click Grant.
    2. On the Grant on <external location> dialog, select users, groups, or service principals in the Principals field, and select the privilege you want to grant.
    3. Click Grant.
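
The external location and its grants can also be created programmatically. A minimal sketch with the Databricks SDK for Python; the location name, URL, and group are hypothetical, and the credential name matches the earlier sketch:

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service.catalog import (
      PermissionsChange,
      Privilege,
      SecurableType,
  )

  w = WorkspaceClient()

  # Create a read-only external location backed by the storage credential.
  loc = w.external_locations.create(
      name="analytics-bucket",
      url="gs://mybucket/data",
      credential_name="gcs-readonly-cred",
      read_only=True,
  )

  # Allow a group to create external tables against the location.
  w.grants.update(
      securable_type=SecurableType.EXTERNAL_LOCATION,
      full_name=loc.name,
      changes=[
          PermissionsChange(
              principal="data-engineers",
              add=[Privilege.CREATE_EXTERNAL_TABLE],
          )
      ],
  )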

(Optional) Assign an external location to specific workspaces

By default, an external location is accessible from all of the workspaces in the metastore. This means that if a user has been granted a privilege (such as READ FILES) on that external location, they can exercise that privilege from any workspace attached to the metastore. If you use workspaces to isolate user data access, you might want to allow access to an external location only from specific workspaces. This feature is known as workspace binding or external location isolation.

Typical use cases for binding an external location to specific workspaces include:

  • Ensuring that data engineers who have the CREATE EXTERNAL TABLE privilege on an external location that contains production data can create external tables on that location only in a production workspace.
  • Ensuring that data engineers who have the READ FILES privilege on an external location that contains sensitive data can only use specific workspaces to access that data.

important

Workspace bindings are referenced at the point when privileges against the external location are exercised. For example, if a user creates an external table by issuing the statement CREATE TABLE myCat.mySch.myTable LOCATION 'gs://mybucket/<path>' from the myWorkspace workspace, the following workspace binding checks are performed in addition to regular user privilege checks:

  • Is the external location covering 'gs://mybucket/<path>' bound to myWorkspace?
  • Is the catalog myCat bound to myWorkspace with access level Read & Write?

If the external location is subsequently unbound from myWorkspace, then the external table continues to function.

This feature also allows you to populate a catalog from a central workspace and make it available to other workspaces using catalog bindings without making the external location available in those other workspaces.

Bind an external location to one or more workspaces

To assign an external location to specific workspaces, you can use Catalog Explorer.

Permissions required: Metastore admin, external location owner, or MANAGE on the external location.

Metastore admins can see all external locations in a metastore using Catalog Explorer, and external location owners can see all external locations that they own in a metastore, regardless of whether the external location is assigned to the current workspace. External locations that are not assigned to the workspace appear grayed out.

  1. Log in to a workspace that is linked to the metastore.

  2. In the sidebar, click Catalog.

  3. On the Quick access page, click the External data > button to go to the External Locations tab.

  4. Select the external location and go to the Workspaces tab.

  5. On the Workspaces tab, clear the All workspaces have access checkbox.

    If your external location is already bound to one or more workspaces, this checkbox is already cleared.

  6. Click Assign to workspaces and enter or find the workspaces you want to assign.

To revoke access, go to the Workspaces tab, select the workspace, and click Revoke. To allow access from all workspaces, select the All workspaces have access checkbox.
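
Bindings can also be managed programmatically. The sketch below uses the Databricks SDK for Python; the external location name is the hypothetical one from the earlier sketch, the workspace ID is a placeholder, and the enum and method names should be verified against your SDK version:

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service.catalog import (
      UpdateBindingsSecurableType,
      WorkspaceBinding,
  )

  w = WorkspaceClient()

  # Bind the external location to a single workspace. Once any binding
  # exists, access from unbound workspaces is disallowed.
  w.workspace_bindings.update_bindings(
      securable_type=UpdateBindingsSecurableType.EXTERNAL_LOCATION,
      securable_name="analytics-bucket",
      add=[WorkspaceBinding(workspace_id=1234567890)],
  )

To revoke a binding, pass the same WorkspaceBinding in the remove argument instead of add.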