Connect to a Google Cloud Storage (GCS) external location
This page describes how to connect to a Google Cloud Storage (GCS) external location. After you create this connection, you can use Unity Catalog to govern access to the objects in that GCS location.
To successfully connect to a GCS bucket path, you need two Unity Catalog securable objects. The first is a storage credential, which specifies an IAM role that allows access to the GCS bucket. You need this storage credential for the second required object: an external location, which defines the path to your GCS storage location and the credentials required to access that location.
Requirements
In Databricks:
- Databricks workspace enabled for Unity Catalog.
- `CREATE STORAGE CREDENTIAL` privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.
- `CREATE EXTERNAL LOCATION` privilege on both the Unity Catalog metastore and the storage credential referenced by the external location. Metastore admins and workspace admins have this privilege by default.
In your Google Cloud account:
- A GCS bucket. To avoid egress charges, this should be in the same region as the workspace you want to access the data from.
- External location paths must contain only standard ASCII characters (letters A–Z, a–z, digits 0–9, and common symbols like /, _, -).
- Google Cloud Storage hierarchical namespace (HNS) is not supported with external locations. Disable hierarchical namespace before creating an external location.
- Permission to modify the access policy for that bucket.
Create a storage credential that accesses GCS
To create a storage credential for access to a GCS bucket, you give Unity Catalog the ability to read and write to the bucket by assigning IAM roles on that bucket to a Databricks-generated Google Cloud service account.
Generate a Google Cloud service account using Catalog Explorer
1. Log in to your Unity Catalog-enabled Databricks workspace as a user who has the `CREATE STORAGE CREDENTIAL` privilege on the metastore.
2. In the sidebar, click Catalog.
3. On the Quick access page, click the External data > button, go to the Credentials tab, and select Create credential.
4. Select a Credential Type of GCP Service Account.
5. Enter a Storage credential name and an optional comment.
6. (Optional) If you want users to have read-only access to the external locations that use this storage credential, click Advanced Options and select Limit to read-only use. For more information, see Mark a storage credential as read-only.
7. Click Create.
   Databricks creates the storage credential and generates a Google Cloud service account.
8. On the Credential created dialog, make a note of the service account ID, which is in the form of an email address, and click Done.
9. (Optional) Bind the storage credential to specific workspaces.
   By default, any privileged user can use the storage credential on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See Assign a storage credential to specific workspaces.
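(Optional) You can also confirm the credential from a notebook or the SQL query editor. This is a minimal sketch that assumes a storage credential named `my_gcs_credential` (a hypothetical name); the credential details include the Databricks-generated service account email that you noted above.

```sql
-- List all storage credentials defined in the metastore
SHOW STORAGE CREDENTIALS;

-- Show the details of one credential, including its
-- Databricks-generated Google Cloud service account
DESCRIBE STORAGE CREDENTIAL my_gcs_credential;
```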
Configure permissions for the service account
You now have a storage credential in Databricks that's associated with a Google service account. Before using the storage credential, you must also grant the Google service account permissions to access your specific GCS bucket.
1. Go to the Google Cloud console and open the GCS bucket that you want to access from Databricks.
   To avoid egress charges, the bucket should be in the same region as the Databricks workspace you want to access the data from.
2. On the Permissions tab, click + Grant access and assign the service account the following roles:
   - Storage Legacy Bucket Reader
   - Storage Object Admin
   Use the service account's email address as the principal identifier.
3. Click Save.
You can now create an external location that references your storage credential.
(Recommended) Configure permissions for file events
This step is optional but highly recommended. If you don't grant Databricks access to configure file events on your behalf, you must configure file events manually for each location. If you don't configure file events at all, you have limited access to critical features that Databricks might release.
The steps below allow Databricks to set up a complete notification pipeline to publish event notification messages from your GCS buckets to Google Cloud Pub/Sub. They assume that you have a GCP project with a GCS bucket and have enabled the Pub/Sub API.
1. Create a custom IAM role for file events.
   - In the Google Cloud console for the project containing your GCS bucket, navigate to IAM & Admin > Roles.
   - If you already have a custom IAM role, select it and click Edit Role. Otherwise, create a new role by clicking + Create Role from the Roles page.
   - On the Create Role or Edit Role screen, add the following permissions to your custom IAM role and save the changes. For detailed instructions, see the GCP documentation.
     - pubsub.subscriptions.consume
     - pubsub.subscriptions.create
     - pubsub.subscriptions.delete
     - pubsub.subscriptions.get
     - pubsub.subscriptions.list
     - pubsub.subscriptions.update
     - pubsub.topics.attachSubscription
     - pubsub.topics.create
     - pubsub.topics.delete
     - pubsub.topics.get
     - pubsub.topics.list
     - pubsub.topics.update
     - storage.buckets.update
2. Grant access to the role.
   - Navigate to IAM & Admin > IAM.
   - Click Grant Access.
   - Enter your service account as the principal.
   - Select your custom IAM role.
   - Click Save.
3. Grant permissions to the Cloud Storage service agent.
   - Find the service agent account email by following these steps in the Google Cloud documentation.
   - In the Google Cloud console, navigate to IAM & Admin > IAM > Grant Access.
   - Enter the service agent account email and assign it the Pub/Sub Publisher role.
You can now create an external location that references this storage credential.
Create an external location for a GCS bucket
This section describes how to create an external location using either Catalog Explorer or SQL. It assumes that you already have a storage credential that allows access to your GCS bucket. If you don't have a storage credential, follow the steps in Create a storage credential that accesses GCS.
Option 1: Create an external location manually using Catalog Explorer
You can create an external location manually using Catalog Explorer.
To create the external location:
1. Log in to a workspace that is attached to the metastore.
2. In the sidebar, click Catalog.
3. On the Quick access page, click the External data > button, go to the External Locations tab, and click Create external location.
4. Enter an External location name.
5. Under Storage type, select GCP.
6. Under URL, enter the GCS bucket path. For example, `gs://mybucket/<path>`.
7. Under Storage credential, select the storage credential that grants access to the external location.
8. (Optional) If you want users to have read-only access to the external location, click Advanced Options and select Limit to read-only use. For more information, see Mark an external location as read-only.
9. (Optional) If the external location is intended for a Hive metastore federated catalog, click Advanced options and enable Fallback mode.
10. (Optional) To enable the ability to subscribe to change notifications on the external location, click Advanced Options and select Enable file events.
    For details, see (Recommended) Enable file events for an external location.
11. Click Create.
12. (Optional) Bind the external location to specific workspaces.
    By default, any privileged user can use the external location on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See Assign an external location to specific workspaces.
13. Go to the Permissions tab to grant permission to use the external location.
    For anyone to use the external location, you must grant permissions:
    - To use the external location to add a managed storage location to a metastore, catalog, or schema, grant the `CREATE MANAGED LOCATION` privilege.
    - To create external tables or volumes, grant `CREATE EXTERNAL TABLE` or `CREATE EXTERNAL VOLUME`.
    To grant a privilege:
    - Click Grant.
    - On the Grant on <external location> dialog, select users, groups, or service principals in the Principals field, and select the privilege you want to grant.
    - Click Grant.
    A SQL equivalent of these grants is sketched after these steps.
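If you prefer to grant these privileges with SQL instead of Catalog Explorer, you can run statements like the following in a notebook or the SQL query editor. This is a minimal sketch that assumes an external location named `my_gcs_location` and a hypothetical group named `data_engineers`; adjust the names and privileges to your environment.

```sql
-- Allow the group to create external tables and volumes at this location
GRANT CREATE EXTERNAL TABLE, CREATE EXTERNAL VOLUME
ON EXTERNAL LOCATION my_gcs_location
TO `data_engineers`;

-- Allow the group to use this location as managed storage
-- for a metastore, catalog, or schema
GRANT CREATE MANAGED LOCATION
ON EXTERNAL LOCATION my_gcs_location
TO `data_engineers`;
```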
Option 2: Create an external location using SQL
To create an external location using SQL, run the following command in a notebook or the SQL query editor. Replace the placeholder values. For required permissions and prerequisites, see Requirements.
- `<location-name>`: A name for the external location. If `location_name` includes special characters, such as hyphens (-), it must be surrounded by backticks (` `). See Names.
- `<bucket-path>`: The path in your cloud tenant that this external location grants access to. For example, `gs://mybucket`.
- `<storage-credential-name>`: The name of the storage credential that authorizes reading from and writing to the bucket. If the storage credential name includes special characters, such as hyphens (-), it must be surrounded by backticks (` `).
```sql
CREATE EXTERNAL LOCATION [IF NOT EXISTS] `<location-name>`
URL '<bucket-path>'
WITH ([STORAGE] CREDENTIAL `<storage-credential-name>`)
[COMMENT '<comment-string>'];
```
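For example, with the placeholders filled in (the names and path below are hypothetical), the command might look like this:

```sql
CREATE EXTERNAL LOCATION IF NOT EXISTS my_gcs_location
URL 'gs://mybucket/sales-data'
WITH (STORAGE CREDENTIAL my_gcs_credential)
COMMENT 'External location for the sales data path in mybucket';
```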
If you want to limit external location access to specific workspaces in your account, also known as workspace binding or external location isolation, see Assign an external location to specific workspaces.
Verify the connection
To verify that you've successfully created the external location, try to read a file from the external location. For example, suppose that you have an external location `gs://external-location-bucket` containing a CSV file named `example.csv`. To read from the `gs://external-location-bucket/example.csv` file, follow these steps:
1. In the sidebar, click Workspace.
2. Click Create, then select Notebook.
3. Run the following Python code snippet:

   ```python
   display(dbutils.fs.ls('gs://external-location-bucket/'))
   ```

   This displays a list of file paths in the external location. In this example, the `gs://external-location-bucket/example.csv` file appears in the output.
4. To read a specific file in the external location, run the following Python code snippet:

   ```python
   spark.read.format("csv") \
     .option("header", "true") \
     .option("delimiter", ";") \
     .load('gs://external-location-bucket/example.csv') \
     .display()
   ```

   This displays the data in the `gs://external-location-bucket/example.csv` file.
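You can also run an equivalent check with SQL. This is a minimal sketch that assumes the same hypothetical bucket and file as the steps above; both statements rely on the permissions granted through the external location.

```sql
-- List the files at the external location path
LIST 'gs://external-location-bucket/';

-- Read the example CSV file directly; sep matches the
-- semicolon delimiter used in the Python example above
SELECT * FROM read_files(
  'gs://external-location-bucket/example.csv',
  format => 'csv',
  header => true,
  sep => ';'
);
```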
Next steps
- Grant other users permission to use external locations. See Manage external locations.
- Define managed storage locations using external locations. See Specify a managed storage location in Unity Catalog.
- Define external tables using external locations. See Work with external tables.
- Define external volumes using external locations. See What are Unity Catalog volumes?.
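As a brief, hypothetical sketch of the last two next steps, an external table and an external volume backed by the new location might be defined like this (the catalog, schema, and path names are placeholders):

```sql
-- External table stored at a path governed by the external location
CREATE TABLE main.default.sales_external (
  id INT,
  amount DOUBLE
)
USING DELTA
LOCATION 'gs://external-location-bucket/tables/sales';

-- External volume for non-tabular files under the same location
CREATE EXTERNAL VOLUME main.default.landing_files
LOCATION 'gs://external-location-bucket/volumes/landing';
```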