Google Cloud Storage
This article describes how to read from and write to Google Cloud Storage (GCS) buckets in Databricks. To read from or write to a GCS bucket, you create a Google Cloud service account, generate a key for that account, and add the key to your cluster's Spark configuration.
Connect to the bucket directly with a key that you generate for the service account.
Access a GCS bucket directly
To read and write directly to a bucket, you configure a key defined in your Spark config.
Step 1: Set up Google Cloud service account using Google Cloud Console
You must create a service account for the Databricks cluster. Databricks recommends giving this service account the least privileges needed to perform its tasks.
1. Click IAM and Admin in the left navigation pane.
2. Click Service Accounts.
3. Click + CREATE SERVICE ACCOUNT.
4. Enter the service account name and description.
5. Create a key. See Create a key to access GCS bucket directly.
Create a key to access GCS bucket directly
The JSON key you generate for the service account is a private key. Share it only with authorized users, because it controls access to datasets and resources in your Google Cloud account.
1. In the Google Cloud console, in the service accounts list, click the newly created account.
2. In the Keys section, click ADD KEY > Create new key.
3. Accept the JSON key type.
4. Click CREATE. The key file is downloaded to your computer.
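Before configuring the cluster, you can confirm that the downloaded key file contains the fields the Spark configuration in Step 3 needs. A minimal sketch in Python using only the standard library; the path you pass is wherever your browser saved the key:

```python
import json

# Fields from the key JSON that the Spark configuration in Step 3 uses.
REQUIRED_FIELDS = ["client_email", "project_id", "private_key", "private_key_id"]

def check_key_file(path):
    """Load a service account key JSON and return the fields Step 3 needs."""
    with open(path) as f:
        key = json.load(f)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    return {field: key[field] for field in REQUIRED_FIELDS}
```

Note that the `private_key` value is a multi-line PEM string; that detail matters when you paste it into the Spark config in Step 3.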
Step 2: Configure the GCS bucket
Create a bucket
If you do not already have a bucket, create one:
1. Click Storage in the left navigation pane.
2. Click CREATE BUCKET.
Step 3: Set up Databricks cluster
When you configure your cluster:
In the Databricks Runtime Version drop-down, select 7.3 LTS or above.
In the Spark Config tab, add all of the following Spark configuration. Replace <client_email>, <project_id>, <private_key>, and <private_key_id> with the values of those exact field names from your key JSON file.
The value for <private_key> spans multiple lines. Paste the entire private key, but do not include the leading and trailing quotes.
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
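The mapping from key-file fields to Spark properties can be sketched in Python. This is only an illustration of how the placeholders line up with your key JSON, not a Databricks API; you still paste the resulting lines into the cluster's Spark Config tab:

```python
# Build the Spark config lines above from a parsed service account key.
# `key` is the dict loaded from your key JSON file.

def gcs_spark_config(key):
    """Return the Spark configuration lines for direct GCS access."""
    return "\n".join([
        "spark.hadoop.google.cloud.auth.service.account.enable true",
        f"spark.hadoop.fs.gs.auth.service.account.email {key['client_email']}",
        f"spark.hadoop.fs.gs.project.id {key['project_id']}",
        f"spark.hadoop.fs.gs.auth.service.account.private.key {key['private_key']}",
        f"spark.hadoop.fs.gs.auth.service.account.private.key.id {key['private_key_id']}",
    ])
```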
Step 4: Usage
To read from the GCS bucket, use a Spark read command in any supported format, for example:
df = spark.read.format("parquet").load("gs://<bucket-name>/<path>")
To write to the GCS bucket, use a Spark write command in any supported format, for example:
df.write.format("parquet").save("gs://<bucket-name>/<path>")
Replace <bucket-name> with the name of the bucket you created in Step 2: Configure the GCS bucket.