Configure AWS storage
This article describes how to configure Amazon Web Services S3 buckets for two different use cases:
Root storage for a workspace: Root storage for workspace objects such as cluster logs, notebook revisions, job results, and libraries. To create a new workspace using the account console or the Account API, you must first set up an S3 bucket to use as your workspace’s root storage.
Log delivery (all deployment types): Storage for delivery of logs such as billable usage or audit logs. For more information, see Deliver and access billable usage logs and Configure audit logging.
Tip
You can automate AWS storage deployment using the Databricks Terraform provider. See Create Databricks workspaces using Terraform.
Databricks recommends that you review Security Best Practices for S3 for guidance on protecting the data in your bucket from unwanted access.
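For example, one common protection is to block all public access to the bucket. The following is a minimal boto3 sketch, assuming Python with boto3 installed and AWS credentials configured; the bucket name is a placeholder used for illustration only:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name used for illustration; replace with your own.
BUCKET = "my-databricks-root-bucket"

# Block all forms of public access to the bucket.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)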
Step 1: Create an S3 bucket
Log into your AWS Console as a user with administrator privileges and go to the S3 service.
Create an S3 bucket. See Create a Bucket in the AWS documentation.
Important
The S3 bucket must be in the same AWS region as the Databricks deployment.
Databricks recommends as a best practice that you use an S3 bucket that is dedicated to Databricks and not shared with other resources or services.
Do not reuse a bucket from legacy workspaces. For example, if you are migrating to E2, create a new AWS bucket for your E2 setup.
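If you prefer to script bucket creation instead of using the AWS Console, the following is a minimal boto3 sketch, assuming Python with boto3 installed and AWS credentials configured; the bucket name and region are placeholders, and the region must match your Databricks deployment region:

import boto3

# Hypothetical values used for illustration; replace with your own.
REGION = "us-west-2"          # must match the Databricks deployment region
BUCKET = "my-databricks-root-bucket"

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket in the chosen region.
# Note: for us-east-1, omit CreateBucketConfiguration entirely.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)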
Step 2: Apply bucket policy (workspace creation only)
Note
This step is necessary only if you are setting up root storage for a new workspace that you create with the Account API. Skip this step if you are setting up storage for log delivery.
In the AWS Console, go to the S3 service.
Click the bucket name.
Click the Permissions tab.
Click the Bucket Policy button.
Copy and modify this bucket policy, replacing <s3-bucket-name> with your S3 bucket name and YOUR_DATABRICKS_ACCOUNT_ID with your Databricks account ID. To apply the policy with automation instead of the console, see the sketch following these steps.
Note
If you are creating your storage configuration using the account console, you can also generate the bucket policy directly from the Add Storage Configuration dialog. See Manage storage configurations using the account console.
{ "Version": "2012-10-17", "Statement": [ { "Sid": "Grant Databricks Access", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::414351767826:root" }, "Action": [ "s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::<s3-bucket-name>/*", "arn:aws:s3:::<s3-bucket-name>" ], "Condition": { "StringEquals": { "aws:PrincipalTag/DatabricksAccountId": ["YOUR_DATABRICKS_ACCOUNT_ID"] } } } ] }
Create the lifecycle policy described in Advanced Configurations.
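If you apply the bucket policy with automation rather than the console (the sketch referenced above), a minimal boto3 example might look like the following; the bucket name and Databricks account ID are placeholders:

import json
import boto3

# Hypothetical values used for illustration; replace with your own.
BUCKET = "my-databricks-root-bucket"
DATABRICKS_ACCOUNT_ID = "00000000-0000-0000-0000-000000000000"

# The same policy shown in Step 2, with the placeholders filled in.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Grant Databricks Access",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::414351767826:root"},
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
            ],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}/*",
                f"arn:aws:s3:::{BUCKET}",
            ],
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalTag/DatabricksAccountId": [DATABRICKS_ACCOUNT_ID]
                }
            },
        }
    ],
}

s3 = boto3.client("s3")
# Apply the policy to the bucket.
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))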
Step 3: Set S3 object ownership (log delivery only)
Note
This step is necessary only if you are setting up storage for log delivery. Skip this step if you are setting up root storage for a new workspace.
Access to the logs depends on how you set up the S3 bucket. Databricks delivers logs to your S3 bucket with AWS’s built-in BucketOwnerFullControl Canned ACL so that account owners and designees can download the logs directly.
To support bucket ownership for newly-created objects, you must set your bucket’s S3 Object Ownership setting to the value Bucket owner preferred.
Important
If instead you set your bucket’s S3 Object Ownership setting to Object writer, new objects such as your logs remain owned by the uploading account, which by default is the IAM role that Databricks uses to access the bucket. This can make the logs difficult to access, because you cannot read them from the AWS console or from automation tools authenticated as the bucket owner.
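If you configure the bucket programmatically, the following is a minimal boto3 sketch for setting S3 Object Ownership to Bucket owner preferred; the bucket name is a placeholder:

import boto3

# Hypothetical bucket name used for illustration; replace with your log delivery bucket.
BUCKET = "my-databricks-log-delivery-bucket"

s3 = boto3.client("s3")

# Set Object Ownership to "Bucket owner preferred" so that objects delivered
# with the BucketOwnerFullControl canned ACL become owned by the bucket owner.
s3.put_bucket_ownership_controls(
    Bucket=BUCKET,
    OwnershipControls={"Rules": [{"ObjectOwnership": "BucketOwnerPreferred"}]},
)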
Step 4: Enable S3 object-level logging (recommended)
Databricks strongly recommends that you enable S3 object-level logging for your root storage bucket. This enables faster investigation of any issues that may come up. Be aware that S3 object-level logging can increase AWS usage costs.
For instructions, see the AWS documentation on CloudTrail event logging for S3 buckets and objects.
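As one possible automated approach, the following boto3 sketch adds S3 object-level (data event) logging for the bucket to an existing CloudTrail trail; the trail name and bucket name are placeholders, and this call replaces the trail’s current event selectors:

import boto3

# Hypothetical values used for illustration; replace with your own.
TRAIL_NAME = "my-existing-trail"
BUCKET = "my-databricks-root-bucket"

cloudtrail = boto3.client("cloudtrail")

# Record object-level (data) events for the bucket on an existing trail.
cloudtrail.put_event_selectors(
    TrailName=TRAIL_NAME,
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {"Type": "AWS::S3::Object", "Values": [f"arn:aws:s3:::{BUCKET}/"]}
            ],
        }
    ],
)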
Resolve validation failures
Bucket policy permissions can take a few minutes to propagate. Retry this procedure if validation fails due to permissions.
Verify correct permissions
When you create a storage configuration for your bucket, Databricks checks that the bucket is set up with the correct permissions. One of these checks writes a file to your bucket and immediately deletes it. If the delete operation fails, the temporary object remains at the root of your bucket, with a name that begins with databricks-verification-<uuid>.
If you see this object, the bucket policy is likely misconfigured: Databricks has PUT permission but not DELETE permission. Review the bucket policy and ensure that the permissions are configured correctly.
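To check whether any leftover verification objects are present, a minimal boto3 sketch such as the following can list them; the bucket name is a placeholder:

import boto3

# Hypothetical bucket name used for illustration; replace with your own.
BUCKET = "my-databricks-root-bucket"

s3 = boto3.client("s3")

# List any leftover verification objects at the root of the bucket.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="databricks-verification-")
for obj in response.get("Contents", []):
    print(obj["Key"])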