Create a storage credential for connecting to AWS S3

This article describes how to create a storage credential in Unity Catalog to connect to AWS S3.

To manage access to the underlying cloud storage that holds tables and volumes, Unity Catalog uses the following object types:

  • Storage credentials encapsulate a long-term cloud credential that provides access to cloud storage.

  • External locations contain a reference to a storage credential and a cloud storage path.

For more information, see Connect to cloud object storage using Unity Catalog.

Unity Catalog supports two cloud storage options for Databricks on AWS: AWS S3 buckets and Cloudflare R2 buckets. Cloudflare R2 is intended primarily for Delta Sharing use cases in which you want to avoid data egress fees. S3 is appropriate for most other use cases. This article focuses on creating storage credentials for S3. For Cloudflare R2, see Create a storage credential for connecting to Cloudflare R2.

To create a storage credential for access to an S3 bucket, you create an IAM role that authorizes access (read, or read and write) to the S3 bucket path and reference that IAM role in the storage credential definition.

Requirements

In Databricks:

  • Databricks workspace enabled for Unity Catalog.

  • CREATE STORAGE CREDENTIAL privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.

In your AWS account:

  • An S3 bucket in the same region as the workspaces you want to access the data from.

    The bucket name cannot include dot notation (for example, incorrect.bucket.name.notation). For more bucket naming guidance, see the AWS bucket naming rules.

  • The ability to create IAM roles.
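
The bucket naming requirement above can be checked before you create any IAM resources. The following is a minimal sketch, not an official validator: the function name `bucket_name_problems` is hypothetical, the dot check comes from this article, and the length and character checks reflect the general AWS bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit).

```python
import re


def bucket_name_problems(name):
    """Return a list of reasons the bucket name is unsuitable (empty if none)."""
    problems = []
    # Requirement stated in this article: no dot notation in the bucket name.
    if "." in name:
        problems.append("contains a dot, which this article says is not allowed")
    # General AWS naming rules (assumed here; see the AWS bucket naming rules).
    if not 3 <= len(name) <= 63:
        problems.append("length must be between 3 and 63 characters")
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name):
        problems.append(
            "must use only lowercase letters, digits, dots, and hyphens, "
            "and start and end with a letter or digit"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical bucket names for illustration.
    for candidate in ("my-uc-bucket", "incorrect.bucket.name.notation"):
        print(candidate, bucket_name_problems(candidate))
```

An empty list means the name passed these checks; it does not guarantee that AWS accepts the name, because the full rules include further restrictions.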

Step 1: Create an IAM role

In AWS, create an IAM role that gives access to the S3 bucket that you want your users to access. This IAM role must be defined in the same account as the S3 bucket.

Tip

If you have already created an IAM role that provides this access, you can skip this step and go straight to Step 2: Give Databricks the IAM role details.

  1. Create an IAM role that will allow access to the S3 bucket.

    Role creation is a two-step process. In this step you create the role, adding a temporary trust relationship policy and a placeholder external ID that you then modify after creating the storage credential in Databricks.

    You must modify the trust policy after you create the role because your role must be self-assuming (that is, it must be configured to trust itself). The role must therefore exist before you add the self-assumption statement. For information about self-assuming roles, see this Amazon blog article.

    To create the policy, you must use a placeholder external ID. An external ID is required in AWS to grant access to your AWS resources to a third party.

    1. Create the IAM role with a Custom Trust Policy.

    2. In the Custom Trust Policy field, paste the following policy JSON.

      This policy establishes a cross-account trust relationship so that Unity Catalog can assume the role to access the data in the bucket on behalf of Databricks users. The ARN in the Principal section specifies this relationship. It is a static value that references a role created by Databricks. The policy uses the Databricks AWS ARN arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL. If you are using Databricks on AWS GovCloud, use the Databricks on AWS GovCloud ARN arn:aws-us-gov:iam::044793339203:role/unity-catalog-prod-UCMasterRole-1QRFA8SGY15OJ instead.

      The policy sets the external ID to 0000 as a placeholder. You update this to the external ID of your storage credential in a later step.

      {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
            ]
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "0000"
            }
          }
        }]
      }
      
    3. Skip the permissions policy configuration. You’ll go back to add that in a later step.

    4. Save the IAM role.

  2. Create the following IAM policy in the same account as the S3 bucket, replacing the following values:

    • <BUCKET>: The name of the S3 bucket.

    • <KMS-KEY>: Optional. If encryption is enabled, provide the name of the KMS key that encrypts the S3 bucket contents. If encryption is disabled, remove the entire KMS section of the IAM policy.

    • <AWS-ACCOUNT-ID>: The Account ID of your AWS account (not your Databricks account).

    • <AWS-IAM-ROLE-NAME>: The name of the AWS IAM role that you created in the previous step.

    This IAM policy grants read and write access. You can also create a policy that grants read access only. However, this may be unnecessary, because you can mark the storage credential as read-only, and any write access granted by this IAM role will be ignored.

    {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "s3:GetObject",
                  "s3:PutObject",
                  "s3:DeleteObject",
                  "s3:ListBucket",
                  "s3:GetBucketLocation"
              ],
              "Resource": [
                  "arn:aws:s3:::<BUCKET>/*",
                  "arn:aws:s3:::<BUCKET>"
              ],
              "Effect": "Allow"
          },
          {
              "Action": [
                  "kms:Decrypt",
                  "kms:Encrypt",
                  "kms:GenerateDataKey*"
              ],
              "Resource": [
                  "arn:aws:kms:<KMS-KEY>"
              ],
              "Effect": "Allow"
          },
          {
              "Action": [
                  "sts:AssumeRole"
              ],
              "Resource": [
                  "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<AWS-IAM-ROLE-NAME>"
              ],
              "Effect": "Allow"
          }
        ]
    }
    

    Note

    If you need a more restrictive IAM policy for Unity Catalog, contact your Databricks account team for assistance.

  3. Attach the IAM policy to the IAM role.

    On the role’s Permissions tab, attach the IAM policy that you just created.
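
If you prefer to generate the two policy documents from Step 1 rather than edit the JSON by hand, the following minimal sketch builds them as Python dictionaries. The function names, parameter names, and example values are hypothetical. The policy contents mirror the JSON shown in this step, and the KMS statement is included only when you pass a key. The sketch does not call AWS; it only produces JSON that you could paste into the IAM console.

```python
import json

# Static Databricks role ARN quoted in this article (not the GovCloud ARN).
UC_MASTER_ROLE_ARN = (
    "arn:aws:iam::414351767826:role/"
    "unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
)


def build_placeholder_trust_policy(databricks_role_arn=UC_MASTER_ROLE_ARN):
    """Trust policy used at role creation, with the placeholder external ID 0000."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": [databricks_role_arn]},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "0000"}},
        }],
    }


def build_permissions_policy(bucket, aws_account_id, role_name, kms_key=None):
    """Read and write IAM policy from this article; the KMS statement is optional."""
    statements = [{
        "Action": [
            "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
            "s3:ListBucket", "s3:GetBucketLocation",
        ],
        "Resource": [f"arn:aws:s3:::{bucket}/*", f"arn:aws:s3:::{bucket}"],
        "Effect": "Allow",
    }]
    if kms_key is not None:
        # Include the KMS statement only when the bucket contents are encrypted.
        statements.append({
            "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey*"],
            "Resource": [f"arn:aws:kms:{kms_key}"],
            "Effect": "Allow",
        })
    # Allows the role to assume itself (the self-assuming requirement).
    statements.append({
        "Action": ["sts:AssumeRole"],
        "Resource": [f"arn:aws:iam::{aws_account_id}:role/{role_name}"],
        "Effect": "Allow",
    })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Hypothetical example values; replace them with your own.
    print(json.dumps(build_placeholder_trust_policy(), indent=2))
    print(json.dumps(
        build_permissions_policy("my-uc-bucket", "123456789012", "my-uc-role"),
        indent=2,
    ))
```

For AWS GovCloud, pass the GovCloud ARN quoted earlier in this step as `databricks_role_arn`.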

Step 2: Give Databricks the IAM role details

  1. In Databricks, log in to a workspace that is linked to the metastore.

    You must have the CREATE STORAGE CREDENTIAL privilege. The metastore admin and account admin roles both include this privilege.

  2. Click Catalog.

  3. At the top of the Catalog pane, click the Add (plus) icon and select Add a storage credential from the menu.

    This option does not appear if you don’t have the CREATE STORAGE CREDENTIAL privilege.

    Alternatively, from the Quick access page, click the External data > button, go to the Storage Credentials tab, and select Create credential.

  4. Select a Credential Type of AWS IAM Role.

  5. Enter a name for the credential, the IAM Role ARN that authorizes Unity Catalog to access the storage location on your cloud tenant, and an optional comment.

    Tip

    If you have already defined an instance profile in Databricks, you can click Copy instance profile to copy over the IAM role ARN for that instance profile. The instance profile’s IAM role must have a cross-account trust relationship that enables Databricks to assume the role in order to access the bucket on behalf of Databricks users. For more information about the IAM role policy and trust relationship requirements, see Step 1: Create an IAM role.

  6. (Optional) If you want users to have read-only access to the external locations that use this storage credential, in Advanced options select Read only. For more information, see Mark a storage credential as read-only.

  7. Click Create.

  8. In the Storage credential created dialog, copy the External ID.

  9. Click Done.

  10. (Optional) Bind the storage credential to specific workspaces.

    By default, any privileged user can use the storage credential on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See (Optional) Assign a storage credential to specific workspaces.

  11. Create an external location that references this storage credential.

You can also create a storage credential by using the Databricks Terraform provider and the databricks_storage_credential resource.

Step 3: Update the IAM role policy

In AWS, modify the trust relationship policy to add your storage credential’s external ID and make it self-assuming.

  1. Return to your saved IAM role and go to the Trust Relationships tab.

  2. Edit the trust relationship policy as follows:

    Add the following ARN to the Principal list of the “Allow” statement. Replace <YOUR-AWS-ACCOUNT-ID> and <THIS-ROLE-NAME> with your actual account ID and IAM role name.

    "arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<THIS-ROLE-NAME>"
    

    In the "sts:AssumeRole" statement, replace the placeholder external ID (0000) with the storage credential’s external ID that you copied in Step 2.

    "sts:ExternalId": "<STORAGE-CREDENTIAL-EXTERNAL-ID>"
    

    Your policy should now look like the following, with the replacement text updated to use your storage credential’s external ID, account ID, and IAM role values:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
              "arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role/<THIS-ROLE-NAME>"
            ]
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<STORAGE-CREDENTIAL-EXTERNAL-ID>"
            }
          }
        }
      ]
    }
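
The two edits described in this step (adding the role's own ARN as a principal and replacing the placeholder external ID) can also be expressed programmatically. The following is a minimal sketch under stated assumptions: the function name `finalize_trust_policy` and the example values are hypothetical, and the code only transforms a policy dictionary. It does not call AWS or update the role for you.

```python
import copy
import json

# Placeholder trust policy as created in Step 1 of this article.
PLACEHOLDER_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "AWS": [
                "arn:aws:iam::414351767826:role/"
                "unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
            ]
        },
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "0000"}},
    }],
}


def finalize_trust_policy(trust_policy, aws_account_id, role_name, external_id):
    """Return a copy of the policy that is self-assuming and uses the real external ID."""
    updated = copy.deepcopy(trust_policy)  # leave the caller's policy untouched
    statement = updated["Statement"][0]

    principals = statement["Principal"]["AWS"]
    if isinstance(principals, str):
        # A single principal may be stored as a plain string.
        principals = [principals]

    # Add the role's own ARN so that the role trusts itself.
    self_arn = f"arn:aws:iam::{aws_account_id}:role/{role_name}"
    if self_arn not in principals:
        principals.append(self_arn)
    statement["Principal"]["AWS"] = principals

    # Replace the placeholder with the storage credential's external ID.
    statement["Condition"]["StringEquals"]["sts:ExternalId"] = external_id
    return updated


if __name__ == "__main__":
    # Hypothetical account ID, role name, and external ID.
    final = finalize_trust_policy(
        PLACEHOLDER_TRUST_POLICY, "123456789012", "my-uc-role", "example-external-id"
    )
    print(json.dumps(final, indent=2))
```

The function is idempotent with respect to the self-assuming ARN, so running it twice does not add a duplicate principal.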
    

(Optional) Assign a storage credential to specific workspaces

Preview

This feature is in Public Preview.

By default, a storage credential is accessible from all of the workspaces in the metastore. This means that if a user has been granted a privilege (such as CREATE EXTERNAL LOCATION) on that storage credential, they can exercise that privilege from any workspace attached to the metastore. If you use workspaces to isolate user data access, you may want to allow access to a storage credential only from specific workspaces. This feature is known as workspace binding or storage credential isolation.

A typical use case for binding a storage credential to specific workspaces is the scenario in which a cloud admin configures a storage credential using a production cloud account credential, and you want to ensure that Databricks users use this credential to create external locations only in the production workspace.

For more information about workspace binding, see (Optional) Assign an external location to specific workspaces and Limit catalog access to specific workspaces.

Note

Workspace bindings are referenced when privileges against storage credentials are exercised. For example, if a user creates an external location using a storage credential, the workspace binding on the storage credential is checked only when the external location is created. After the external location is created, it will function independently of the workspace bindings configured on the storage credential.

Bind a storage credential to one or more workspaces

To assign a storage credential to specific workspaces, you can use Catalog Explorer or the Databricks CLI.

Permissions required: Metastore admin or storage credential owner.

Note

Metastore admins can see all storage credentials in a metastore using Catalog Explorer, and storage credential owners can see all storage credentials that they own in a metastore, regardless of whether the storage credential is assigned to the current workspace. Storage credentials that are not assigned to the workspace appear grayed out.

  1. Log in to a workspace that is linked to the metastore.

  2. In the sidebar, click Catalog.

  3. At the top of the Catalog pane, click the gear icon and select Storage Credentials.

    Alternatively, from the Quick access page, click the External data > button and go to the Storage Credentials tab.

  4. Select the storage credential and go to the Workspaces tab.

  5. On the Workspaces tab, clear the All workspaces have access checkbox.

    If your storage credential is already bound to one or more workspaces, this checkbox is already cleared.

  6. Click Assign to workspaces and enter or find the workspaces you want to assign.

To revoke access, go to the Workspaces tab, select the workspace, and click Revoke. To allow access from all workspaces, select the All workspaces have access checkbox.

To assign a storage credential to a workspace using the Databricks CLI, you complete two steps that use two different command groups.

In the following examples, replace <profile-name> with the name of your Databricks authentication configuration profile. It should include the value of a personal access token, in addition to the workspace instance name and workspace ID of the workspace where you generated the personal access token. See Databricks personal access token authentication.

  1. Use the storage-credentials command group’s update command to set the storage credential’s isolation mode to ISOLATED:

    databricks storage-credentials update <my-storage-credential> \
    --isolation-mode ISOLATED \
    --profile <profile-name>
    

    The default isolation-mode is OPEN to all workspaces attached to the metastore.

  2. Use the workspace-bindings command group’s update-bindings command to assign the workspaces to the storage credential:

    databricks workspace-bindings update-bindings storage-credential <my-storage-credential> \
    --json '{
      "add": [{"workspace_id": <workspace-id>}...],
      "remove": [{"workspace_id": <workspace-id>}...]
    }' --profile <profile-name>
    

    Use the "add" and "remove" properties to add or remove workspace bindings.

    Note

    Read-only binding (BINDING_TYPE_READ_ONLY) is not available for storage credentials. Therefore there is no reason to set binding_type for the storage credentials binding.

To list all workspace assignments for a storage credential, use the workspace-bindings command group’s get-bindings command:

databricks workspace-bindings get-bindings storage-credential <my-storage-credential> \
--profile <profile-name>
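
When you script these CLI calls, the JSON payload for update-bindings is easy to get wrong by hand. The following minimal sketch assembles the argument lists for the commands shown above. The function names and example values are hypothetical, and the sketch only builds the commands; running them (for example with `subprocess.run`) is left to you and assumes that the Databricks CLI is installed and that the profile exists.

```python
import json


def build_set_isolated_command(credential_name, profile):
    """Argument list that sets the storage credential's isolation mode to ISOLATED."""
    return [
        "databricks", "storage-credentials", "update", credential_name,
        "--isolation-mode", "ISOLATED",
        "--profile", profile,
    ]


def build_update_bindings_command(credential_name, profile, add_ids=(), remove_ids=()):
    """Argument list that adds or removes workspace bindings for the credential."""
    payload = {}
    if add_ids:
        payload["add"] = [{"workspace_id": int(ws)} for ws in add_ids]
    if remove_ids:
        payload["remove"] = [{"workspace_id": int(ws)} for ws in remove_ids]
    return [
        "databricks", "workspace-bindings", "update-bindings",
        "storage-credential", credential_name,
        "--json", json.dumps(payload),
        "--profile", profile,
    ]


if __name__ == "__main__":
    # Hypothetical credential name, profile name, and workspace IDs.
    print(build_set_isolated_command("my-storage-credential", "my-profile"))
    print(build_update_bindings_command(
        "my-storage-credential", "my-profile", add_ids=[1234567890123456]
    ))
```

Passing the command as a list rather than a single shell string avoids quoting problems with the embedded JSON.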

Unbind a storage credential from a workspace

Instructions for revoking workspace access to a storage credential using Catalog Explorer or the workspace-bindings CLI command group are included in Bind a storage credential to one or more workspaces.

Next steps

You can view, update, delete, and grant other users permission to use storage credentials. See Manage storage credentials.

You can define external locations using storage credentials. See Connect to cloud object storage using Unity Catalog.