Access cross-account S3 buckets with an AssumeRole policy

In AWS you can set up cross-account access, so that compute resources in one account can access a bucket in another account. One way to grant access, described in Tutorial: Configure S3 access with an instance profile, is to grant an account direct access to a bucket in another account. Another way is to allow an account to assume a role in the other account.

Consider AWS Account A with account ID <deployment-acct-id> and AWS Account B with account ID <bucket-owner-acct-id>. Account A is the account used when you sign up with Databricks: EC2 services and the Databricks File System (DBFS) root bucket are managed by this account. Account B owns the bucket <s3-bucket-name>.

This article provides the steps to configure Account A to use the AWS AssumeRole action to access S3 files in <s3-bucket-name> as a role in Account B. To enable this access, you perform configuration in Account A, in Account B, and in the Databricks admin settings. You must also either configure a Databricks cluster or add configuration to a notebook that accesses the bucket.

Requirements

  • AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the S3 bucket.

  • Target S3 bucket.

  • If you intend to enable encryption for the S3 bucket, you must add the instance profile as a Key User for the KMS key provided in the configuration. See Configure encryption for S3 with KMS.

Step 1: In Account A, create role MyRoleA and attach policies

  1. Create a role named MyRoleA in Account A. The Instance Profile ARN is arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA.

  2. Create a policy that allows a role in Account A to assume MyRoleB in Account B, and attach it to MyRoleA. Click Inline policy and paste in the policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1487884001000",
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole"
          ],
          "Resource": [
            "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
          ]
        }
      ]
    }
    
  3. Update the policy for the Account A role used to create clusters, adding a statement that allows the iam:PassRole action for MyRoleA:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateIamInstanceProfile",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateKeyPair",
            "ec2:CreateRoute",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcPeeringConnection",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteKeyPair",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteTags",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeIamInstanceProfileAssociations",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:DisassociateIamInstanceProfile",
            "ec2:ModifyVpcAttribute",
            "ec2:ReplaceIamInstanceProfileAssociation",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
              "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": [
            "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
          ]
        }
      ]
    }
    

    Note

    If your account is on the E2 version of the Databricks platform, you can omit ec2:CreateKeyPair and ec2:DeleteKeyPair.

Step 2: In Account B, create role MyRoleB and attach policies

  1. Create a role named MyRoleB. The Role ARN is arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB.

  2. Edit the trust relationship of role MyRoleB to allow role MyRoleA in Account A to assume MyRoleB in Account B. Select IAM > Roles > MyRoleB > Trust relationships > Edit trust relationship and enter:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
  3. Create a bucket policy for the bucket <s3-bucket-name>. Select S3 > <s3-bucket-name> > Permissions > Bucket Policy. Include the role (Principal) MyRoleB in the bucket policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
                "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
            ]
          },
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>"
        },
        {
          "Effect": "Allow",
          "Principal": {
              "AWS": [
                  "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
              ]
          },
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
        }
      ]
    }
    

Tip

If you get a Principal error, make sure that you modified only the trust relationship policy.

Step 3: Add MyRoleA to the Databricks workspace

In the Databricks admin settings, add the instance profile MyRoleA to Databricks using the MyRoleA instance profile ARN arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA from step 1.
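If you prefer to script this step instead of using the admin settings UI, you can call the Databricks Instance Profiles API. The following Python sketch is illustrative only: it assumes your workspace URL and a personal access token are available in the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN (names chosen here for the example).

    import os
    import requests

    # Assumed environment variables; use your own secret management in practice.
    host = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
    token = os.environ["DATABRICKS_TOKEN"]

    # Register the MyRoleA instance profile with the workspace
    # (the scripted equivalent of adding it in the admin settings).
    resp = requests.post(
        f"{host}/api/2.0/instance-profiles/add",
        headers={"Authorization": f"Bearer {token}"},
        json={"instance_profile_arn": "arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA"},
    )
    resp.raise_for_status()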

Step 4: Configure cluster with MyRoleA

  1. Select or create a cluster.

  2. Open the Advanced Options section.

  3. On the Instances tab, select the instance profile MyRoleA.

  4. On the Spark tab, set the assumed-role credential provider and the ARN of role MyRoleB:

    Note

    Databricks Runtime 7.3 LTS and above support configuring the S3A filesystem by using open-source Hadoop options. You can configure global properties and per-bucket properties.

    To set it globally for all buckets:

    fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
    fs.s3a.assumed.role.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
    

    To set it for a specific bucket:

    fs.s3a.bucket.<s3-bucket-name>.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
    fs.s3a.bucket.<s3-bucket-name>.assumed.role.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
    
  5. Start the cluster.

  6. Attach a notebook to the cluster.

  7. Verify that you can access <s3-bucket-name> by running the following command:

    dbutils.fs.ls("s3a://<s3-bucket-name>/")
    
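Optionally, you can also confirm that the bucket policy from step 2 grants write access by round-tripping a small test file from the notebook. This is a quick sketch using dbutils; the path under tmp/ is arbitrary:

    # Write, read back, and delete a small test object to exercise
    # s3:PutObject, s3:GetObject, and s3:DeleteObject from the bucket policy.
    test_path = "s3a://<s3-bucket-name>/tmp/assume-role-test.txt"
    dbutils.fs.put(test_path, "assume-role test", overwrite=True)
    print(dbutils.fs.head(test_path))
    dbutils.fs.rm(test_path)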

Step 5: Mount cross-account bucket with AssumeRole

You can mount your cross-account bucket to use relative file paths for access to remote data. See Mount a bucket using instance profiles with the AssumeRole policy.
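A minimal sketch of such a mount from a notebook follows, assuming the cluster is configured with the MyRoleA instance profile as in step 4. The <mount-name> placeholder and the extra_configs keys are shown for illustration; confirm the exact keys against the linked mount documentation:

    # Mount the cross-account bucket using the AssumeRole settings.
    # <mount-name> is a placeholder for whatever mount point name you choose.
    dbutils.fs.mount(
        source="s3a://<s3-bucket-name>",
        mount_point="/mnt/<mount-name>",
        extra_configs={
            "fs.s3a.credentialsType": "AssumeRole",
            "fs.s3a.stsAssumeRole.arn": "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
        },
    )

    # Files in the bucket are now reachable through the relative mount path.
    display(dbutils.fs.ls("/mnt/<mount-name>/"))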

Automated configuration using Terraform

You can use the Databricks Terraform provider to automatically configure AWS IAM roles and their cluster attachment.

As shown in this example configuration, first define two variables:

variable "prefix" {
  default = "changeme"
}

variable "databricks_account_id" {
  description = "Account ID. You can get your account ID in the bottom left corner of the account console. See https://accounts.cloud.databricks.com"
}

Create a bucket using aws_s3_bucket:

resource "aws_s3_bucket" "ds" {
  bucket = "${var.prefix}-ds"
  acl    = "private"
  versioning {
    enabled = false
  }
  force_destroy = true
  tags = merge(var.tags, {
    Name = "${var.prefix}-ds"
  })
}

Create an IAM role for data access using aws_iam_role:

data "aws_iam_policy_document" "assume_role_for_ec2" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      identifiers = ["ec2.amazonaws.com"]
      type        = "Service"
    }
  }
}

resource "aws_iam_role" "data_role" {
  name               = "${var.prefix}-first-ec2s3"
  description        = "(${var.prefix}) EC2 Assume Role role for S3 access"
  assume_role_policy = data.aws_iam_policy_document.assume_role_for_ec2.json
  tags               = var.tags
}

Create a bucket policy with a databricks_aws_bucket_policy definition that gives full access to this bucket. Apply an inline S3 bucket policy to the newly-created bucket with aws_s3_bucket_policy:

data "databricks_aws_bucket_policy" "ds" {
  provider         = databricks.mws
  full_access_role = aws_iam_role.data_role.arn
  bucket           = aws_s3_bucket.ds.bucket
}

resource "aws_s3_bucket_policy" "ds" {
  bucket = aws_s3_bucket.ds.id
  policy = data.databricks_aws_bucket_policy.ds.json
}

Create a cross-account policy, which allows Databricks to pass a list of data roles using an aws_iam_policy:

data "databricks_aws_crossaccount_policy" "this" {
  pass_roles = [aws_iam_role.data_role.arn]
}

resource "aws_iam_policy" "cross_account_policy" {
  name   = "${var.prefix}-crossaccount-iam-policy"
  policy = data.databricks_aws_crossaccount_policy.this.json
}

Allow Databricks to perform actions within your account by configuring a trust relationship. Grant Databricks full access to VPC resources and attach the cross-account policy to the cross-account role with an aws_iam_role_policy_attachment:

data "databricks_aws_assume_role_policy" "this" {
    external_id = var.databricks_account_id
}

resource "aws_iam_role" "cross_account" {
  name               = "${var.prefix}-crossaccount-iam-role"
  assume_role_policy = data.databricks_aws_assume_role_policy.this.json
  description        = "Grants Databricks full access to VPC resources"
}

resource "aws_iam_role_policy_attachment" "cross_account" {
  policy_arn = aws_iam_policy.cross_account_policy.arn
  role       = aws_iam_role.cross_account.name
}

Register the cross-account role during E2 workspace setup:

resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.databricks_account_id
  credentials_name = "${var.prefix}-creds"
  role_arn         = aws_iam_role.cross_account.arn
}

Once the workspace is created, register your data role with aws_iam_instance_profile as a databricks_instance_profile:

resource "aws_iam_instance_profile" "this" {
  name = "${var.prefix}-first-profile"
  role = aws_iam_role.data_role.name
}

resource "databricks_instance_profile" "ds" {
  instance_profile_arn = aws_iam_instance_profile.this.arn
}

For the last step, create a /mnt/experiments mount point and a cluster with a specified instance profile:

resource "databricks_aws_s3_mount" "this" {
    instance_profile = databricks_instance_profile.ds.id
    s3_bucket_name = aws_s3_bucket.this.bucket
    mount_name = "experiments"
}

data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "shared_autoscaling" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20

  autoscale {
    min_workers = 1
    max_workers = 50
  }

  aws_attributes {
    instance_profile_arn = databricks_instance_profile.ds.id
  }
}
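
After terraform apply completes, you can confirm the setup from a notebook attached to the new cluster, for example by listing the mount point created above:

    # List the /mnt/experiments mount point created by databricks_aws_s3_mount.
    display(dbutils.fs.ls("/mnt/experiments"))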