Access cross-account S3 buckets with an AssumeRole policy
In AWS you can set up cross-account access so that compute in one account can access a bucket in another account. One way to grant access, described in Tutorial: Configure S3 access with an instance profile, is to grant an account direct access to a bucket in another account. Another way is to allow an account to assume a role in the other account.
Consider AWS Account A with account ID <deployment-acct-id> and AWS Account B with account ID <bucket-owner-acct-id>. Account A is used when signing up with Databricks: EC2 services and the Databricks Filesystem root bucket are managed by this account. Account B has a bucket <s3-bucket-name>.
This article provides the steps to configure Account A to use the AWS AssumeRole action to access S3 files in <s3-bucket-name> as a role in Account B. To enable this access, you perform configuration in Account A, in Account B, and in the Databricks admin settings. You must also either configure a Databricks cluster or add a configuration to a notebook that accesses the bucket.
Requirements
AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the S3 bucket.
Target S3 bucket.
If you intend to enable encryption for the S3 bucket, you must add the instance profile as a Key User for the KMS key provided in the configuration. See Configure encryption for S3 with KMS.
Step 1: In Account A, create role MyRoleA and attach policies
Create a role named MyRoleA in Account A. The Instance Profile ARN is arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA.
Create a policy that says that a role in Account A can assume MyRoleB in Account B. Attach it to MyRoleA. Paste in the policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1487884001000",
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
      ]
    }
  ]
}
Update the policy for the Account A role used to create clusters, adding the iam:PassRole action to MyRoleA:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1403287045000",
      "Effect": "Allow",
      "Action": [
        "ec2:AssociateDhcpOptions",
        "ec2:AssociateIamInstanceProfile",
        "ec2:AssociateRouteTable",
        "ec2:AttachInternetGateway",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateDhcpOptions",
        "ec2:CreateInternetGateway",
        "ec2:CreateKeyPair",
        "ec2:CreateRoute",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSubnet",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateVpc",
        "ec2:CreateVpcPeeringConnection",
        "ec2:DeleteInternetGateway",
        "ec2:DeleteKeyPair",
        "ec2:DeleteRoute",
        "ec2:DeleteRouteTable",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSubnet",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DeleteVpc",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribePrefixLists",
        "ec2:DescribeReservedInstancesOfferings",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcs",
        "ec2:DetachInternetGateway",
        "ec2:DisassociateIamInstanceProfile",
        "ec2:ModifyVpcAttribute",
        "ec2:ReplaceIamInstanceProfileAssociation",
        "ec2:RequestSpotInstances",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
      ]
    }
  ]
}
Note
If your account is on the E2 version of the Databricks platform, you can omit ec2:CreateKeyPair and ec2:DeleteKeyPair.
Step 2: In Account B, create role MyRoleB and attach policies
Create a role named MyRoleB. The Role ARN is arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB.
Edit the trust relationship of role MyRoleB to allow a role MyRoleA in Account A to assume a role in Account B. Select IAM > Roles > MyRoleB > Trust relationships > Edit trust relationship and enter:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Create a bucket policy for the bucket <s3-bucket-name>. Select S3 > <s3-bucket-name> > Permissions > Bucket Policy. Include the role (Principal) MyRoleB in the bucket policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
        ]
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<s3-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
        ]
      },
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
    }
  ]
}
Tip
If you are prompted with a Principal error, make sure that you modified only the Trust relationship policy.
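Optionally, you can verify the trust relationship and bucket policy outside of Databricks before continuing. The following is a minimal sketch, assuming it runs in an environment that already has MyRoleA credentials (for example, an EC2 instance launched with the MyRoleA instance profile) and that boto3 is installed; the role ARN and bucket name are the same placeholders used above:
import boto3

# Assume MyRoleB in Account B using the MyRoleA credentials available in this environment.
sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
    RoleSessionName="cross-account-check",
)
creds = assumed["Credentials"]

# List the bucket with the temporary credentials returned by AssumeRole.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
response = s3.list_objects_v2(Bucket="<s3-bucket-name>", MaxKeys=10)
print([obj["Key"] for obj in response.get("Contents", [])])
If this listing fails, fix the role policies before moving on to the Databricks configuration.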
Step 3: Add MyRoleA to the Databricks workspace
In the Databricks admin settings, add the instance profile MyRoleA to Databricks using the MyRoleA instance profile ARN arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA from step 1.
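If you prefer to script this step instead of using the admin settings UI, the workspace Instance Profiles API can register the profile. The following is a minimal sketch using the requests library; the workspace URL and personal access token are hypothetical placeholders:
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

# Register the MyRoleA instance profile with the workspace.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/instance-profiles/add",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"instance_profile_arn": "arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA"},
)
resp.raise_for_status()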
Step 4: Configure cluster with MyRoleA
Select or create a cluster.
Open the Advanced Options section.
On the Instances tab, select the instance profile MyRoleA.
On the Spark tab, set the assume role credential provider and role ARN MyRoleB:
Note
Databricks Runtime 7.3 LTS and above support configuring the S3A filesystem by using open-source Hadoop options. You can configure global properties and per-bucket properties.
To set it globally for all buckets:
fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
fs.s3a.assumed.role.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
To set it for a specific bucket:
fs.s3a.bucket.<s3-bucket-name>.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
fs.s3a.bucket.<s3-bucket-name>.assumed.role.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
Start the cluster.
Attach a notebook to the cluster.
Verify that you can access <s3-bucket-name> by running the following command:
dbutils.fs.ls("s3a://<s3-bucket-name>/")
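The introduction mentions that you can also add the configuration to a notebook instead of the cluster. One way to do that is to set the same Hadoop properties on the driver's Hadoop configuration from a notebook cell. This is only a sketch; cluster-level Spark configuration is the more reliable option because properties set at runtime are not guaranteed to reach every code path:
# Hypothetical notebook-level alternative to the cluster Spark configuration above.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
)
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.assumed.role.arn",
    "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
)

# Same verification as above.
display(dbutils.fs.ls("s3a://<s3-bucket-name>/"))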
Step 5: Mount cross-account bucket with AssumeRole
You can mount your cross-account bucket to use relative file paths for access to remote data. See Mount a bucket using instance profiles with the AssumeRole policy.
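As a rough illustration of what that looks like (see the linked article for the authoritative steps), a mount call can pass the same Hadoop options used in step 4 through extra_configs; the mount point name here is a hypothetical example:
# Sketch only: assumes the cluster is already configured with the MyRoleA instance profile.
dbutils.fs.mount(
    source = "s3a://<s3-bucket-name>",
    mount_point = "/mnt/<s3-bucket-name>",
    extra_configs = {
        "fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
        "fs.s3a.assumed.role.arn": "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
    },
)

# Files are then available through the relative mount path.
display(dbutils.fs.ls("/mnt/<s3-bucket-name>/"))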
Automated configuration using Terraform
You can use the Databricks Terraform provider to automatically configure AWS IAM roles and their cluster attachment.
As shown in this example configuration, first define two variables:
variable "prefix" {
default = "changeme"
}
variable "databricks_account_id" {
description = "Account ID. You can get your account ID in the bottom left corner of the account console. See https://accounts.cloud.databricks.com"
}
Create a bucket using aws_s3_bucket:
resource "aws_s3_bucket" "ds" {
bucket = "${var.prefix}-ds"
acl = "private"
versioning {
enabled = false
}
force_destroy = true
tags = merge(var.tags, {
Name = "${var.prefix}-ds"
})
}
Create an IAM role for data access using aws_iam_role:
data "aws_iam_policy_document" "assume_role_for_ec2" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
identifiers = ["ec2.amazonaws.com"]
type = "Service"
}
}
}
resource "aws_iam_role" "data_role" {
name = "${var.prefix}-first-ec2s3"
description = "(${var.prefix}) EC2 Assume Role role for S3 access"
assume_role_policy = data.aws_iam_policy_document.assume_role_for_ec2.json
tags = var.tags
}
Create a bucket policy with a databricks_aws_bucket_policy definition that gives full access to this bucket. Apply an inline S3 bucket policy to the newly-created bucket with aws_s3_bucket_policy:
data "databricks_aws_bucket_policy" "ds" {
provider = databricks.mws
full_access_role = aws_iam_role.data_role.arn
bucket = aws_s3_bucket.ds.bucket
}
resource "aws_s3_bucket_policy" "ds" {
bucket = aws_s3_bucket.ds.id
policy = data.databricks_aws_bucket_policy.ds.json
}
Create a cross-account policy, which allows Databricks to pass a list of data roles using an aws_iam_policy:
data "databricks_aws_crossaccount_policy" "this" {
pass_roles = [aws_iam_role.data_role.arn]
}
resource "aws_iam_policy" "cross_account_policy" {
name = "${var.prefix}-crossaccount-iam-policy"
policy = data.databricks_aws_crossaccount_policy.this.json
}
Allow Databricks to perform actions within your account by configuring a trust relationship with the databricks_aws_assume_role_policy data source. Grant Databricks full access to VPC resources and attach the cross-account policy to the cross-account role with an aws_iam_role_policy_attachment:
data "databricks_aws_assume_role_policy" "this" {
external_id = var.databricks_account_id
}
resource "aws_iam_role" "cross_account" {
name = "${var.prefix}-crossaccount-iam-role"
assume_role_policy = data.databricks_aws_assume_role_policy.this.json
description = "Grants Databricks full access to VPC resources"
}
resource "aws_iam_role_policy_attachment" "cross_account" {
policy_arn = aws_iam_policy.cross_account_policy.arn
role = aws_iam_role.cross_account.name
}
Register the cross-account role in the E2 workspace setup:
resource "databricks_mws_credentials" "this" {
provider = databricks.mws
account_id = var.databricks_account_id
credentials_name = "${var.prefix}-creds"
role_arn = aws_iam_role.cross_account.arn
}
Once the workspace is created, register your data role with an aws_iam_instance_profile and expose it to Databricks as a databricks_instance_profile:
resource "aws_iam_instance_profile" "this" {
name = "${var.prefix}-first-profile"
role = aws_iam_role.data_role.name
}
resource "databricks_instance_profile" "ds" {
instance_profile_arn = aws_iam_instance_profile.this.arn
}
For the last step, create a /mnt/experiments mount point and a cluster with a specified instance profile:
resource "databricks_aws_s3_mount" "this" {
instance_profile = databricks_instance_profile.ds.id
  s3_bucket_name = aws_s3_bucket.ds.bucket
mount_name = "experiments"
}
data "databricks_node_type" "smallest" {
local_disk = true
}
data "databricks_spark_version" "latest_lts" {
long_term_support = true
}
resource "databricks_cluster" "shared_autoscaling" {
cluster_name = "Shared Autoscaling"
spark_version = data.databricks_spark_version.latest_lts.id
node_type_id = data.databricks_node_type.smallest.id
autotermination_minutes = 20
autoscale {
min_workers = 1
max_workers = 50
}
aws_attributes {
instance_profile_arn = databricks_instance_profile.ds.id
}
}
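Once Terraform has applied this configuration, a notebook attached to a cluster that uses this instance profile can reach the bucket through the mount point, for example:
# List the bucket contents through the /mnt/experiments mount created by Terraform.
display(dbutils.fs.ls("/mnt/experiments"))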