Secure access to S3 buckets across accounts using instance profiles with an AssumeRole policy
In AWS you can set up cross-account access, so that compute resources in one account can access a bucket in another account. One way to grant access, described in Secure access to S3 buckets using instance profiles, is to grant an account direct access to a bucket in another account. Another way is to allow an account to assume a role in the other account.
Consider AWS Account A with account ID <deployment-acct-id> and AWS Account B with account ID <bucket-owner-acct-id>. Account A is used when signing up with Databricks: EC2 services and the DBFS root bucket are managed by this account. Account B has a bucket <s3-bucket-name>.

This article provides the steps to configure Account A to use the AWS AssumeRole action to access S3 files in <s3-bucket-name> as a role in Account B. To enable this access you perform configuration in Account A, in Account B, and in the Databricks Admin Console. You must also either configure a Databricks cluster or add a configuration to a notebook that accesses the bucket.
Requirements
- AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the S3 bucket.
- Target S3 bucket.
- If you intend to enable encryption for the S3 bucket, you must add the instance profile as a Key User for the KMS key provided in the configuration. See Configure KMS encryption for s3a:// paths.
Step 1: In Account A, create role MyRoleA and attach policies
1. Create a role named MyRoleA in Account A. The Instance Profile ARN is arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA.

2. Create a policy that allows a role in Account A to assume MyRoleB in Account B, and attach it to MyRoleA. Paste in the policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1487884001000",
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole"
          ],
          "Resource": [
            "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
          ]
        }
      ]
    }

3. Update the policy for the Account A role used to create clusters, adding the iam:PassRole action for MyRoleA:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateIamInstanceProfile",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateKeyPair",
            "ec2:CreatePlacementGroup",
            "ec2:CreateRoute",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcPeeringConnection",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteKeyPair",
            "ec2:DeletePlacementGroup",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteTags",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeIamInstanceProfileAssociations",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribePlacementGroups",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:DisassociateIamInstanceProfile",
            "ec2:ModifyVpcAttribute",
            "ec2:ReplaceIamInstanceProfileAssociation",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
            "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": [
            "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
          ]
        }
      ]
    }
Step 2: In Account B, create role MyRoleB and attach policies
1. Create a role named MyRoleB. The Role ARN is arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB.

2. Edit the trust relationship of role MyRoleB to allow the role MyRoleA in Account A to assume a role in Account B. Select IAM > Roles > MyRoleB > Trust relationships > Edit trust relationship and enter:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
3. Create a bucket policy for the bucket <s3-bucket-name>. Select S3 > <s3-bucket-name> > Permissions > Bucket Policy and enter:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::<s3-bucket-name>"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::<s3-bucket-name>/*"
          ]
        }
      ]
    }
4. Add the role MyRoleB as the Principal in the bucket policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
            ]
          },
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>"
        },
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
            ]
          },
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
        }
      ]
    }
Tip

If you are prompted with a Principal error, make sure that you modified only the Trust relationship policy.
Step 3: Add MyRoleA to the Databricks workspace
In the Databricks Admin Console, add the instance profile MyRoleA to Databricks using the instance profile ARN arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA from Step 1.
Step 4: Configure a cluster with MyRoleA
1. Select or create a cluster.

2. Open the Advanced Options section.

3. On the Instances tab, select the instance profile MyRoleA.

4. On the Spark tab, optionally set the AssumeRole credential type and the ARN of the role to assume, MyRoleB:

    spark.hadoop.fs.s3a.credentialsType AssumeRole
    spark.hadoop.fs.s3a.stsAssumeRole.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
    spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl

5. Start the cluster.

6. Attach a notebook to the cluster.
7. Do one of the following:

   - Mount <s3-bucket-name> on DBFS with extra configurations:

       dbutils.fs.mount("s3a://<s3-bucket-name>", "/mnt/<s3-bucket-name>",
         extraConfigs = Map(
           "fs.s3a.credentialsType" -> "AssumeRole",
           "fs.s3a.stsAssumeRole.arn" -> "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
           "spark.hadoop.fs.s3a.acl.default" -> "BucketOwnerFullControl"
         )
       )

     Note: This is the recommended option.
   - If you did not set the AssumeRole credential type and role ARN in the Spark configuration of the cluster, and did not mount the S3 bucket, set them in the first command of the notebook:

       sc.hadoopConfiguration.set("fs.s3a.credentialsType", "AssumeRole")
       sc.hadoopConfiguration.set("fs.s3a.stsAssumeRole.arn", "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB")
       sc.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")
8. Verify that you can access <s3-bucket-name> using one of the following commands (a short read and write sketch follows this procedure):

       dbutils.fs.ls("/mnt/<s3-bucket-name>")

   or

       dbutils.fs.ls("s3a://<s3-bucket-name>/")