Secure access to S3 buckets across accounts using instance profiles with an AssumeRole policy

In AWS you can set up cross-account access so that compute resources in one account can access a bucket in another account. One way to grant access, described in Secure access to S3 buckets using instance profiles, is to grant an account direct access to a bucket in another account. Another way is to allow an account to assume a role in another account.

Consider AWS Account A with account ID <deployment-acct-id> and AWS Account B with account ID <bucket-owner-acct-id>. Account A is used when signing up with Databricks: EC2 services and the DBFS root bucket are managed by this account. Account B has a bucket <s3-bucket-name>.

This article provides the steps to configure Account A to use the AWS AssumeRole action to access S3 files in <s3-bucket-name> by assuming a role in Account B. To enable this access, you perform configuration in Account A, in Account B, and in the Databricks Admin Console. You must also either configure a Databricks cluster with the required Spark properties or add that configuration to a notebook that accesses the bucket.
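
To make the flow concrete, the following boto3 sketch shows what happens under the hood. It is illustrative only: the S3A credential provider you configure in Step 4 performs the equivalent calls for you. The sketch assumes the code runs on a cluster whose instance profile is MyRoleA; the session name is arbitrary.

import boto3

# Illustrative only: the AssumedRoleCredentialProvider configured in Step 4
# performs the equivalent of these calls automatically.

# boto3 picks up MyRoleA's credentials from the cluster's instance profile.
sts = boto3.client("sts")

# Assume MyRoleB in Account B; the session name is arbitrary.
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
    RoleSessionName="databricks-cross-account-access",
)
creds = assumed["Credentials"]

# Use MyRoleB's temporary credentials to access the bucket in Account B.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket="<s3-bucket-name>", MaxKeys=10))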

Requirements

  • AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the S3 bucket.

  • Target S3 bucket.

  • If you intend to enable encryption for the S3 bucket, you must add the instance profile as a Key User for the KMS key provided in the configuration. See Configure KMS encryption for s3a:// paths.

Step 1: In Account A, create role MyRoleA and attach policies

  1. Create a role named MyRoleA in Account A. The Instance Profile ARN is arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA.

  2. Create a policy that allows MyRoleA to assume MyRoleB in Account B, and attach it to MyRoleA. Click Inline policy and paste in the following policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1487884001000",
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole"
          ],
          "Resource": [
            "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
          ]
        }
      ]
    }
    
  3. Update the policy of the Account A role that is used to create clusters, adding a statement that allows the iam:PassRole action on MyRoleA:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateIamInstanceProfile",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateKeyPair",
            "ec2:CreateRoute",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcPeeringConnection",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteKeyPair",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteTags",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeIamInstanceProfileAssociations",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:DisassociateIamInstanceProfile",
            "ec2:ModifyVpcAttribute",
            "ec2:ReplaceIamInstanceProfileAssociation",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
              "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": [
            "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
          ]
        }
      ]
    }
    

    Note

    If your account is on the E2 version of the Databricks platform, you can omit ec2:CreateKeyPair and ec2:DeleteKeyPair. If you are not sure of your account’s version, contact your Databricks representative.

Step 2: In Account B, create role MyRoleB and attach policies

  1. Create a role named MyRoleB. The Role ARN is arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB.

  2. Edit the trust relationship of role MyRoleB to allow role MyRoleA in Account A to assume MyRoleB. Select IAM > Roles > MyRoleB > Trust relationships > Edit trust relationship and enter:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::<deployment-acct-id>:role/MyRoleA"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
  3. Create a bucket policy for the bucket <s3-bucket-name>. Select S3 > <s3-bucket-name> > Permissions > Bucket Policy. Include the role (Principal) MyRoleB in the bucket policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
                "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
            ]
          },
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>"
        },
        {
          "Effect": "Allow",
          "Principal": {
              "AWS": [
                  "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
              ]
          },
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
        }
      ]
    }
    

Tip

If you get a Principal error, make sure that you modified only the trust relationship policy.

Step 3: Add MyRoleA to the Databricks workspace

In the Databricks Admin Console, add the instance profile MyRoleA to Databricks, using its instance profile ARN arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA from Step 1.
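
If you prefer to automate this step, you can also register the instance profile with the workspace Instance Profiles API (POST /api/2.0/instance-profiles/add). The following is a minimal sketch, assuming your workspace URL and a personal access token are available in the placeholder environment variables WORKSPACE_URL and DATABRICKS_TOKEN:

import os
import requests

# Register the MyRoleA instance profile via the Instance Profiles API.
# WORKSPACE_URL and DATABRICKS_TOKEN are placeholder environment variable names
# for your workspace URL and a personal access token.
resp = requests.post(
    f"{os.environ['WORKSPACE_URL']}/api/2.0/instance-profiles/add",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "instance_profile_arn": "arn:aws:iam::<deployment-acct-id>:instance-profile/MyRoleA"
    },
)
resp.raise_for_status()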

Step 4: Configure cluster with MyRoleA

  1. Select or create a cluster.

  2. Open the Advanced Options section.

  3. On the Instances tab, select the instance profile MyRoleA.

  4. On the Spark tab, set the assumed role credential provider and the ARN of role MyRoleB (a notebook-scoped alternative is sketched after this list):

    Note

    Databricks Runtime 7.3 LTS and above support configuring the S3A filesystem by using open-source Hadoop options. You can configure global properties and per-bucket properties.

    To set it globally for all buckets:

    spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
    spark.hadoop.fs.s3a.assumed.role.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
    

    To set it for a specific bucket:

    spark.hadoop.fs.s3a.bucket.<s3-bucket-name>.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
    spark.hadoop.fs.s3a.bucket.<s3-bucket-name>.assumed.role.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
    
  5. Start the cluster.

  6. Attach a notebook to the cluster.

  7. Verify that you can access <s3-bucket-name> by running the following command:

    dbutils.fs.ls("s3a://<s3-bucket-name>/")
    
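As noted earlier, instead of setting these properties in the cluster Spark configuration, you can apply the same per-bucket settings from a notebook before the first access to the bucket. The following sketch mirrors the per-bucket properties above and adds an optional write/read round trip; the test path under the bucket is illustrative only, and it reaches the Hadoop configuration through the internal sc._jsc handle:

# Notebook-scoped alternative to the cluster Spark configuration above.
# Run this before the first read from the bucket in the cluster session,
# because S3A caches filesystem instances per bucket.
hconf = sc._jsc.hadoopConfiguration()
hconf.set(
    "fs.s3a.bucket.<s3-bucket-name>.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
)
hconf.set(
    "fs.s3a.bucket.<s3-bucket-name>.assumed.role.arn",
    "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
)

# Optional round trip; the path under the bucket is illustrative only.
test_path = "s3a://<s3-bucket-name>/tmp/assume-role-test.txt"
dbutils.fs.put(test_path, "cross-account access works", True)
print(dbutils.fs.head(test_path))
dbutils.fs.rm(test_path)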

Automated configuration using Terraform

You can use the Databricks Terraform provider to automatically configure AWS IAM roles and their cluster attachment.

As shown in this example configuration, first define two variables:

variable "prefix" {
  default = "changeme"
}

variable "databricks_account_id" {
  description = "Account ID. You can get your account ID in the bottom left corner of the account console. See https://accounts.cloud.databricks.com"
}

Create a bucket using aws_s3_bucket:

resource "aws_s3_bucket" "ds" {
  bucket = "${var.prefix}-ds"
  acl    = "private"
  versioning {
    enabled = false
  }
  force_destroy = true
  tags = merge(var.tags, {
    Name = "${var.prefix}-ds"
  })
}

Create an IAM role for data access using aws_iam_role:

data "aws_iam_policy_document" "assume_role_for_ec2" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      identifiers = ["ec2.amazonaws.com"]
      type        = "Service"
    }
  }
}

resource "aws_iam_role" "data_role" {
  name               = "${var.prefix}-first-ec2s3"
  description        = "(${var.prefix}) EC2 Assume Role role for S3 access"
  assume_role_policy = data.aws_iam_policy_document.assume_role_for_ec2.json
  tags               = var.tags
}

Create a bucket policy that gives full access to this bucket with the databricks_aws_bucket_policy data source, and apply it as an inline S3 bucket policy to the newly created bucket with aws_s3_bucket_policy:

data "databricks_aws_bucket_policy" "ds" {
  provider         = databricks.mws
  full_access_role = aws_iam_role.data_role.arn
  bucket           = aws_s3_bucket.ds.bucket
}

resource "aws_s3_bucket_policy" "ds" {
  bucket = aws_s3_bucket.ds.id
  policy = data.databricks_aws_bucket_policy.ds.json
}

Create a cross-account policy, which allows Databricks to pass the data roles listed in pass_roles, using aws_iam_policy:

data "databricks_aws_crossaccount_policy" "this" {
  pass_roles = [aws_iam_role.data_role.arn]
}

resource "aws_iam_policy" "cross_account_policy" {
  name   = "${var.prefix}-crossaccount-iam-policy"
  policy = data.databricks_aws_crossaccount_policy.this.json
}

Allow Databricks to perform actions within your account by configuring a trust relationship on a cross-account role that grants Databricks full access to VPC resources, and attach the cross-account policy to that role with aws_iam_role_policy_attachment:

data "databricks_aws_assume_role_policy" "this" {
    external_id = var.databricks_account_id
}

resource "aws_iam_role" "cross_account" {
  name               = "${var.prefix}-crossaccount-iam-role"
  assume_role_policy = data.databricks_aws_assume_role_policy.this.json
  description        = "Grants Databricks full access to VPC resources"
}

resource "aws_iam_role_policy_attachment" "cross_account" {
  policy_arn = aws_iam_policy.cross_account_policy.arn
  role       = aws_iam_role.cross_account.name
}

Register the cross-account role in the E2 workspace setup:

resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.databricks_account_id
  credentials_name = "${var.prefix}-creds"
  role_arn         = aws_iam_role.cross_account.arn
}

Once the workspace is created, wrap your data role in an aws_iam_instance_profile and register it as a databricks_instance_profile:

resource "aws_iam_instance_profile" "this" {
  name = "${var.prefix}-first-profile"
  role = aws_iam_role.data_role.name
}

resource "databricks_instance_profile" "ds" {
  instance_profile_arn = aws_iam_instance_profile.this.arn
}

For the last step, create a /mnt/experiments mount point and a cluster with a specified instance profile:

resource "databricks_aws_s3_mount" "this" {
    instance_profile = databricks_instance_profile.ds.id
    s3_bucket_name = aws_s3_bucket.this.bucket
    mount_name = "experiments"
}

data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "shared_autoscaling" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20

  autoscale {
    min_workers = 1
    max_workers = 50
  }

  aws_attributes {
    instance_profile_arn = databricks_instance_profile.ds.id
  }
}