Secure Access to S3 Buckets Using IAM Roles

An IAM role is an AWS identity with permission policies that determine what the identity can and cannot do in AWS. By launching Databricks clusters with IAM roles, you can access AWS resources securely and read your data in S3 without embedding AWS keys in notebooks. This topic explains how to set up IAM roles and use them in Databricks to securely access S3 buckets.
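For example, on a cluster launched with an IAM role (see Step 6), a notebook can list and read an S3 bucket directly, with no keys in the notebook; the bucket name and path below are placeholders:

    # Runs in a Databricks notebook on a cluster that has the IAM role attached.
    # <s3-bucket-name> and <path-to-data> are placeholders for your bucket and data.
    display(dbutils.fs.ls("s3a://<s3-bucket-name>/"))             # list the bucket
    df = spark.read.text("s3a://<s3-bucket-name>/<path-to-data>") # read data with no credentials in code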

Requirements

  • AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the S3 bucket.
  • Target S3 bucket. This bucket must belong to the same AWS account as the Databricks deployment or there must be a cross-account bucket policy that allows access to this bucket from the AWS account of the Databricks deployment.
  • If you intend to enable encryption for the S3 bucket, you must add the IAM role as a Key User for the KMS key provided in the configuration. See Configure KMS encryption.

Step 1: Create an IAM role and policy to access an S3 bucket

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click Create role.

    1. Under Select type of trusted entity, select AWS service.

    2. Click the EC2 service.

    3. Under Select your use case, click EC2.

    4. Click Next: Permissions and click Next: Review.

    5. In the Role name field, type a role name.

    6. Click Create role. The list of roles displays.

  4. In the role list, click the role.

  5. Add an inline policy to the S3 bucket.

    1. In the Permissions tab, click Inline policy.

    2. Click the JSON tab.

    3. Copy this policy and set <s3-bucket-name> to the name of your bucket.

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "s3:ListBucket"
            ],
            "Resource": [
              "arn:aws:s3:::<s3-bucket-name>"
            ]
          },
          {
            "Effect": "Allow",
            "Action": [
              "s3:PutObject",
              "s3:GetObject",
              "s3:DeleteObject",
              "s3:PutObjectAcl"
            ],
            "Resource": [
              "arn:aws:s3:::<s3-bucket-name>/*"
            ]
          }
        ]
      }
      
    4. Click Review policy.

    5. In the Name field, type a policy name.

    6. Click Create policy.

  6. In the role summary, copy the Instance Profile ARN.

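If you prefer to script Step 1 instead of using the console, the sketch below uses boto3 to create the role, an instance profile with the same name, and the inline S3 policy. The role, policy, and bucket names are placeholders, and the snippet assumes credentials with IAM administrator access.

    import json
    import boto3

    iam = boto3.client("iam")

    role_name = "<iam-role-for-s3-access>"   # placeholder role name
    bucket = "<s3-bucket-name>"              # placeholder bucket name

    # Trust policy that lets EC2 instances assume the role (what the console
    # configures when you choose the EC2 use case).
    assume_role_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }
    iam.create_role(RoleName=role_name,
                    AssumeRolePolicyDocument=json.dumps(assume_role_policy))

    # The console creates an instance profile with the same name automatically;
    # when scripting, create and attach it explicitly.
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(InstanceProfileName=role_name, RoleName=role_name)

    # Inline policy granting access to the target bucket (same as the JSON above).
    s3_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:ListBucket"],
             "Resource": [f"arn:aws:s3:::{bucket}"]},
            {"Effect": "Allow",
             "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:PutObjectAcl"],
             "Resource": [f"arn:aws:s3:::{bucket}/*"]}
        ]
    }
    iam.put_role_policy(RoleName=role_name, PolicyName="<policy-name>",
                        PolicyDocument=json.dumps(s3_policy))

    # Print the Instance Profile ARN needed in Step 5.
    print(iam.get_instance_profile(InstanceProfileName=role_name)["InstanceProfile"]["Arn"])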

Step 2: Create a bucket policy for the target S3 bucket

At a minimum, the S3 policy must include the ListBucket and GetObject actions.

  1. In the S3 console, go to the target bucket, click the Permissions tab, and then click Bucket Policy.

  2. Paste in a policy. A sample cross-account bucket policy could be the following, replacing <aws-account-id-databricks> with the AWS account ID where the Databricks environment is deployed, <iam-role-for-s3-access> with the role you created in Step 1, and <s3-bucket-name> with the bucket name.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Example permissions",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
          },
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>"
        },
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
          },
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
        }
      ]
    }
    
  3. Click Save.
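If you prefer to apply the bucket policy programmatically, a minimal sketch using boto3 from the bucket owner's account follows; the account ID, role, and bucket names are the same placeholders used above.

    import json
    import boto3

    s3 = boto3.client("s3")  # credentials must belong to the bucket owner's account

    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ExamplePermissions",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"},
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": "arn:aws:s3:::<s3-bucket-name>"
            },
            {
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"},
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
            }
        ]
    }

    # Note: this overwrites any existing bucket policy with the one above.
    s3.put_bucket_policy(Bucket="<s3-bucket-name>", Policy=json.dumps(bucket_policy))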

Step 3: Note the IAM role used to create the Databricks deployment

This IAM role is the role you used when setting up the Databricks account.

  1. As the account owner, log in to the Account Console.

  2. Click the AWS Account tab.

  3. Note the role name at the end of the Role ARN, here testco-role.


Step 4: Add the S3 IAM role to the EC2 policy

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click the role you noted in Step 3.

  4. On the Permissions tab, click the policy.

  5. Click Edit Policy.

  6. Modify the policy to allow Databricks to pass the IAM role you created in Step 1 to the EC2 instances for the Spark clusters. Here is an example of what the new policy should look like. Replace <iam-role-for-s3-access> with the role you created in Step 1 and <aws-account-id-databricks> with the AWS account ID where the Databricks environment is deployed:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateKeyPair",
            "ec2:CreatePlacementGroup",
            "ec2:CreateRoute",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcPeeringConnection",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteKeyPair",
            "ec2:DeletePlacementGroup",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribePlacementGroups",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:ModifyVpcAttribute",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
            "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
        }
      ]
    }
    
  7. Click Review policy.

  8. Click Save changes.
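The same edit can be scripted. The sketch below assumes the EC2 permissions are attached to the deployment role from Step 3 as an inline policy (as in the console flow above); the role and policy names are placeholders.

    import json
    import boto3

    iam = boto3.client("iam")

    deployment_role = "<deployment-role-from-step-3>"   # the role you noted in Step 3
    policy_name = "<ec2-policy-name>"                   # the inline policy edited above

    # Fetch the current inline policy document (boto3 returns it as a dict).
    doc = iam.get_role_policy(RoleName=deployment_role,
                              PolicyName=policy_name)["PolicyDocument"]

    # Append the iam:PassRole statement for the S3 access role from Step 1.
    doc["Statement"].append({
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
    })

    iam.put_role_policy(RoleName=deployment_role, PolicyName=policy_name,
                        PolicyDocument=json.dumps(doc))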

Step 5: Add the S3 IAM role to Databricks

  1. Go to the Admin Console.

  2. Select the IAM Roles tab.

  3. Click the Add IAM Role button. A dialog displays.

  4. Paste in the Instance Profile ARN from Step 1.


    Databricks validates that this Instance Profile ARN is both syntactically and semantically correct. To validate semantic correctness, Databricks does a dry run by launching a cluster with this IAM role. Any failure in this dry run produces a validation error in the UI.

    Note

    Validation of the IAM role can fail if the role includes a tag-enforcement policy, which prevents you from adding an otherwise legitimate IAM role. If validation fails and you still want to add the role to Databricks, use the Instance Profiles API and specify skip_validation, as sketched after this step.

  5. Click Add.

  6. Optionally specify the users who can launch clusters with the IAM role.

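If validation prevents you from adding a legitimate role, you can call the Instance Profiles API directly with skip_validation; a minimal sketch using a personal access token follows (the workspace URL and token are placeholders):

    import requests

    workspace_url = "https://<databricks-instance>"   # for example, https://<your-workspace>.cloud.databricks.com
    token = "<personal-access-token>"

    resp = requests.post(
        f"{workspace_url}/api/2.0/instance-profiles/add",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "instance_profile_arn": "arn:aws:iam::<aws-account-id-databricks>:instance-profile/<iam-role-for-s3-access>",
            "skip_validation": True   # register the profile even if the dry-run cluster launch would fail
        },
    )
    resp.raise_for_status()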

Step 6: Launch a cluster with the S3 IAM role

  1. On the Instances tab on the cluster creation page, select the IAM role from the IAM Role drop-down list. This drop-down includes all of the IAM roles that are available for the cluster.

  2. Verify that you can access the S3 bucket, using the following command:

    dbutils.fs.ls("s3a://<s3-bucket-name>/")
    

    If the command succeeds, go to Step 7.
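The cluster can also be created through the Clusters API by passing the instance profile in aws_attributes; a minimal sketch follows (the spark_version and node_type_id values are placeholders you must fill in for your workspace):

    import requests

    workspace_url = "https://<databricks-instance>"
    token = "<personal-access-token>"

    resp = requests.post(
        f"{workspace_url}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "cluster_name": "s3-iam-role-cluster",
            "spark_version": "<spark-version>",   # list valid values with /api/2.0/clusters/spark-versions
            "node_type_id": "<node-type-id>",     # list valid values with /api/2.0/clusters/list-node-types
            "num_workers": 1,
            "aws_attributes": {
                "instance_profile_arn": "arn:aws:iam::<aws-account-id-databricks>:instance-profile/<iam-role-for-s3-access>"
            },
        },
    )
    print(resp.json())   # returns the new cluster_id on success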

Important

Once a cluster launches with an IAM role, anyone who has attach permission on that cluster can access the underlying resources that the role controls. To guard against unwanted access, use cluster ACLs to restrict who can attach notebooks to the cluster.

Step 7: Update cross-account S3 object ACLs

If you are writing only to S3 buckets within the same AWS account as the Databricks deployment, you can stop here; you do not need the following configuration to update S3 object ACLs.

When you write a file to a cross-account S3 bucket, the default setting allows only you to access that file. The assumption is that you will write files to your own buckets, and this default protects your data. To give the bucket owner access to all of the data in the bucket, you must set the BucketOwnerFullControl canned ACL on the objects written by Databricks.

  1. On the Spark tab on the cluster detail page, set the following properties:

    spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
    spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
    
  2. Verify that you can write data to the S3 bucket, and check that the permissions enable other tools and users to access the contents written by Databricks.

  3. Use the code below to deploy the settings globally.

    %python
    dbutils.fs.put("/databricks/init/config-s3-cross-account-acls.sh", """
    cat >/databricks/driver/conf/s3-cross-account-spark.conf <<EOL
    [driver] {
        # S3 cross account access configs to set proper ACLs
        "spark.hadoop.fs.s3.impl" = "com.databricks.s3a.S3AFileSystem"
        "spark.hadoop.fs.s3a.impl" = "com.databricks.s3a.S3AFileSystem"
        "spark.hadoop.fs.s3n.impl" = "com.databricks.s3a.S3AFileSystem"
        "spark.hadoop.fs.s3a.canned.acl" = "BucketOwnerFullControl"
        "spark.hadoop.fs.s3a.acl.default" = "BucketOwnerFullControl"
    }
    EOL
    """, True)
    
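To confirm the ACLs (step 2 above), one approach is to write a test object from a notebook on the cluster and then inspect its ACL from the bucket owner's account with boto3; the object key below is only an example.

    # In a Databricks notebook on the cluster (after setting the Spark properties above),
    # write a test object to the cross-account bucket:
    dbutils.fs.put("s3a://<s3-bucket-name>/tmp/acl-check.txt", "acl test", True)

    # Then, from the bucket owner's account, confirm the owner received FULL_CONTROL:
    import boto3
    s3 = boto3.client("s3")
    acl = s3.get_object_acl(Bucket="<s3-bucket-name>", Key="tmp/acl-check.txt")
    for grant in acl["Grants"]:
        grantee = grant["Grantee"].get("DisplayName") or grant["Grantee"].get("URI") or grant["Grantee"].get("ID")
        print(grantee, grant["Permission"])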

Frequently asked questions (FAQ)

I don’t see any IAM roles configured for my access when I create a cluster.
If you are an admin, go to the Admin Console and follow the instructions in this topic to add an IAM role. If you are not an admin, contact your admin, who can add an IAM role using the instructions in this topic.
I am using the s3n or s3 URI schemes and I can’t access S3 resources without passing in credentials, even though my IAM role allows access to the resources.

To use IAM roles when accessing s3 and s3n URIs, add the following configuration parameters to the Spark configuration when you launch a cluster:

spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
I am using mount points to store credentials. How do mount points work on clusters with IAM roles?

Existing mount points work as they do on clusters that don’t use IAM roles. When you launch a Spark 1.6.3-dbX or Spark 2.0+ cluster with an IAM role, you can also mount an S3 bucket without passing credentials, using:

dbutils.fs.mount("s3a://<s3-bucket-name>", "/mnt/<mount-name>")