Secure access to S3 buckets using instance profiles

An IAM role is an AWS identity with permission policies that determine what the identity can and cannot do in AWS. An instance profile is a container for an IAM role that you can use to pass the role information to an EC2 instance when the instance starts.

To access AWS resources securely, you can launch Databricks clusters with instance profiles, which let you access your data from Databricks clusters without embedding your AWS keys in notebooks. This article explains how to set up instance profiles and use them in Databricks to access S3 buckets securely.

Note

An alternative to using instance profiles for access to S3 buckets from Databricks clusters is IAM credential passthrough, which passes an individual user’s IAM role to Databricks and uses that IAM role to determine access to data in S3. This allows multiple users with different data access policies to share a Databricks cluster. Instance profiles, by contrast, are associated with only one IAM role, which requires that all users of a Databricks cluster share that role and its data access policies. For more information, see Secure access to S3 buckets using IAM credential passthrough with Databricks SCIM.

Requirements

  • AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the S3 bucket.
  • Target S3 bucket. This bucket must belong to the same AWS account as the Databricks deployment or there must be a cross-account bucket policy that allows access to this bucket from the AWS account of the Databricks deployment.
  • If you intend to enable encryption for the S3 bucket, you must add the IAM role as a Key User for the KMS key provided in the configuration. See Configure KMS encryption.

Step 1: Create an instance profile to access an S3 bucket

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click Create role.

    1. Under Select type of trusted entity, select AWS service.

    2. Under Choose the service that will use this role, select EC2.

    3. Click Next: Permissions, Next: Tags, and Next: Review.

    4. In the Role name field, type a role name.

    5. Click Create role. The list of roles displays.

  4. In the role list, click the role.

  5. Add an inline policy to the role. This policy grants access to the S3 bucket.

    1. In the Permissions tab, click Add Inline policy.

    2. Click the JSON tab.

    3. Copy this policy and set <s3-bucket-name> to the name of your bucket.

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "s3:ListBucket"
            ],
            "Resource": [
              "arn:aws:s3:::<s3-bucket-name>"
            ]
          },
          {
            "Effect": "Allow",
            "Action": [
              "s3:PutObject",
              "s3:GetObject",
              "s3:DeleteObject",
              "s3:PutObjectAcl"
            ],
            "Resource": [
              "arn:aws:s3:::<s3-bucket-name>/*"
            ]
          }
        ]
      }
      
    4. Click Review policy.

    5. In the Name field, type a policy name.

    6. Click Create policy.

  6. In the role summary, copy the Instance Profile ARN.

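If you prefer to script this setup, the same role and instance profile can be created with boto3. The following is a minimal sketch, not a substitute for the console walkthrough above; the role name and inline policy name are placeholders, and the S3 policy is the same JSON shown in the previous step.

    import json
    import boto3

    iam = boto3.client("iam")

    role_name = "databricks-s3-access"   # placeholder name
    bucket_name = "<s3-bucket-name>"

    # Trust policy that lets EC2 instances assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }
    iam.create_role(RoleName=role_name,
                    AssumeRolePolicyDocument=json.dumps(trust_policy))

    # Inline policy granting access to the bucket (same statements as above).
    s3_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["s3:ListBucket"],
             "Resource": [f"arn:aws:s3:::{bucket_name}"]},
            {"Effect": "Allow",
             "Action": ["s3:PutObject", "s3:GetObject",
                        "s3:DeleteObject", "s3:PutObjectAcl"],
             "Resource": [f"arn:aws:s3:::{bucket_name}/*"]}
        ]
    }
    iam.put_role_policy(RoleName=role_name,
                        PolicyName="s3-bucket-access",
                        PolicyDocument=json.dumps(s3_policy))

    # Unlike the console, the API does not create an instance profile for you.
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(InstanceProfileName=role_name,
                                     RoleName=role_name)

    # Instance profile ARN to paste into Databricks in Step 5.
    profile = iam.get_instance_profile(InstanceProfileName=role_name)
    print(profile["InstanceProfile"]["Arn"])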

Step 2: Create a bucket policy for the target S3 bucket

At a minimum, the S3 policy must include the ListBucket and GetObject actions.

Important

The s3:PutObjectAcl permission is required if you perform Step 7: Update cross-account S3 object ACLs to configure the bucket owner to have access to all of the data in the bucket.

  1. In the S3 console, open the bucket policy editor for the target bucket (Permissions > Bucket policy) and paste in a policy. A sample cross-account bucket policy could be the following. Replace <aws-account-id-databricks> with the AWS account ID where the Databricks environment is deployed, <iam-role-for-s3-access> with the role you created in Step 1, and <s3-bucket-name> with the bucket name.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Example permissions",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
          },
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>"
        },
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
          },
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject",
            "s3:PutObjectAcl"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
        }
      ]
    }
    
  2. Click Save.
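
The same bucket policy can also be applied programmatically. The following is a minimal boto3 sketch, run with credentials in the AWS account that owns the bucket; the policy document mirrors the JSON above and uses the same placeholders.

    import json
    import boto3

    s3 = boto3.client("s3")

    bucket_name = "<s3-bucket-name>"
    databricks_role_arn = (
        "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
    )

    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Principal": {"AWS": databricks_role_arn},
             "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
             "Resource": f"arn:aws:s3:::{bucket_name}"},
            {"Effect": "Allow",
             "Principal": {"AWS": databricks_role_arn},
             "Action": ["s3:PutObject", "s3:GetObject",
                        "s3:DeleteObject", "s3:PutObjectAcl"],
             "Resource": f"arn:aws:s3:::{bucket_name}/*"}
        ]
    }

    s3.put_bucket_policy(Bucket=bucket_name,
                         Policy=json.dumps(bucket_policy))

    # Confirm the policy that is now attached to the bucket.
    print(s3.get_bucket_policy(Bucket=bucket_name)["Policy"])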

Step 3: Note the IAM role used to create the Databricks deployment

This IAM role is the role you used when setting up the Databricks account.

  1. As the account owner, log in to the Account Console.

  2. Click the AWS Account tab.

  3. Note the role name at the end of the Role ARN (the part after role/); in this example, testco-role.


Step 4: Add the S3 IAM role to the EC2 policy

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click the role you noted in Step 3.

  4. On the Permissions tab, click the policy.

  5. Click Edit Policy.

  6. Modify the policy to allow Databricks to pass the IAM role you created in Step 1 to the EC2 instances for the Spark clusters. Here is an example of what the new policy should look like. Replace <aws-account-id-databricks> with the AWS account ID where the Databricks environment is deployed and <iam-role-for-s3-access> with the role you created in Step 1:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateIamInstanceProfile",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateKeyPair",
            "ec2:CreatePlacementGroup",
            "ec2:CreateRoute",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcPeeringConnection",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteKeyPair",
            "ec2:DeletePlacementGroup",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteTags",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeIamInstanceProfileAssociations",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribePlacementGroups",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:DisassociateIamInstanceProfile",
            "ec2:ModifyVpcAttribute",
            "ec2:ReplaceIamInstanceProfileAssociation",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
            "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
        },
        {
          "Effect": "Allow",
          "Action": [
            "iam:CreateServiceLinkedRole",
            "iam:PutRolePolicy"
          ],
          "Resource": "arn:aws:iam::*:role/aws-service-role/spot.amazonaws.com/AWSServiceRoleForEC2Spot",
          "Condition": {
            "StringLike": {
              "iam:AWSServiceName": "spot.amazonaws.com"
            }
          }
        }
      ]
    }
    
  7. Click Review policy.

  8. Click Save changes.
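
As an optional sanity check, you can use the IAM policy simulator to confirm that the deployment role is now allowed to pass the S3 role. The following boto3 sketch is not part of the required setup; the two role ARNs are placeholders for the role you noted in Step 3 and the role you created in Step 1.

    import boto3

    iam = boto3.client("iam")

    # Role you noted in Step 3 (used to create the Databricks deployment).
    deployment_role_arn = "arn:aws:iam::<aws-account-id-databricks>:role/<deployment-role>"
    # Role you created in Step 1.
    s3_role_arn = "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"

    result = iam.simulate_principal_policy(
        PolicySourceArn=deployment_role_arn,
        ActionNames=["iam:PassRole"],
        ResourceArns=[s3_role_arn],
    )
    # Expect "allowed" once the iam:PassRole statement is in place.
    print(result["EvaluationResults"][0]["EvalDecision"])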

Step 5: Add the instance profile to Databricks

  1. Go to the Admin Console.

  2. Click the Instance Profiles tab.

  3. Click the Add Instance Profile button. A dialog displays.

  4. Paste in the instance profile ARN from Step 1.


    You select the Meta Instance Profile property only when you are setting up IAM credential passthrough.

    Databricks validates that the instance profile ARN is both syntactically and semantically correct. To validate semantic correctness, Databricks does a dry run by launching a cluster with this instance profile. Any failure in this dry run produces a validation error in the UI.

    Validation of the instance profile can fail if the role is subject to a tag-enforcement policy, preventing you from adding an otherwise legitimate instance profile. If the validation fails and you still want to add the instance profile to Databricks, use the Instance Profiles API and specify skip_validation (see the sketch after this list).

  5. Click Add.

  6. Optionally specify the users who can launch clusters with the instance profile.

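If validation fails and you choose to skip it, you can register the instance profile with the Instance Profiles API instead of the UI. The following is a minimal sketch assuming a workspace URL, a personal access token, and the requests library; adjust the placeholders to your environment.

    import requests

    workspace_url = "https://<databricks-instance>"   # your workspace URL
    token = "<personal-access-token>"

    resp = requests.post(
        f"{workspace_url}/api/2.0/instance-profiles/add",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "instance_profile_arn": "arn:aws:iam::<aws-account-id>:instance-profile/<profile-name>",
            "skip_validation": True,   # bypass the dry-run validation described above
        },
    )
    resp.raise_for_status()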

Step 6: Launch a cluster with the instance profile

  1. Select or create a cluster.

  2. Open the Advanced Options section.

  3. On the Instances tab, select the instance profile from the Instance Profile drop-down list. This drop-down includes all of the instance profiles that are available for the cluster.

  4. Verify that you can access the S3 bucket, using the following command:

    dbutils.fs.ls("s3a://<s3-bucket-name>/")
    

    If the command succeeds, go to Step 7.

Warning

Once a cluster launches with an instance profile, anyone who has attach permission to the cluster can access the underlying resources controlled by this role. To limit unwanted access, you can use cluster ACLs to restrict attach permissions.

Step 7: Update cross-account S3 object ACLs

If you are writing to another S3 bucket within the same AWS account, you can stop here.

When you write to a file in a cross-account S3 bucket, the default setting allows only you to access that file. The assumption is that you will write files to your own buckets, and this default setting protects your data. To allow the bucket owner to have access to all of the objects in the bucket, you must add the BucketOwnerFullControl ACL to the objects written by Databricks.

  1. On the Spark tab on the cluster detail page, set the following properties:

    spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
    spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
    
  2. Verify that you can write data to the S3 bucket, and check that the permissions enable other tools and users to access the contents written by Databricks (one way to check is sketched below).
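
One way to do this check, sketched here under stated assumptions: write a test object from the Databricks cluster, then inspect its ACL from the bucket owner's account (for example with boto3). The object key used below is a placeholder.

    # Run in a notebook on the cluster configured above.
    dbutils.fs.put("s3a://<s3-bucket-name>/tmp/acl-test.txt", "test", overwrite=True)

    # Run with credentials in the bucket owner's AWS account.
    import boto3
    s3 = boto3.client("s3")
    acl = s3.get_object_acl(Bucket="<s3-bucket-name>", Key="tmp/acl-test.txt")
    # With BucketOwnerFullControl applied, the grants should include
    # FULL_CONTROL for the bucket owner.
    for grant in acl["Grants"]:
        print(grant["Grantee"].get("ID"), grant["Permission"])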

Frequently asked questions (FAQ)

I don’t see any instance profiles configured for my access when I create a cluster.

If you are an admin, go to the Admin Console and follow the instructions in this article to add an instance profile. Otherwise, contact your admin, who can add an instance profile using the instructions in this article.

I am using the s3n or s3 URI schemes and I can’t access S3 resources without passing in credentials, even though my instance profile allows access to the resources.

To use an instance profile when accessing s3 and s3n URIs, add the following configuration parameters to the Spark configuration when you launch a cluster:

spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem

I am using mount points to store credentials. How do mount points work on clusters with an instance profile?

Existing mount points work as they do on clusters that don’t use an instance profile. When you launch a cluster with an instance profile, you can also mount an S3 bucket without passing credentials, using:

dbutils.fs.mount("s3a://${pathtobucket}", "/mnt/${MountPointName}")
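
After the bucket is mounted, you can read it through the mount point like any other DBFS path, for example:

dbutils.fs.ls("/mnt/${MountPointName}")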