Configure Databricks S3 commit service-related settings

Databricks runs a commit service that coordinates writes to Amazon S3 from multiple clusters. This service runs in the Databricks control plane. For additional security, you can disable the service’s direct upload optimization as described in Disable the direct upload optimization. To further restrict access to your S3 buckets, see Additional bucket security restrictions.

If you receive AWS GuardDuty alerts related to the S3 commit service, see AWS GuardDuty alerts related to S3 commit service.

About the commit service

The S3 commit service helps guarantee consistency of writes across multiple clusters on a single table in specific cases. For example, the commit service helps Delta Lake implement ACID transactions.

In the default configuration, Databricks sends temporary AWS credentials from the data plane to the control plane in the commit service API call. Instance profile credentials are valid for six hours.

The data plane writes data directly to S3, and then the S3 commit service in the control plane provides concurrency control by finalizing the commit log upload (completing the multipart upload described below). The commit service does not read any data from S3; it writes a new file to S3 only if that file does not already exist.

The most common data written to S3 by the Databricks commit service is the Delta log, which contains statistical aggregates from your data, such as each column’s minimum and maximum values. Most Delta log data is sent to S3 from the data plane using an Amazon S3 multipart upload.

After the cluster stages the multipart data to write the Delta log to S3, the S3 commit service in the Databricks control plane finishes the S3 multipart upload by letting S3 know that it is complete. As a performance optimization for very small updates, by default the commit service sometimes pushes small updates directly from the control plane to S3. This direct upload optimization can be disabled. See Disable the direct upload optimization.
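
For example, a Delta table keeps its commit log as numbered JSON files under a _delta_log directory in the table path. This is an illustrative layout only; replace <bucket-name> and <table-path> with your own values:

    s3://<bucket-name>/<table-path>/_delta_log/00000000000000000000.json
    s3://<bucket-name>/<table-path>/_delta_log/00000000000000000001.json
    s3://<bucket-name>/<table-path>/_delta_log/00000000000000000002.json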

In addition to Delta Lake, the following Databricks features use the same S3 commit service:

  • Structured Streaming

  • Auto Loader

  • The SQL command COPY INTO

The commit service is necessary because Amazon S3 does not provide an operation that puts an object only if it does not already exist. Amazon S3 is a distributed system: if it receives multiple simultaneous write requests for the same object, only the last object written is kept. Without the ability to centrally verify commits, simultaneous commits from different clusters could corrupt tables.
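
As an illustration only (this is not the Databricks implementation), the following Python sketch models the “put only if it does not exist” check that the commit service provides, with an in-process lock standing in for the central service:

    # Illustrative sketch only -- not the Databricks implementation.
    # S3 keeps whichever object for a key is written last, so two clusters writing
    # the same commit version at the same time would silently overwrite each other.
    # A central service can instead serialize commits and reject the loser.
    import threading

    _commit_lock = threading.Lock()   # stands in for the central commit service
    _committed = set()                # stands in for "this commit file already exists in S3"

    def commit_if_absent(version: int) -> bool:
        """Accept a commit only if no other writer has committed this version."""
        with _commit_lock:
            if version in _committed:
                return False          # another cluster won; the caller must retry with a new version
            _committed.add(version)   # the real service would finalize the multipart upload here
            return True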

AWS GuardDuty alerts related to S3 commit service

If you use AWS GuardDuty and you access data using AWS IAM instance profiles, GuardDuty may create alerts for default Databricks behavior related to Delta Lake, Structured Streaming, Auto Loader, or COPY INTO. These alerts are related to instance credential exfiltration detection, which is enabled by default. These alerts include the title UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.InsideAWS.

You can configure your Databricks deployment to address GuardDuty alerts related to the S3 commit service by creating an AWS instance profile that assumes the role of your original S3 data access IAM role.

With this new instance profile, you can configure clusters to assume a role with short-duration tokens instead of using instance profile credentials directly. This capability already exists in all recent Databricks Runtime versions and can be enforced globally with cluster policies. An optional verification sketch follows the steps below.

  1. If you have not already done so, create a normal instance profile to access the S3 data. This instance profile uses instance profile credentials to directly access the S3 data.

    This section refers to the role ARN in this instance profile as the <data-role-arn>.

  2. Create a new instance profile that uses tokens and references the instance profile that directly accesses the data. Your cluster will reference this new token-based instance profile. See Secure access to S3 buckets using instance profiles.

    This instance profile does not need any direct S3 access. Instead it needs only the permissions to assume the IAM role that you use for data access. This section refers to the role ARN in this instance profile as the <cluster-role-arn>.

    1. Attach an IAM policy to the new cluster instance profile’s IAM role (<cluster-role-arn>). Add the following policy statement, replacing <data-role-arn> with the ARN of the role in your original instance profile that accesses your bucket.

      {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "<data-role-arn>"
      }
      
    2. Add a trust policy statement to your existing data access IAM role (<data-role-arn>) and replace <cluster-role-arn> with the ARN of the role in your new cluster instance profile.

      {
        "Effect": "Allow",
        "Principal": {
            "AWS": "<cluster-role-arn>"
        },
        "Action": "sts:AssumeRole"
      }
      
  3. To use notebook code that connects directly to S3 without using DBFS, configure your clusters to use the new token-based instance profile and to assume the data access role.

    • Configure a cluster for S3 access to all buckets. Add the following to the cluster’s Spark configuration:

      fs.s3a.credentialsType AssumeRole
      fs.s3a.stsAssumeRole.arn <data-role-arn>
      
    • With Databricks Runtime 8.3 and above, you can configure this for a specific bucket:

      fs.s3a.bucket.<bucket-name>.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
      fs.s3a.bucket.<bucket-name>.assumed.role.arn <data-role-arn>
      
  4. If you use DBFS mounts of S3 buckets, your notebook code must mount the bucket and add the AssumeRole configuration. This step is necessary only for DBFS mounts, not for accessing root DBFS storage in your workspace’s root S3 bucket. The following example uses Python:

    
    # If other code has already mounted the bucket without the new role, unmount it first
    dbutils.fs.unmount("/mnt/<mount-name>")
    
    # Mount the bucket and assume the data access role
    dbutils.fs.mount("s3a://<bucket-name>/", "/mnt/<mount-name>", extra_configs = {
      "fs.s3a.credentialsType": "AssumeRole",
      "fs.s3a.stsAssumeRole.arn": "<data-role-arn>"
    })
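
After completing these steps, you can optionally verify the role chain from a notebook on a cluster that uses the new token-based instance profile. This is a minimal sketch, not part of the required procedure; it assumes boto3 is available on the cluster and that <data-role-arn> is the ARN of your data access role:

    # Optional verification sketch (assumes boto3 is available on the cluster).
    import boto3

    sts = boto3.client("sts")
    # The caller identity should reflect the new cluster instance profile role.
    print("Cluster identity:", sts.get_caller_identity()["Arn"])

    # Confirm the cluster role can assume the data access role.
    assumed = sts.assume_role(
        RoleArn="<data-role-arn>",
        RoleSessionName="verify-assume-role",
    )
    print("Assumed role:", assumed["AssumedRoleUser"]["Arn"])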
    

Disable the direct upload optimization

As a performance optimization for very small updates, by default the commit service sometimes pushes small updates directly from the control plane to S3. To disable this optimization, set the Spark parameter spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold to 0. You can apply this setting in the cluster’s Spark config or set it in a global init script.
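
For example, you can add the following line to the cluster’s Spark config:

    spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold 0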

Disabling this feature may result in a small performance impact for near real-time Structured Streaming queries that make constant small updates. Consider testing the performance impact with your data before disabling this feature in production.

Additional bucket security restrictions

The following bucket policy configurations further restrict access to your S3 buckets.

Neither of these changes affects GuardDuty alerts.

  • Limit bucket access to specific IP addresses and S3 operations. For additional control, you can restrict specific S3 buckets so they are accessible only from specific IP addresses, for example only your own environment and the IP addresses of the Databricks control plane, including the S3 commit service. See Restrict access to your S3 buckets. This configuration limits the risk that credentials are used from other locations.

  • Limit S3 operation types outside the required directories. You can deny access from the Databricks control plane to your S3 bucket outside the directories required by the S3 commit service, and you can limit operations in those directories to just the required S3 operations put and list from Databricks IP addresses. The Databricks control plane (including the S3 commit service) does not require get access on the bucket.

    {
      "Sid": "LimitCommitServiceActions",
      "Effect": "Deny",
      "Principal": "*",
      "NotAction": [
          "s3:ListBucket",
          "s3:GetBucketLocation",
          "s3:PutObject"
      ],
      "Resource": [
          "arn:aws:s3:::<bucket-name>/*",
          "arn:aws:s3:::<bucket-name>"
      ],
      "Condition": {
          "IpAddress": {
              "aws:SourceIp": "<control-plane-ip>"
          }
      }
    },
    {
      "Sid": "LimitCommitServicePut",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "NotResource": [
          "arn:aws:s3:::<bucket-name>/*_delta_log/*",
          "arn:aws:s3:::<bucket-name>/*_spark_metadata/*",
          "arn:aws:s3:::<bucket-name>/*offsets/*",
          "arn:aws:s3:::<bucket-name>/*sources/*",
          "arn:aws:s3:::<bucket-name>/*sinks/*",
          "arn:aws:s3:::<bucket-name>/*_schemas/*"
      ],
      "Condition": {
          "IpAddress": {
              "aws:SourceIp": "<control-plane-ip>"
          }
      }
    },
    {
      "Sid": "LimitCommitServiceList",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<bucket-name>",
      "Condition": {
          "StringNotLike": {
              "s3:Prefix": [
                  "*_delta_log/*",
                  "*_spark_metadata/*",
                  "*offsets/*",
                  "*sources/*",
                  "*sinks/*",
                  "*_schemas/*"
              ]
          },
          "IpAddress": {
              "aws:SourceIp": "<control-plane-ip>"
          }
      }
    }
    

    Replace <control-plane-ip> with your regional IP address for the Databricks control plane. Replace <bucket-name> with your S3 bucket name.

