Amazon S3

Amazon S3 is a service for storing large amounts of unstructured object data, such as text or binary data.

This article explains how to access AWS S3 buckets, either by mounting them through DBFS or by accessing them directly using APIs.

Important

Databricks Runtime 7.3 LTS and above use an upgraded version of the S3 connector. The following changes can have an impact on existing code:

  • The S3A filesystem releases resources on FileSystem.close(). Since filesystem caching is enabled by default, this can cause other threads with a reference to the cached filesystem to try to use it incorrectly after it is closed. Therefore, you should not use the FileSystem.close() API.
  • The S3A filesystem does not remove directory markers when closing an output stream. Legacy applications based on Hadoop versions that do not include HADOOP-13230 can misinterpret them as empty directories even if there are files inside.

Access S3 buckets through DBFS

This section describes how to access S3 buckets through DBFS. You can:

  • Mount an S3 bucket
  • Access S3 objects as local files
  • Unmount an S3 bucket

Mount an S3 bucket

You can mount an S3 bucket through Databricks File System (DBFS). The mount is a pointer to an S3 location, so the data is never synced locally.

Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
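
For example, the following is a minimal sketch of what you might run in a notebook attached to the other cluster (dbutils.fs.refreshMounts and dbutils.fs.mounts are standard dbutils calls):

# Refresh this cluster's view of DBFS mount points, then list them
dbutils.fs.refreshMounts()
display(dbutils.fs.mounts())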

There are two ways to mount an S3 bucket:

  • Mount a bucket using an AWS instance profile
  • Mount a bucket using AWS keys

Mount a bucket using an AWS instance profile

You can manage authentication and authorization for an S3 bucket using an AWS instance profile. The type of access to the objects in the bucket is determined by the permissions granted to the instance profile. If the role has write access, users of the mount point can write objects to the bucket. If the role has read access, users of the mount point can read objects from the bucket.

  1. Configure your cluster with an instance profile.

  2. Mount the bucket.

    # Python
    aws_bucket_name = "<aws-bucket-name>"
    mount_name = "<mount-name>"
    dbutils.fs.mount("s3a://%s" % aws_bucket_name, "/mnt/%s" % mount_name)
    display(dbutils.fs.ls("/mnt/%s" % mount_name))

    // Scala
    val AwsBucketName = "<aws-bucket-name>"
    val MountName = "<mount-name>"

    dbutils.fs.mount(s"s3a://$AwsBucketName", s"/mnt/$MountName")
    display(dbutils.fs.ls(s"/mnt/$MountName"))
    

Mount a bucket using AWS keys

You can mount a bucket using AWS keys.

Important

When you mount an S3 bucket using keys, all users have read and write access to all the objects in the S3 bucket.

The following examples use Databricks secrets to store the keys. You must URL escape the secret key.

# Python
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
# Encode the secret key, as it can contain "/"
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"

dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))

// Scala
val AccessKey = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
val SecretKey = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
// Encode the Secret Key as it can contain "/"
val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "<aws-bucket-name>"
val MountName = "<mount-name>"

dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))

Access S3 objects as local files

Once an S3 bucket is mounted to DBFS, you can access S3 objects using local file paths.

df = spark.read.text("/mnt/%s/..." % mount_name)

or

df = spark.read.text("dbfs:/mnt/%s/..." % mount_name)
// scala
val df = spark.read.text(s"/mnt/$MountName/...")

or

val df = spark.read.text(s"dbfs:/mnt/$MountName/...")

Unmount an S3 bucket

dbutils.fs.unmount("/mnt/mount_name")
dbutils.fs.unmount(s"/mnt/$MountName")

Access S3 buckets directly

This method allows Spark workers to access an object in an S3 bucket directly using AWS keys. It uses Databricks secrets to store the keys.

access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")

myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()
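
The same credentials also apply to DataFrame reads, if you prefer the DataFrame API over RDDs; a minimal sketch with a placeholder path:

df = spark.read.text("s3a://%s/<path-to-files>" % aws_bucket_name)
df.count()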

Configure KMS encryption for s3a:// paths

Step 1: Configure an instance profile

Configure your cluster with an instance profile.

Step 2: Add the instance profile as a key user for the KMS key provided in the configuration

  1. In AWS, go to the IAM service.
  2. Click Encryption Keys at the bottom of the sidebar.
  3. Click the key that you want to add permission to.
  4. In the Key Users section, click Add.
  5. Select the checkbox next to the IAM role.
  6. Click Attach.

Step 3: Set up encryption properties

Set up global KMS encryption properties in a Spark configuration setting or by using an init script. Configure the spark.hadoop.fs.s3a.server-side-encryption.key key with your own key ARN.

Spark configuration
spark.hadoop.fs.s3a.server-side-encryption.key arn:aws:kms:<region>:<aws-account-id>:key/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS

You can also configure per-bucket KMS encryption.
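
For example, a sketch that combines the per-bucket syntax described in Per-bucket configuration with the KMS settings above (placeholder bucket name and key ARN):

spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption-algorithm SSE-KMS
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key arn:aws:kms:<region>:<aws-account-id>:key/<key-id>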

Init script

Configure the global encryption setting by running the following code in a notebook cell to create the init script set-kms.sh, and then configure a cluster to run the script.

dbutils.fs.put("/databricks/scripts/set-kms.sh", """
#!/bin/bash

cat >/databricks/driver/conf/aes-encrypt-custom-spark-conf.conf <<EOL
[driver] {
  "spark.hadoop.fs.s3a.server-side-encryption.key" = "arn:aws:kms:<region>:<aws-account-id>:key/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>"
  "spark.hadoop.fs.s3a.server-side-encryption-algorithm" = "SSE-KMS"
}
EOL
""", True)

Once you verify that encryption is working, configure encryption on all clusters using a global init script.

Encrypt data in S3 buckets

Databricks supports encrypting data using server-side encryption. This section covers how to use server-side encryption when writing files in S3 through DBFS. Databricks supports Amazon S3-managed encryption keys (SSE-S3) and AWS KMS–managed encryption keys (SSE-KMS).

Write files using SSE-S3

  1. To mount your S3 bucket with SSE-S3, run:

    dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")
    
  2. To write files to the corresponding S3 bucket with SSE-S3, run:

    dbutils.fs.put(s"/mnt/$MountName", "<file content>")
    

Write files using SSE-KMS

  1. Mount a source directory passing in sse-kms or sse-kms:$KmsKey as the encryption type.

    • To mount your S3 bucket with SSE-KMS using the default KMS master key, run:

      dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms")
      
    • To mount your S3 bucket with SSE-KMS using a specific KMS key, run:

      dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms:$KmsKey")
      
  2. To write files to the S3 bucket with SSE-KMS, run:

    dbutils.fs.put(s"/mnt/$MountName", "<file content>")
    

Configuration

Databricks Runtime 7.3 LTS and above support configuring the S3A filesystem using open-source Hadoop options. You can configure global properties and per-bucket properties.

Global configuration

# Global S3 configuration
spark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.endpoint <aws-endpoint>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS

Per-bucket configuration

You configure per-bucket properties using the syntax spark.hadoop.fs.s3a.bucket.<bucket-name>.<configuration-key>. This lets you set up buckets with different credentials, endpoints, and so on.

For example, in addition to global S3 settings you can configure each bucket individually using the following keys:

# Set up authentication and endpoint for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>

# Configure a different KMS encryption key for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>

Access Requester Pays buckets

To enable access to Requester Pays buckets, add the following line to your cluster's Spark configuration:

spark.hadoop.fs.s3a.requester-pays.enabled true
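
With this setting enabled, reads from a Requester Pays bucket use the same s3a:// syntax as any other bucket; a minimal Python sketch with placeholder names:

df = spark.read.text("s3a://<requester-pays-bucket-name>/<path>")
display(df)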

Note

Databricks does not support Delta Lake writes to Requester Pays buckets.