Working with data in Amazon S3

Databricks maintains optimized drivers for connecting to AWS S3. Amazon S3 is a service for storing large amounts of unstructured object data, such as text or binary data.

This article explains how to access AWS S3 buckets.

Important

  • The S3A filesystem enables caching by default and releases resources on FileSystem.close(). To avoid other threads using a reference to the cached file system incorrectly, do not explicitly call FileSystem.close().

  • The S3A filesystem does not remove directory markers when closing an output stream. Legacy applications based on Hadoop versions that do not include HADOOP-13230 can misinterpret them as empty directories even if there are files inside.

Note

If you are looking for information on working with mounted S3 data, see Mounting cloud object storage on Databricks.

Access S3 buckets with Unity Catalog external locations

Unity Catalog manages access to data in S3 buckets using external locations. Administrators primarily use external locations to configure Unity Catalog external tables, but can also delegate access to users or groups using the available privileges (READ FILES, WRITE FILES, and CREATE TABLE).

Use the fully qualified S3 URI to access data secured with Unity Catalog. Because permissions are managed by Unity Catalog, you do not need to pass any additional options or configurations for authentication.

Warning

Unity Catalog ignores Spark configuration settings when accessing data managed by external locations.

Examples of reading:

dbutils.fs.ls("s3://my-bucket/external-location/path/to/data")

spark.read.format("parquet").load("s3://my-bucket/external-location/path/to/data")

spark.sql("SELECT * FROM parquet.`s3://my-bucket/external-location/path/to/data`")

Examples of writing:

dbutils.fs.mv("s3://my-bucket/external-location/path/to/data", "s3://my-bucket/external-location/path/to/new-location")

df.write.format("parquet").save("s3://my-bucket/external-location/path/to/new-location")

Examples of creating external tables:

df.write.option("path", "s3://my-bucket/external-location/path/to/table").saveAsTable("my_table")

spark.sql("""
  CREATE TABLE my_table
  LOCATION "s3://my-bucket/external-location/path/to/table"
  AS (SELECT *
    FROM parquet.`s3://my-bucket/external-location/path/to/data`)
""")

Access S3 buckets using instance profiles

You can load IAM roles as instance profiles in Databricks and attach instance profiles to clusters to control data access to S3. Databricks recommends using instance profiles when Unity Catalog is unavailable for your environment or workload. See Secure access to S3 buckets using instance profiles.
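
With an instance profile attached to a cluster, Spark reads S3 data without any credentials in the notebook; the S3A connector picks up the role's credentials automatically. A minimal sketch, assuming a hypothetical bucket my-bucket that the attached IAM role can read:

# No keys or additional Spark configuration are required here; the
# cluster's instance profile supplies the credentials. "my-bucket" is a
# hypothetical bucket that the attached IAM role is assumed to access.
df = spark.read.format("parquet").load("s3a://my-bucket/path/to/data")
display(df)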

Access S3 buckets with URIs and AWS keys

This method allows Spark workers to access an object in an S3 bucket directly using AWS keys. It uses Databricks secrets to store the keys.

# Read the AWS keys from a Databricks secret scope.
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")

# Pass the keys to the S3A filesystem through the Hadoop configuration.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")

# Replace <aws-bucket-name> with the name of your bucket.
aws_bucket_name = "<aws-bucket-name>"
myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()
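
The same keys apply to DataFrame reads over s3a:// paths, so you can also load the data with the DataFrame API. A brief sketch using a placeholder path:

# The access and secret keys set above also apply here; the path under
# the bucket is a placeholder.
df = spark.read.format("parquet").load("s3a://%s/<path-to-data>" % aws_bucket_name)
display(df)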

Configure KMS encryption for s3a:// paths

Step 1: Configure an instance profile

In Databricks, create an instance profile.

Step 2: Add the instance profile as a key user for the KMS key provided in the configuration

  1. In AWS, go to the KMS service.

  2. Click the key that you want to add permission to.

  3. In the Key Users section, click Add.

  4. Select the checkbox next to the IAM role.

  5. Click Add.

Step 3: Set up encryption properties

Set up global KMS encryption properties in a Spark configuration setting or using an init script. Configure the spark.hadoop.fs.s3a.server-side-encryption.key key with your own key ARN.

Spark configuration

spark.hadoop.fs.s3a.server-side-encryption.key arn:aws:kms:<region>:<aws-account-id>:key/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS

You can also configure per-bucket KMS encryption.
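
For example, a per-bucket form of the same settings (using the syntax described under Per-bucket configuration below, with <bucket-name> and the key ARN as placeholders) looks like this:

spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption-algorithm SSE-KMS
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key arn:aws:kms:<region>:<aws-account-id>:key/<key-id>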

Init script

To configure the global encryption setting, run the following code in a notebook cell to create the init script set-kms.sh, and then configure a cluster to run the script.

dbutils.fs.put("/databricks/scripts/set-kms.sh", """
#!/bin/bash

cat >/databricks/driver/conf/aes-encrypt-custom-spark-conf.conf <<EOL
[driver] {
  "spark.hadoop.fs.s3a.server-side-encryption.key" = "arn:aws:kms:<region>:<aws-account-id>:key/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>"
  "spark.hadoop.fs.s3a.server-side-encryption-algorithm" = "SSE-KMS"
}
EOL
""", True)

Once you verify that encryption is working, configure encryption on all clusters using a global init script.

Configuration

Databricks Runtime 7.3 LTS and above support configuring the S3A filesystem using open-source Hadoop options. You can configure global properties and per-bucket properties.

Global configuration

# Global S3 configuration
spark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.endpoint <aws-endpoint>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS

Per-bucket configuration

You configure per-bucket properties using the syntax spark.hadoop.fs.s3a.bucket.<bucket-name>.<configuration-key>. This lets you set up buckets with different credentials, endpoints, and so on.

For example, in addition to global S3 settings you can configure each bucket individually using the following keys:

# Set up authentication and endpoint for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>

# Configure a different KMS encryption key for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>
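
The per-bucket keys can also be set at runtime from a notebook, following the same Hadoop configuration pattern used earlier with AWS keys. A minimal sketch, with <bucket-name> and the secret scope and key names as placeholders:

# Set per-bucket credentials at runtime; <bucket-name> and the secret
# scope and key names are placeholders. Properties set this way omit the
# spark.hadoop. prefix.
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.bucket.<bucket-name>.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.bucket.<bucket-name>.secret.key", secret_key)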

Access Requester Pays buckets

To enable access to Requester Pays buckets, add the following line to your cluster’s Spark configuration:

spark.hadoop.fs.s3a.requester-pays.enabled true
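
After the flag is set, reads from a Requester Pays bucket use the standard APIs. A brief sketch with a placeholder bucket and path:

# Read from a Requester Pays bucket once requester-pays is enabled on the
# cluster; the bucket and path are placeholders, and request costs are
# billed to the requesting account.
df = spark.read.format("parquet").load("s3a://<requester-pays-bucket>/path/to/data")
display(df)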

Note

Databricks does not support Delta Lake writes to Requester Pays buckets.