Connect to Amazon S3

This article explains how to connect to AWS S3 from Databricks.

Databricks recommends using Unity Catalog to configure access to S3 and volumes for direct interaction with files. See Connect to cloud object storage using Unity Catalog.

Access S3 buckets using instance profiles

You can load IAM roles as instance profiles in Databricks and attach instance profiles to clusters to control data access to S3. Databricks recommends using instance profiles when Unity Catalog is unavailable for your environment or workload. For a tutorial on using instance profiles with Databricks, see Tutorial: Configure S3 access with an instance profile.

The AWS user who creates the IAM role must:

  • Be an AWS account user with permission to create or update IAM roles, IAM policies, S3 buckets, and cross-account trust relationships.

The Databricks user who adds the IAM role as an instance profile in Databricks must:

  • Be a workspace admin.

Once you add the instance profile to your workspace, you can grant users, groups, or service principals permission to launch clusters with the instance profile. See Manage instance profiles in Databricks.

Use both cluster access control and notebook access control together to protect access to the instance profile. See Cluster access control and Collaborate using Databricks notebooks.
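
As a rough illustration, the IAM role behind an instance profile typically carries a policy that grants access to specific buckets. The sketch below builds one such policy document in Python. The helper name `s3_access_policy`, the bucket name, and the selected actions are assumptions made for illustration, not a policy prescribed by Databricks; follow the tutorial referenced above for the authoritative steps.

```python
import json


def s3_access_policy(bucket_name: str) -> dict:
    """Build an example IAM policy document for one S3 bucket.

    Hypothetical helper: the action list is an illustrative choice,
    not a requirement stated in this article.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permission: list the bucket's contents.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level permissions: read, write, and delete objects.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


policy = s3_access_policy("my-s3-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON could be pasted into the IAM console as an inline policy for the role; narrowing the actions or the resource path is usually preferable to granting access to a whole bucket.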

Access S3 buckets with URIs and AWS keys

You can set Spark properties to configure AWS keys to access S3.

Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the AWS key while allowing users to access S3. To create a secret scope, see Secret scopes.

The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to S3. See Cluster access control and Collaborate using Databricks notebooks.

Use the following snippet in a cluster’s Spark configuration to set the AWS keys stored in secret scopes as environment variables:

AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}

You can then read from S3 using the following commands:

aws_bucket_name = "my-s3-bucket"

df = spark.read.load(f"s3a://{aws_bucket_name}/flowers/delta/")
display(df)
dbutils.fs.ls(f"s3a://{aws_bucket_name}/")
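
When credentials are scoped to a notebook rather than a cluster, one possible approach is to read the keys from a secret scope at run time and set them on the Hadoop configuration. The sketch below is a minimal version of that idea. The helper `configure_s3a_keys` and the stand-in class `InMemoryConf` are hypothetical names introduced here; `fs.s3a.access.key` and `fs.s3a.secret.key` are the standard S3A key properties. The commented lines show how the helper might be called in a notebook and assume that `dbutils` and `sc` are available on the cluster.

```python
class InMemoryConf:
    """Stand-in for a Hadoop configuration object (illustration only)."""

    def __init__(self) -> None:
        self.values: dict = {}

    def set(self, key: str, value: str) -> None:
        self.values[key] = value


def configure_s3a_keys(hadoop_conf, access_key: str, secret_key: str) -> None:
    """Set S3A key-based credentials on any object exposing set(key, value)."""
    hadoop_conf.set("fs.s3a.access.key", access_key)
    hadoop_conf.set("fs.s3a.secret.key", secret_key)


# In a Databricks notebook (assumed environment; scope and key names are placeholders):
# access_key = dbutils.secrets.get(scope="<scope>", key="aws_access_key_id")
# secret_key = dbutils.secrets.get(scope="<scope>", key="aws_secret_access_key")
# configure_s3a_keys(sc._jsc.hadoopConfiguration(), access_key, secret_key)

# Local demonstration with the stand-in object and dummy values.
demo_conf = InMemoryConf()
configure_s3a_keys(demo_conf, "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY")
print(sorted(demo_conf.values))
```

Passing the configuration object in as a parameter keeps the helper free of any direct dependency on Spark, so the same function can be exercised outside a cluster. Avoid printing the secret values themselves.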

Access S3 with open-source Hadoop options

Databricks Runtime supports configuring the S3A filesystem using open-source Hadoop options. You can configure global properties and per-bucket properties.

Global configuration

# Global S3 configuration
spark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.endpoint <aws-endpoint>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS

Per-bucket configuration

You configure per-bucket properties using the syntax spark.hadoop.fs.s3a.bucket.<bucket-name>.<configuration-key>. This lets you set up buckets with different credentials, endpoints, and so on.

For example, in addition to global S3 settings you can configure each bucket individually using the following keys:

# Set up authentication and endpoint for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>

# Configure a different KMS encryption key for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>
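
To make the key syntax concrete, the following sketch builds global and per-bucket keys and imitates how a per-bucket value takes precedence over the global one. The function names, the bucket names `eu-data` and `us-logs`, and the endpoint values are assumptions chosen for illustration. The lookup only mimics the override behavior in plain Python; it is not how S3A itself is implemented.

```python
S3A_PREFIX = "spark.hadoop.fs.s3a."


def global_key(option: str) -> str:
    """Return the Spark key for a global S3A option."""
    return f"{S3A_PREFIX}{option}"


def bucket_key(bucket: str, option: str) -> str:
    """Return the Spark key for a per-bucket S3A option."""
    return f"{S3A_PREFIX}bucket.{bucket}.{option}"


def effective_option(conf: dict, bucket: str, option: str, default=None):
    """Look up an option: per-bucket value first, then global, then default."""
    per_bucket = bucket_key(bucket, option)
    if per_bucket in conf:
        return conf[per_bucket]
    return conf.get(global_key(option), default)


# Example configuration: one global endpoint, one bucket-specific override.
conf = {
    global_key("endpoint"): "s3.amazonaws.com",
    bucket_key("eu-data", "endpoint"): "s3.eu-west-1.amazonaws.com",
}

for name in ("eu-data", "us-logs"):
    print(name, effective_option(conf, name, "endpoint"))
```

Here `eu-data` resolves to its own endpoint, while `us-logs` has no override and falls back to the global setting.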

Access Requester Pays buckets

To enable access to Requester Pays buckets, add the following line to your cluster’s Spark configuration:

spark.hadoop.fs.s3a.requester-pays.enabled true

Note

Databricks does not support Delta Lake writes to Requester Pays buckets.

Deprecated patterns for storing and accessing data from Databricks

The following are deprecated storage patterns:

Important

  • The S3A filesystem enables caching by default and releases resources on ‘FileSystem.close()’. To prevent other threads from using a stale reference to the cached file system, do not call ‘FileSystem.close()’ explicitly.

  • The S3A filesystem does not remove directory markers when closing an output stream. Legacy applications based on Hadoop versions that do not include HADOOP-13230 can misinterpret them as empty directories even if there are files inside.