Connect to Amazon S3
Note
This article describes legacy patterns for configuring access to S3. Databricks recommends using Unity Catalog to configure access to S3 and volumes for direct interaction with files. See Connect to cloud object storage and services using Unity Catalog.
This article explains how to connect to AWS S3 from Databricks.
Access S3 buckets using instance profiles
You can load IAM roles as instance profiles in Databricks and attach instance profiles to clusters to control data access to S3. Databricks recommends using instance profiles when Unity Catalog is unavailable for your environment or workload. For a tutorial on using instance profiles with Databricks, see Tutorial: Configure S3 access with an instance profile.
The AWS user who creates the IAM role must:
Be an AWS account user with permission to create or update IAM roles, IAM policies, S3 buckets, and cross-account trust relationships.
The Databricks user who adds the IAM role as an instance profile in Databricks must:
Be a workspace admin
Once you add the instance profile to your workspace, you can grant users, groups, or service principals permission to launch clusters with the instance profile. See Manage instance profiles in Databricks.
Use both cluster access control and notebook access control together to protect access to the instance profile. See Compute permissions and Collaborate using Databricks notebooks.
Access S3 buckets with URIs and AWS keys
You can set Spark properties to configure AWS keys to access S3.
Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the AWS key while allowing users to access S3. To create a secret scope, see Manage secret scopes.
The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to S3. See Compute permissions and Collaborate using Databricks notebooks.
To set Spark properties, use the following snippet in a cluster’s Spark configuration to set the AWS keys stored in secret scopes as environment variables:
AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}
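The two lines above follow the pattern NAME={{secrets/<scope>/<key>}}. As a minimal sketch, the Python helper below renders such lines for a given scope. The function name secret_env_line is hypothetical, and the helper only formats text; it does not read or resolve any secret.

```python
def secret_env_line(env_name, scope, key):
    """Format one environment-variable line that references a secret."""
    # The double braces are literal characters in the output,
    # not Python string formatting.
    return env_name + "={{secrets/" + scope + "/" + key + "}}"


# Scope and key names mirror the snippet above.
lines = [
    secret_env_line("AWS_SECRET_ACCESS_KEY", "scope", "aws_secret_access_key"),
    secret_env_line("AWS_ACCESS_KEY_ID", "scope", "aws_access_key_id"),
]
print("\n".join(lines))
```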
You can then read from S3 using the following commands:
aws_bucket_name = "my-s3-bucket"
df = spark.read.load(f"s3a://{aws_bucket_name}/flowers/delta/")
display(df)
dbutils.fs.ls(f"s3a://{aws_bucket_name}/")
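For the notebook-scoped alternative mentioned earlier, a minimal sketch might read the keys from the secret scope at run time and set them on the Hadoop configuration. This assumes the open-source S3A properties fs.s3a.access.key and fs.s3a.secret.key and a cluster where the notebook can reach the SparkContext. The helper name configure_s3a_keys is hypothetical, and the scope and key names mirror the snippet above.

```python
def configure_s3a_keys(get_secret, hadoop_conf, scope="scope"):
    """Copy AWS keys from a secret scope into S3A Hadoop properties.

    get_secret:  callable taking (scope, key) and returning the secret string.
    hadoop_conf: any object with a set(name, value) method.
    """
    hadoop_conf.set("fs.s3a.access.key", get_secret(scope, "aws_access_key_id"))
    hadoop_conf.set("fs.s3a.secret.key", get_secret(scope, "aws_secret_access_key"))


# In a Databricks notebook, the call could look like this (assumption:
# `dbutils` and `sc` are the objects the notebook provides):
#
#   configure_s3a_keys(
#       lambda scope, key: dbutils.secrets.get(scope=scope, key=key),
#       sc._jsc.hadoopConfiguration(),
#   )
```

Passing the secret reader and the configuration object as arguments keeps the helper independent of the notebook globals, so it can be exercised with stand-ins outside Databricks.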
Access S3 with open-source Hadoop options
Databricks Runtime supports configuring the S3A filesystem using open-source Hadoop options. You can configure global properties and per-bucket properties.
Global configuration
# Global S3 configuration
spark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.endpoint <aws-endpoint>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS
Per-bucket configuration
You configure per-bucket properties using the syntax spark.hadoop.fs.s3a.bucket.<bucket-name>.<configuration-key>. This lets you set up buckets with different credentials, endpoints, and so on.
For example, in addition to global S3 settings, you can configure each bucket individually using the following keys:
# Set up authentication and endpoint for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>
# Configure a different KMS encryption key for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>
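To illustrate the per-bucket syntax, the sketch below expands a set of S3A options into Spark property lines for individual buckets. The helper name per_bucket_properties, the bucket names, and the endpoint values are hypothetical placeholders, not settings from this article.

```python
def per_bucket_properties(bucket, options):
    """Render S3A options as per-bucket Spark property lines."""
    prefix = "spark.hadoop.fs.s3a.bucket." + bucket + "."
    # One "<property> <value>" line per option, in insertion order.
    return [prefix + key + " " + value for key, value in options.items()]


# Two buckets, each pointed at a different regional endpoint (placeholder values).
props = (
    per_bucket_properties("bucket-a", {"endpoint": "s3.us-west-2.amazonaws.com"})
    + per_bucket_properties("bucket-b", {"endpoint": "s3.eu-central-1.amazonaws.com"})
)
print("\n".join(props))
```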
Access Requester Pays buckets
To enable access to Requester Pays buckets, add the following line to your cluster’s Spark configuration:
spark.hadoop.fs.s3a.requester-pays.enabled true
Note
Databricks does not support Delta Lake writes to Requester Pays buckets.
Deprecated patterns for storing and accessing data from Databricks
The following are deprecated storage patterns:
Databricks no longer recommends mounting external data locations to Databricks Filesystem. See Mounting cloud object storage on Databricks.
Databricks no longer recommends using credential passthrough with S3. See Credential passthrough (legacy).
Important
The S3A filesystem enables caching by default and releases resources on FileSystem.close(). To prevent other threads from using a reference to the cached file system incorrectly, do not explicitly call FileSystem.close().
The S3A filesystem does not remove directory markers when closing an output stream. Legacy applications based on Hadoop versions that do not include HADOOP-13230 can misinterpret them as empty directories even if there are files inside.