Connect to Amazon S3
This article explains how to connect to AWS S3 from Databricks.
Databricks recommends using Unity Catalog volumes or external locations to connect to S3. See Recommendations for using external locations.
Connect to S3 with Unity Catalog
External locations and storage credentials allow Unity Catalog to read and write data in S3 on behalf of users. Administrators primarily use external locations to configure Unity Catalog external tables.
A storage credential is a Unity Catalog object used for authentication to S3. It is an IAM role that authorizes reading from and writing to an S3 bucket path. An external location is an object that combines a cloud storage path with a storage credential.
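For example, once a storage credential backed by an IAM role exists, you can define an external location with SQL. The following is a minimal sketch; the names my_external_location and my_storage_credential and the bucket path are placeholders:
spark.sql("""
CREATE EXTERNAL LOCATION IF NOT EXISTS my_external_location
URL 's3://my-bucket/external-location'
WITH (STORAGE CREDENTIAL my_storage_credential)
""")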
Who can create and manage volumes?
To create volumes, you must have the following privileges:
USE SCHEMA and CREATE VOLUME on the schema
USE CATALOG on the catalog
(External volumes only) CREATE EXTERNAL LOCATION on the external location
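With these privileges, you can create an external volume on a path governed by an external location. A minimal sketch, assuming the catalog main, the schema default, and an external location that covers the bucket path:
spark.sql("""
CREATE EXTERNAL VOLUME main.default.my_volume
LOCATION 's3://my-bucket/external-location/path/to/volume'
""")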
After you create a volume, the following principals can manage volume privileges:
The owner of the parent catalog.
The owner of the parent schema.
The owner of the volume.
Who can create and manage external locations and storage credentials?
The AWS user who creates the IAM role for the storage credential must:
Be an AWS account user with permission to create or update IAM roles, IAM policies, S3 buckets, and cross-account trust relationships.
The Databricks user who creates the storage credential in Unity Catalog must:
Be a Databricks account admin, a metastore admin, or a user with the CREATE STORAGE CREDENTIAL privilege.
The Databricks user who creates the external location in Unity Catalog must:
Be a metastore admin or a user with the CREATE EXTERNAL LOCATION privilege.
After you create an external location in Unity Catalog, you can grant the following permissions on it:
CREATE TABLE
READ FILES
WRITE FILES
These permissions enable Databricks users to access data in S3 without managing cloud storage credentials for authentication.
For more information, see Manage external locations and storage credentials.
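For example, to grant read-only access on an external location to a group, a metastore admin or the external location owner can run a statement like the following. The names my_external_location and data_readers are placeholders:
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_external_location TO `data_readers`")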
Access S3 buckets with Unity Catalog volumes or external locations
Use the volume path or the fully qualified S3 URI to access data secured with Unity Catalog. Because permissions are managed by Unity Catalog, you do not need to pass any additional options or configurations for authentication.
Volume paths follow the pattern /Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>.
S3 URIs follow the pattern s3://<bucket>/<external-location>/<path>/<file-name>.
Warning
Unity Catalog ignores Spark configuration settings when accessing data managed by external locations.
Examples of reading:
dbutils.fs.ls("s3://my-bucket/external-location/path/to/data")
spark.read.format("parquet").load("s3://my-bucket/external-location/path/to/data")
spark.sql("SELECT * FROM parquet.`s3://my-bucket/external-location/path/to/data`")
Examples of writing:
dbutils.fs.mv("s3://my-bucket/external-location/path/to/data", "s3://my-bucket/external-location/path/to/new-location")
df.write.format("parquet").save("s3://my-bucket/external-location/path/to/new-location")
Examples of creating external tables:
df.write.option("path", "s3://my-bucket/external-location/path/to/table").saveAsTable("my_table")
spark.sql("""
CREATE TABLE my_table
LOCATION "s3://my-bucket/external-location/path/to/table"
AS (SELECT *
FROM parquet.`s3://my-bucket/external-location/path/to/data`)
""")
Access S3 buckets using instance profiles
You can load IAM roles as instance profiles in Databricks and attach instance profiles to clusters to control data access to S3. Databricks recommends using instance profiles when Unity Catalog is unavailable for your environment or workload. For a tutorial on using instance profiles with Databricks, see Configure S3 access with instance profiles.
The AWS user who creates the IAM role must:
Be an AWS account user with permission to create or update IAM roles, IAM policies, S3 buckets, and cross-account trust relationships.
The Databricks user who adds the IAM role as an instance profile in Databricks must:
Be a workspace admin
Once you add the instance profile to your workspace, you can grant users, groups, or service principals permission to launch clusters with the instance profile. See Manage instance profiles in Databricks.
Use both cluster access control and notebook access control together to protect access to the instance profile. See Cluster access control and Collaborate using Databricks notebooks.
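With the instance profile attached to a cluster, reads need no credentials in code; the IAM role on the instance authorizes the requests. A sketch, where my-bucket stands in for a bucket the role can access:
display(dbutils.fs.ls("s3://my-bucket/"))
df = spark.read.format("parquet").load("s3://my-bucket/path/to/data")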
Access S3 buckets with URIs and AWS keys
You can set Spark properties to configure AWS keys to access S3.
Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the AWS key while allowing users to access S3. To create a secret scope, see Secret scopes.
The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to S3. See Cluster access control and Collaborate using Databricks notebooks.
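For notebook-scoped credentials, one common pattern is to read the keys from a secret scope with dbutils.secrets and set them on the Hadoop configuration used by the S3A filesystem. This is a sketch; the scope name, key names, and bucket are placeholders:
access_key = dbutils.secrets.get(scope="scope", key="aws_access_key_id")
secret_key = dbutils.secrets.get(scope="scope", key="aws_secret_access_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
df = spark.read.format("parquet").load("s3a://my-bucket/path/to/data")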
To set Spark properties, use the following snippet in a cluster’s Spark configuration to set the AWS keys stored in secret scopes as environment variables:
AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}
You can then read from S3 using the following commands:
aws_bucket_name = "my-s3-bucket"
df = spark.read.load(f"s3a://{aws_bucket_name}/flowers/delta/")
display(df)
dbutils.fs.ls(f"s3a://{aws_bucket_name}/")
Access S3 with open-source Hadoop options
Databricks Runtime supports configuring the S3A filesystem using open-source Hadoop options. You can configure global properties and per-bucket properties.
Global configuration
# Global S3 configuration
spark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.endpoint <aws-endpoint>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS
Per-bucket configuration
You configure per-bucket properties using the syntax spark.hadoop.fs.s3a.bucket.<bucket-name>.<configuration-key>. This lets you set up buckets with different credentials, endpoints, and so on.
For example, in addition to global S3 settings you can configure each bucket individually using the following keys:
# Set up authentication and endpoint for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>
# Configure a different KMS encryption key for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>
Access Requester Pays buckets
To enable access to Requester Pays buckets, add the following line to your cluster’s Spark configuration:
spark.hadoop.fs.s3a.requester-pays.enabled true
Note
Databricks does not support Delta Lake writes to Requester Pays buckets.
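With the property enabled, reads look the same as for any other bucket, and the requesting account is billed for the data transfer. A sketch with a placeholder bucket name:
df = spark.read.format("parquet").load("s3a://my-requester-pays-bucket/path/to/data")
display(df)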
Deprecated patterns for storing and accessing data from Databricks
The following are deprecated storage patterns:
Databricks no longer recommends mounting external data locations to Databricks Filesystem. See Mounting cloud object storage on Databricks.
Databricks no longer recommends using credential passthrough with S3. See Credential passthrough (legacy).
Important
The S3A filesystem enables caching by default and releases resources on FileSystem.close(). To avoid other threads using a reference to the cached file system incorrectly, do not explicitly call FileSystem.close().
The S3A filesystem does not remove directory markers when closing an output stream. Legacy applications based on Hadoop versions that do not include HADOOP-13230 can misinterpret them as empty directories even if there are files inside.