Mounting cloud object storage on Databricks
Important
Mounts are a legacy access pattern. Databricks recommends using Unity Catalog for managing all data access. See Connect to cloud object storage and services using Unity Catalog.
Databricks enables users to mount cloud object storage to the Databricks File System (DBFS) to simplify data access patterns for users that are unfamiliar with cloud concepts. Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and instead managing data governance with Unity Catalog.
How does Databricks mount cloud object storage?
Databricks mounts create a link between a workspace and cloud object storage, which enables you to interact with cloud object storage using familiar file paths relative to the Databricks file system. Mounts work by creating a local alias under the /mnt directory that stores the following information:
Location of the cloud object storage.
Driver specifications to connect to the storage account or container.
Security credentials required to access the data.
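To see which aliases currently exist under /mnt, you can inspect the output of dbutils.fs.mounts(). The following is a minimal sketch, not an official utility: the helper name describe_mounts and its prefix parameter are hypothetical, and the code assumes each entry returned by dbutils.fs.mounts() exposes mountPoint and source attributes.

```python
def describe_mounts(dbutils, prefix="/mnt/"):
    """Return sorted (mount_point, source) pairs for mounts under `prefix`.

    `dbutils` is passed in explicitly so the helper can be exercised
    outside a notebook with a stand-in object.
    """
    return sorted(
        (m.mountPoint, m.source)
        for m in dbutils.fs.mounts()
        if m.mountPoint.startswith(prefix)
    )
```

In a notebook, describe_mounts(dbutils) would return one pair per mount point under /mnt, which can help confirm where an alias points before you read from it.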
What is the syntax for mounting storage?
The source specifies the URI of the object storage (and can optionally encode security credentials). The mount_point specifies the local path in the /mnt directory. Some object storage sources support an optional encryption_type argument. For some access patterns you can pass additional configuration specifications as a dictionary to extra_configs.
Note
Databricks recommends setting mount-specific Spark and Hadoop configuration as options using extra_configs. This ensures that configurations are tied to the mount rather than the cluster or session.
dbutils.fs.mount(
source: str,
mount_point: str,
encryption_type: Optional[str] = "",
extra_configs: Optional[dict[str, str]] = None
)
Check with your workspace and cloud administrators before configuring or altering data mounts, as improper configuration can provide unsecured access to all users in your workspace.
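Because mounts are shared across the workspace, a notebook that is run more than once may try to mount a path that already exists. The following is a minimal sketch of a guard around the call shown above; the helper name mount_if_absent is hypothetical, and the code assumes that dbutils.fs.mounts() entries expose a mountPoint attribute.

```python
def mount_if_absent(dbutils, source, mount_point, extra_configs=None):
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created, False if one was already present.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False

    kwargs = {"source": source, "mount_point": mount_point}
    if extra_configs is not None:
        # Only pass extra_configs when the caller supplied one.
        kwargs["extra_configs"] = extra_configs
    dbutils.fs.mount(**kwargs)
    return True
```

The check compares mount point paths only. It does not verify that an existing mount points at the same source, so treat a False return as a prompt to inspect the mount rather than as proof that it is the one you expect.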
Note
In addition to the approaches described in this article, you can automate mounting a bucket with the Databricks Terraform provider and databricks_mount.
Unmount a mount point
To unmount a mount point, use the following command:
dbutils.fs.unmount("/mnt/<mount-name>")
Warning
To avoid errors, never modify a mount point while other jobs are reading or writing to it. After modifying a mount, always run dbutils.fs.refreshMounts() on all other running clusters to propagate any mount updates. See refreshMounts command (dbutils.fs.refreshMounts).
Mount an S3 bucket
You can mount an S3 bucket through DBFS (see What is DBFS?). The mount is a pointer to an S3 location, so the data is never synced locally.
After a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available.
You can use the following methods to mount an S3 bucket:
Mount a bucket using an AWS instance profile
You can manage authentication and authorization for an S3 bucket using an AWS instance profile. Access to the objects in the bucket is determined by the permissions granted to the instance profile. If the role has write access, users of the mount point can write objects in the bucket. If the role has read access, users of the mount point will be able to read objects in the bucket.
Configure your cluster with an instance profile.
Mount the bucket.
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"
dbutils.fs.mount(f"s3a://{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
val AwsBucketName = "<aws-bucket-name>"
val MountName = "<mount-name>"
dbutils.fs.mount(s"s3a://$AwsBucketName", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))
Mount a bucket using AWS keys
You can mount a bucket using AWS keys.
Important
When you mount an S3 bucket using keys, all users have read and write access to all the objects in the S3 bucket.
The following examples use Databricks secrets to store the keys. You must URL escape the secret key.
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"
dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
val AccessKey = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
val SecretKey = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
// URL-escape the secret key, because it can contain "/"
val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "<aws-bucket-name>"
val MountName = "<mount-name>"
dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))
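The examples above escape only the / character. As a hedged alternative, the following sketch percent-encodes every reserved character in the secret key with the Python standard library. The helper name s3a_source_with_keys is hypothetical, and the approach assumes the mount accepts standard percent-escapes in the credentials part of the URI, which is what the URL-escape requirement above implies.

```python
from urllib.parse import quote


def s3a_source_with_keys(access_key, secret_key, bucket_name):
    """Build an s3a:// source URI with the secret key fully URL-escaped."""
    # safe="" makes quote() escape "/" as well, which it leaves alone by default.
    encoded_secret_key = quote(secret_key, safe="")
    return f"s3a://{access_key}:{encoded_secret_key}@{bucket_name}"
```

In a notebook, you would pass the values returned by dbutils.secrets.get and hand the result to dbutils.fs.mount as the source. Avoid printing the returned URI, because it contains the credentials.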
Mount a bucket using instance profiles with the AssumeRole policy
You must first configure cross-account access as described in Access cross-account S3 buckets with an AssumeRole policy.
Mount buckets while setting S3 options in extra_configs (Python) or extraConfigs (Scala):
dbutils.fs.mount("s3a://<s3-bucket-name>", "/mnt/<s3-bucket-name>",
extra_configs = {
"fs.s3a.credentialsType": "AssumeRole",
"fs.s3a.stsAssumeRole.arn": "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
"fs.s3a.canned.acl": "BucketOwnerFullControl",
"fs.s3a.acl.default": "BucketOwnerFullControl"
}
)
dbutils.fs.mount("s3a://<s3-bucket-name>", "/mnt/<s3-bucket-name>",
extraConfigs = Map(
"fs.s3a.credentialsType" -> "AssumeRole",
"fs.s3a.stsAssumeRole.arn" -> "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
"fs.s3a.canned.acl" -> "BucketOwnerFullControl",
"fs.s3a.acl.default" -> "BucketOwnerFullControl"
)
)
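If several notebooks mount buckets with the same pattern, the configuration dictionary can be built in one place. The following is a minimal sketch that reproduces the keys used above; the helper name assume_role_configs and its bucket_owner_full_control flag are hypothetical conveniences, not part of dbutils.

```python
def assume_role_configs(role_arn, bucket_owner_full_control=True):
    """Return an extra_configs dictionary for mounting with an AssumeRole policy."""
    configs = {
        "fs.s3a.credentialsType": "AssumeRole",
        "fs.s3a.stsAssumeRole.arn": role_arn,
    }
    if bucket_owner_full_control:
        # Same ACL settings as in the example above.
        configs["fs.s3a.canned.acl"] = "BucketOwnerFullControl"
        configs["fs.s3a.acl.default"] = "BucketOwnerFullControl"
    return configs
```

The returned dictionary can be passed as extra_configs to dbutils.fs.mount. With bucket_owner_full_control=False it contains only the two AssumeRole keys.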
Encrypt data in S3 buckets
Databricks supports encrypting data using server-side encryption. This section covers how to use server-side encryption when writing files in S3 through DBFS. Databricks supports Amazon S3-managed encryption keys (SSE-S3) and AWS KMS–managed encryption keys (SSE-KMS).
Write files using SSE-S3
To mount your S3 bucket with SSE-S3, run the following command:
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")
To write files to the corresponding S3 bucket with SSE-S3, run:
dbutils.fs.put(s"/mnt/$MountName", "<file content>")
Write files using SSE-KMS
Mount a source directory passing in sse-kms or sse-kms:$KmsKey as the encryption type.
To mount your S3 bucket with SSE-KMS using the default KMS master key, run:
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms")
To mount your S3 bucket with SSE-KMS using a specific KMS key, run:
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-kms:$KmsKey")
To write files to the S3 bucket with SSE-KMS, run:
dbutils.fs.put(s"/mnt/$MountName", "<file content>")
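The encryption examples above are written in Scala. In Python, the same strings would be passed through the encryption_type argument of dbutils.fs.mount. The following sketch only builds and validates that string; the helper name s3_encryption_type is hypothetical.

```python
def s3_encryption_type(mode, kms_key=None):
    """Return the encryption_type string for an S3 mount.

    mode is "sse-s3" or "sse-kms". kms_key is optional and only valid with
    "sse-kms"; when omitted, the default KMS key is used.
    """
    if mode == "sse-s3":
        if kms_key is not None:
            raise ValueError("kms_key applies only to sse-kms")
        return "sse-s3"
    if mode == "sse-kms":
        return f"sse-kms:{kms_key}" if kms_key else "sse-kms"
    raise ValueError(f"unsupported encryption mode: {mode!r}")
```

For example, dbutils.fs.mount(source, mount_point, encryption_type=s3_encryption_type("sse-kms", "<kms-key>")) would request SSE-KMS with a specific key, where source and mount_point are built as in the earlier examples.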
Mounting S3 buckets with the Databricks commit service
If you plan to write to a given table stored in S3 from multiple clusters or workloads simultaneously, Databricks recommends that you configure Databricks S3 commit services. Your notebook code must mount the bucket and add the AssumeRole configuration. This step is necessary only for DBFS mounts, not for accessing DBFS root storage in your workspace’s root S3 bucket. The following example uses Python:
# If other code has already mounted the bucket without using the new role, unmount it first
dbutils.fs.unmount("/mnt/<mount-name>")
# mount the bucket and assume the new role
dbutils.fs.mount("s3a://<bucket-name>/", "/mnt/<mount-name>", extra_configs = {
"fs.s3a.credentialsType": "AssumeRole",
"fs.s3a.stsAssumeRole.arn": "<role-arn>"
})
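The example above calls unmount unconditionally, which typically fails when nothing is mounted at that path. The following is a hedged sketch of a guarded variant that unmounts only when the mount point exists and then mounts with the supplied configuration. The helper name remount_with_configs is hypothetical, and the code assumes dbutils.fs.mounts() entries expose a mountPoint attribute.

```python
def remount_with_configs(dbutils, source, mount_point, extra_configs):
    """Replace any existing mount at `mount_point` with one using `extra_configs`.

    Returns True if an existing mount was removed first, False otherwise.
    """
    already_mounted = any(
        m.mountPoint == mount_point for m in dbutils.fs.mounts()
    )
    if already_mounted:
        dbutils.fs.unmount(mount_point)
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return already_mounted
```

As with any mount change, run this only when no other jobs are reading from or writing to the mount point, and run dbutils.fs.refreshMounts() on other running clusters afterward.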
Mount ADLS Gen2 or Blob Storage with ABFS
You can mount data in an Azure storage account using a Microsoft Entra ID application service principal for authentication. For more information, see Access storage using a service principal & Microsoft Entra ID (Azure Active Directory).
Important
All users in the Databricks workspace have access to the mounted ADLS Gen2 account. The service principal you use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be granted access to other Azure resources.
When you create a mount point through a cluster, cluster users can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount storage as part of processing.
Mount points that use secrets are not automatically refreshed. If mounted storage relies on a secret that is rotated, expires, or is deleted, errors can occur, such as 401 Unauthorized. To resolve such an error, you must unmount and remount the storage.
Hierarchical namespace (HNS) must be enabled to successfully mount an Azure Data Lake Storage Gen2 storage account using the ABFS endpoint.
Run the following in your notebook to authenticate and create a mount point.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
val configs = Map(
"fs.azure.account.auth.type" -> "OAuth",
"fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id" -> "<application-id>",
"fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mountPoint = "/mnt/<mount-name>",
extraConfigs = configs)
Replace
<application-id> with the Application (client) ID for the Azure Active Directory application.
<scope-name> with the Databricks secret scope name.
<service-credential-key-name> with the name of the key containing the client secret.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
<container-name> with the name of a container in the ADLS Gen2 storage account.
<storage-account-name> with the ADLS Gen2 storage account name.
<mount-name> with the name of the intended mount point in DBFS.
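To make the substitutions above less error-prone, the placeholders can be turned into function parameters. The following is a minimal sketch that builds the same configuration dictionary and source URI as the Python example; the helper names abfs_oauth_configs and abfss_source are hypothetical, and the client secret is expected to come from dbutils.secrets.get in the calling notebook.

```python
def abfs_oauth_configs(application_id, client_secret, directory_id):
    """Return extra_configs for OAuth access with a service principal."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": application_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def abfss_source(container_name, storage_account_name, directory_name=""):
    """Return the abfss:// source URI, optionally scoped to a directory."""
    base = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"
    # Strip slashes so "raw/", "/raw", and "raw" all yield the same URI.
    return base + directory_name.strip("/")
```

In a notebook, the two results would be passed to dbutils.fs.mount as extra_configs and source, together with mount_point set to /mnt/<mount-name>.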