Access Azure Data Lake Storage Gen2 and Blob Storage
Use the Azure Blob Filesystem driver (ABFS) to connect to Azure Blob Storage and Azure Data Lake Storage Gen2 from Databricks. Databricks recommends securing access to Azure storage containers by using Azure service principals set in cluster configurations.
Note
Databricks no longer recommends mounting external data locations to Databricks Filesystem. See Mounting cloud object storage on Databricks.
This article details how to access Azure storage containers using:
Unity Catalog managed external locations
Azure service principals
SAS tokens
Account keys
You will set Spark properties to configure these credentials for a compute environment, either:
Scoped to a Databricks cluster
Scoped to a Databricks notebook
Azure service principals can also be used to access Azure storage from Databricks SQL; see Data access configuration.
Databricks recommends using secret scopes for storing all credentials.
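If you are unsure which scopes and keys exist in your workspace, you can list them from a notebook; only scope and key names are returned, never the secret values. A minimal sketch, where the scope name is a placeholder:

# List available secret scopes and the keys within one scope (placeholder scope name).
dbutils.secrets.listScopes()
dbutils.secrets.list("<scope>")

In a cluster-scoped Spark configuration, you can reference a secret with the {{secrets/<scope>/<key>}} syntax rather than pasting its value.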
Deprecated patterns for storing and accessing data from Databricks
The following are deprecated storage patterns:
The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over WASB. See Azure documentation on ABFS. For documentation for working with the legacy WASB driver, see Connect to Azure Blob Storage with WASB (legacy).
Azure has announced the pending retirement of Azure Data Lake Storage Gen1. Databricks recommends migrating all Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. If you have not yet migrated, see Accessing Azure Data Lake Storage Gen1 from Databricks.
Direct access using ABFS URI for Blob Storage or Azure Data Lake Storage Gen2
If you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the abfss driver for greater security.
Python
spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

SQL
CREATE TABLE <database-name>.<table-name>;

COPY INTO <database-name>.<table-name>
FROM 'abfss://container@storageAccount.dfs.core.windows.net/path/to/folder'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true');
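Writes use the same URI scheme once credentials are configured. A minimal sketch, assuming df is a DataFrame you have already created and the path placeholders are filled in:

# Write a DataFrame directly to the storage container (placeholder path).
df.write.format("delta").save("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")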
Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal
You can securely access data in an Azure storage account using OAuth 2.0 with an Azure Active Directory (Azure AD) application service principal for authentication; see Access storage with Azure Active Directory.
service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
Replace
<scope> with the Databricks secret scope name.
<service-credential-key> with the name of the key containing the client secret.
<storage-account> with the name of the Azure storage account.
<application-id> with the Application (client) ID for the Azure Active Directory application.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
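Because each property name embeds the storage account, you can configure several accounts in the same Spark session. The helper below is a hypothetical convenience wrapper (not part of the Databricks API) that applies the five settings above for one account:

# Hypothetical helper: applies the OAuth properties above for a single storage account.
def set_oauth_config(storage_account, application_id, directory_id, service_credential):
    suffix = f"{storage_account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", application_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", service_credential)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
                   f"https://login.microsoftonline.com/{directory_id}/oauth2/token")

# Call once per storage account, for example:
# set_oauth_config("<storage-account>", "<application-id>", "<directory-id>", service_credential)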
Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS token
You can use storage shared access signatures (SAS) to access an Azure Data Lake Storage Gen2 storage account directly. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control.
You can configure SAS tokens for multiple storage accounts in the same Spark session.
Note
SAS support is available in Databricks Runtime 7.5 and above.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
Access Azure Data Lake Storage Gen2 or Blob Storage using the account key
You can use storage account access keys to manage access to Azure Storage.
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
Replace
<storage-account> with the Azure Storage account name.
<scope> with the Databricks secret scope name.
<storage-account-access-key> with the name of the key containing the Azure storage account access key.
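After the key is set, access goes through the same abfss URIs shown earlier, for example (placeholder path):

# List the container contents to confirm the account key is applied.
dbutils.fs.ls("abfss://<container-name>@<storage-account>.dfs.core.windows.net/<path-to-data>")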