Connect to Azure Data Lake Storage and Blob Storage

warning

This article describes a legacy pattern for accessing Azure Data Lake Storage (ADLS) and Blob Storage from a non-Azure Databricks workspace. This pattern bypasses Unity Catalog governance, so use it only if Unity Catalog governance is not required for the data in this storage account.

This article explains how to connect to Azure Data Lake Storage and Blob Storage from Databricks.

note

The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over WASB. See Azure documentation on ABFS. For documentation on working with the legacy WASB driver, see Connect to Azure Blob Storage with WASB (legacy).

Step 1: Register a Microsoft Entra ID application

Registering an application with Microsoft Entra ID creates a service principal you can use to provide access to Azure storage accounts.

To register a Microsoft Entra ID application, you must have the Application Administrator role or the Application.ReadWrite.All permission in Microsoft Entra ID.

  1. In the Azure portal, go to the Microsoft Entra ID service.
  2. Under Manage, click App Registrations.
  3. Click + New registration. Enter a name for the application and click Register.
  4. Click Certificates & Secrets.
  5. Click + New client secret.
  6. Add a description for the secret and click Add.
  7. Copy and save the value for the new secret.
  8. In the application registration overview, copy and save the Application (client) ID and Directory (tenant) ID.
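
Before moving on, it can help to confirm that the three saved values (client secret, Application (client) ID, and Directory (tenant) ID) work together. The following is a minimal sketch using only the Python standard library. It assumes the v1 token endpoint that appears later in this article and the Azure Storage resource URI `https://storage.azure.com/`. The function names are illustrative, not part of any Databricks or Azure SDK.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Azure Storage is the resource the token is requested for.
STORAGE_RESOURCE = "https://storage.azure.com/"


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": STORAGE_RESOURCE,
    }).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def fetch_access_token(request: urllib.request.Request) -> str:
    """Send the request and return the access token; raises on HTTP errors."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]


# Hypothetical usage (requires network access and real values):
# token = fetch_access_token(build_token_request("<directory-id>", "<application-id>", "<client-secret>"))
```

A successful response only shows that the application registration and secret are valid. It does not show that the service principal has any role on a storage account, which is covered in Step 2.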

Step 2: Assign roles to the service principal

You control access to storage resources by assigning roles to a Microsoft Entra ID application registration associated with the storage account. You might need to assign other roles depending on specific requirements.

To assign roles on a storage account, you must have the Owner or User Access Administrator Azure RBAC role on the storage account.

  1. In the Azure portal, go to the Storage accounts service.
  2. Select an Azure storage account to use with this application registration.
  3. Click Access Control (IAM).
  4. Click + Add and select Add role assignment from the dropdown menu.
  5. Set the Select field to the Microsoft Entra ID application name and set Role to Storage Blob Data Contributor.
  6. Click Save.

Step 3: Configure Azure credentials in Databricks

Configure your Databricks cluster or notebook with the credentials for the Azure storage account you want to access.

Supported credential types and secret storage

The following credentials can be used to access Azure Data Lake Storage or Blob Storage:

  • OAuth 2.0 with a Microsoft Entra ID service principal: Databricks recommends using Microsoft Entra ID service principals to connect to Azure Data Lake Storage. To create a Microsoft Entra ID service principal and provide it access to Azure storage accounts, complete Steps 1 and 2 above.

    To create a Microsoft Entra ID service principal, you must have the Application Administrator role or the Application.ReadWrite.All permission in Microsoft Entra ID. To assign roles on a storage account, you must be an Owner or a user with the User Access Administrator Azure RBAC role on the storage account.

    important

    Blob storage does not support Microsoft Entra ID service principals.

  • Shared access signatures (SAS): You can use storage SAS tokens to access Azure storage. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control.

    You can only grant a SAS token permissions that you have on the storage account, container, or file yourself.

  • Account keys: You can use storage account access keys to manage access to Azure Storage. Storage account access keys provide full access to the configuration of a storage account, as well as the data. Databricks recommends using a Microsoft Entra ID service principal or a SAS token to connect to Azure storage instead of account keys.

    To view an account's access keys, you must have the Owner, Contributor, or Storage Account Key Operator Service role on the storage account.

Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the Azure credentials while allowing users to access Azure storage. To create a secret scope, see Manage secret scopes.
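
For the SAS token and account key options listed above, the ABFS driver reads its settings from per-account properties. The sketch below builds those properties as plain dictionaries so they can be inspected before being applied. The helper names and the secret key names in the usage comments are hypothetical. The property names are the ones the ABFS driver uses on Databricks, but verify them against your runtime version.

```python
def sas_token_conf(storage_account: str, sas_token: str) -> dict:
    """Properties for authenticating to one storage account with a fixed SAS token."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "SAS",
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }


def account_key_conf(storage_account: str, account_key: str) -> dict:
    """Property for authenticating to one storage account with an account key."""
    return {f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": account_key}


# Hypothetical usage in a Databricks notebook, where `spark` and `dbutils` are predefined:
# sas = dbutils.secrets.get(scope="<secret-scope>", key="<sas-token-key>")
# for name, value in sas_token_conf("<storage-account>", sas).items():
#     spark.conf.set(name, value)
```

Reading the token or key from a secret scope, as in the usage comment, keeps the credential out of notebook source and revision history.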

Set Spark properties to configure Azure credentials

You can set Spark properties to configure Azure credentials to access Azure storage. The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to Azure storage. See Compute permissions and Collaborate using Databricks notebooks.

You can set these properties either in a cluster's Spark configuration or in a notebook.

Use the following format to set the cluster Spark configuration:

ini
spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
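
Each line in the cluster format is an ABFS driver property (`fs.azure...`) prefixed with `spark.hadoop.`, followed by a space and the value. As a rough illustration, the hypothetical helper below renders a dictionary of driver properties into that format. The dictionary contents reuse the placeholders from this article and are not real values.

```python
def to_cluster_spark_conf(driver_properties: dict) -> str:
    """Render fs.azure.* properties as cluster Spark config lines ("key value")."""
    lines = []
    for key, value in driver_properties.items():
        if not key.startswith("fs.azure."):
            raise ValueError(f"not an ABFS driver property: {key}")
        # Spark copies spark.hadoop.* entries into the Hadoop configuration.
        lines.append(f"spark.hadoop.{key} {value}")
    return "\n".join(lines)


example = {
    "fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net": "OAuth",
    "fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net":
        "{{secrets/<secret-scope>/<service-credential-key>}}",
}
print(to_cluster_spark_conf(example))
```

The `{{secrets/<secret-scope>/<service-credential-key>}}` value is passed through unchanged, so the secret itself never appears in the generated text.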

You can use spark.conf.set in notebooks, as shown in the following example:

Python
# Read the client secret from a Databricks secret scope instead of hardcoding it.
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")

# Authenticate to this storage account with OAuth 2.0 client credentials.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
# Token endpoint for the Microsoft Entra ID tenant that owns the application.
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Replace the following placeholders:

  • <secret-scope> with the Databricks secret scope name.
  • <service-credential-key> with the name of the key containing the client secret.
  • <storage-account> with the name of the Azure storage account.
  • <application-id> with the Application (client) ID for the Microsoft Entra ID application.
  • <directory-id> with the Directory (tenant) ID for the Microsoft Entra ID application.
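
If a notebook needs the same OAuth settings for several storage accounts, the five properties can be generated from the placeholder values instead of being repeated by hand. This is a minimal sketch under that assumption. `oauth_conf` and `apply_conf` are illustrative names, and `apply_conf` accepts any setter with the shape of `spark.conf.set`.

```python
from typing import Callable, Dict


def oauth_conf(storage_account: str, application_id: str,
               directory_id: str, client_secret: str) -> Dict[str, str]:
    """Build the OAuth service principal properties for one storage account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": application_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def apply_conf(settings: Dict[str, str], set_conf: Callable[[str, str], None]) -> None:
    """Apply each property through the given setter, for example spark.conf.set."""
    for name, value in settings.items():
        set_conf(name, value)


# Hypothetical usage in a Databricks notebook, where `spark` and `dbutils` are predefined:
# secret = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")
# for account in ["<storage-account-1>", "<storage-account-2>"]:
#     apply_conf(oauth_conf(account, "<application-id>", "<directory-id>", secret), spark.conf.set)
```

Because every property name ends with the storage account host name, settings for different accounts do not overwrite each other within one Spark session.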

Step 4: Access Azure storage

Once you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the abfss driver for greater security.

Python
spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

SQL
CREATE TABLE <database-name>.<table-name>;

COPY INTO <database-name>.<table-name>
FROM 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true');
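
The URIs in the examples above all follow the pattern `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>`. A small helper can assemble them consistently. The function below is a hypothetical convenience, not a Databricks API, and it does no escaping of special characters in the path.

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Assemble an abfss:// URI for a container, storage account, and optional path."""
    if not container or not storage_account:
        raise ValueError("container and storage_account are required")
    # Strip leading slashes so the result never contains "//" after the host.
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical usage in a Databricks notebook:
# df = spark.read.load(abfss_uri("<container-name>", "<storage-account-name>", "<path-to-data>"))
```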

Example notebook

ADLS OAuth 2.0 with Microsoft Entra ID (formerly Azure Active Directory) service principals notebook

Azure Data Lake Storage known issues

If you try accessing a storage container created through the Azure portal, you might receive the following error:

StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.

When a hierarchical namespace is enabled, you don't need to create containers through the Azure portal. If you see this issue, delete the Blob container through the Azure portal. After a few minutes, you can access the container. Alternatively, you can change your abfss URI to use a different container, as long as this container is not created through the Azure portal.

See Known issues with Azure Data Lake Storage in the Microsoft documentation.
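
To separate this known issue from other failures, such as missing role assignments, a notebook can look for the error code in the exception text. The sketch below assumes the error fields shown above appear in the exception message. How the error actually surfaces depends on the API that raised it, so treat the check as a heuristic. The function name is illustrative.

```python
def is_filesystem_not_found(error: BaseException) -> bool:
    """Return True if the exception text looks like the FilesystemNotFound error."""
    message = str(error)
    return ("FilesystemNotFound" in message
            or "The specified filesystem does not exist" in message)


# Hypothetical usage in a Databricks notebook, where `dbutils` is predefined:
# try:
#     dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
# except Exception as error:
#     if is_filesystem_not_found(error):
#         print("Container not visible through ABFS; see the known issue above.")
#     else:
#         raise
```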

Deprecated patterns for storing and accessing data from Databricks

The following are deprecated storage patterns: