Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal

You can securely access data in an Azure Data Lake Storage Gen2 (ADLS Gen2) account by authenticating with OAuth 2.0 using an Azure Active Directory (Azure AD) application service principal. Authenticating with a service principal gives you two options for accessing data in your storage account:

  • A mount point to a specific file or path
  • Direct access to data

The option to select depends on how you plan to use Databricks with ADLS Gen2 storage:

  • To provide access to a specific path or file to multiple workspace users, create a mount point to the required storage resource and path.
  • To provide access to multiple workspace users with different permissions, access data directly through the Azure Blob File System (ABFS) driver.

Mount points also have the benefit of being easily accessible across a workspace using standard file system semantics, whereas direct access paths must be fully specified in your notebooks. For multiple users accessing common resources in a workspace, mount points can therefore provide a better experience.

This article describes creating an Azure AD application and service principal and using that service principal to mount or directly access data in an ADLS Gen2 storage account. The following is an overview of the tasks this article walks through:

  1. Register an Azure AD application, which creates an associated service principal used to access the storage account.
  2. Create a secret scope in your Databricks workspace to securely store the client secret associated with the Azure AD application.
  3. Save the client secret in the secret scope. The client secret is required to authenticate to the storage account; keeping it in a secret scope lets you use it without referencing the value directly in configuration.
  4. Assign roles to the application to give the service principal the permissions required to access the ADLS Gen2 storage account.
  5. Create one or more containers inside the storage account. Containers organize objects in an Azure storage account much as directories organize files in a filesystem, and you need at least one container before you can access the account.
  6. Authenticate to the ADLS Gen2 storage account and access data through a mount point or directly.

Requirements

This article assumes you have an Azure subscription with an ADLS Gen2 storage account, permission in your Azure AD tenant to register applications and assign roles, an Azure Databricks workspace, and the Databricks CLI installed and configured.

Register an Azure Active Directory application

Registering an Azure AD application creates a service principal; assigning the appropriate roles to that application then allows the service principal to access ADLS Gen2 storage resources.

  1. In the Azure portal, go to the Azure Active Directory service.

  2. Under Manage, click App registrations.

  3. Click + New registration. Enter a name for the application and click Register.

  4. Click Certificates & secrets.

  5. Click + New client secret.

  6. Add a description for the secret and click Add.

  7. Copy and save the value for the new secret. You can't view the secret value again after you leave the page.

  8. In the application registration overview, copy and save the Application (client) ID and Directory (tenant) ID.

    App registration overview

Add the client secret to a secret scope

Use the Databricks CLI to create a Databricks-backed secret scope and add a new secret for the client secret:

databricks secrets create-scope --scope <scope-name>
databricks secrets put --scope <scope-name> --key <key-name>

Replace

  • <scope-name> with a name for the new scope.
  • <key-name> with a name for the secret value.

Running the put command opens an editor. Enter the value for the client secret above the marker line, then save and exit the editor.
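
If you want to confirm that the secret is available to notebooks, the following is a minimal Python sketch using Databricks Utilities (secret values are shown as [REDACTED] if displayed):

# List the keys in the scope to confirm the secret was created.
dbutils.secrets.list("<scope-name>")

# Retrieve the value for later use in Spark configuration; printing it shows [REDACTED].
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<key-name>")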

Assign roles

You control access to storage resources by assigning roles on the storage account to the Azure AD application registration. This example assigns the Storage Blob Data Contributor role on the ADLS Gen2 storage account. You may need to assign other roles depending on your specific requirements.

  1. In the Azure portal, go to the Storage accounts service.

  2. Select the ADLS Gen2 account to use with this application registration.

  3. Click Access Control (IAM).

  4. Click + Add and select Add role assignment from the dropdown menu.

  5. Set the Select field to the Azure AD application name and set Role to Storage Blob Data Contributor.

  6. Click Save.

    Assign application roles

Create a container

Like directories in a filesystem, containers provide a way to organize objects in an Azure storage account. You’ll need to create one or more containers before you can access an ADLS Gen2 storage account. You can create a container directly in a Databricks notebook or through the Azure command-line interface, the Azure API, or the Azure portal. To create a container through the portal:

  1. In the Azure portal, go to Storage accounts.

  2. Select your ADLS Gen2 account and click Containers.

  3. Click + Container.

  4. Enter a name for your container and click Create.

    Create a container
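
You can also create the container programmatically. The following is a minimal sketch, assuming the azure-identity and azure-storage-file-datalake Python packages are installed (they are not part of this article's setup) and that the role assignment above has taken effect:

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate as the service principal registered earlier.
credential = ClientSecretCredential(
    tenant_id="<directory-id>",
    client_id="<application-id>",
    client_secret="<service-credential>")

# Connect to the ADLS Gen2 account through its DFS endpoint.
service_client = DataLakeServiceClient(
    account_url="https://<storage-account-name>.dfs.core.windows.net",
    credential=credential)

# Create the container (called a file system in the Data Lake Storage APIs).
service_client.create_file_system(file_system="<container-name>")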

Mount ADLS Gen2 storage

To mount ADLS Gen2 storage:

  1. Configure OAuth 2.0 authentication to the ADLS Gen2 storage account, using the service principal credentials.
  2. Create the mount point with dbutils.fs.mount.

Important

  • All users in the Databricks workspace have access to the mounted ADLS Gen2 account. The service principal you use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be granted access to other Azure resources.
  • When you create a mount point through a cluster, cluster users can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use, as shown in the sketch after this list.
  • Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount storage as part of processing.
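
For example, a notebook attached to a different running cluster can make the new mount point visible with a single call (a minimal sketch):

# Refresh this cluster's mount cache so mount points created elsewhere become visible.
dbutils.fs.refreshMounts()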

Run one of the following in your notebook to authenticate and create a mount point, using Python or Scala depending on your notebook language.

configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": "<application-id>",
          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Replace

  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <scope-name> with the Databricks secret scope name.
  • <service-credential-key-name> with the name of the key containing the client secret.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
  • <container-name> with the name of a container in the ADLS Gen2 storage account.
  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <mount-name> with the name of the intended mount point in DBFS.

Access files in your ADLS Gen2 filesystem as if they were files in DBFS:

df = spark.read.text("/mnt/%s/...." % <mount-name>)
df = spark.read.text("dbfs:/mnt/<mount-name>/....")
val df = spark.read.text("/mnt/<mount-name>/....")
val df = spark.read.text("dbfs:/mnt/<mount-name>/....")

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")
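
To check which mount points currently exist, for example to confirm that an unmount succeeded, you can list them; a small Python sketch:

# List all mount points defined for the workspace, including their source URIs.
display(dbutils.fs.mounts())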

Access ADLS Gen2 directly

The way you pass credentials to access storage resources directly depends on whether you plan to use the DataFrame or Dataset API, or the RDD API.

DataFrame or Dataset API

If you are using Spark DataFrame or Dataset APIs, Databricks recommends that you set your account credentials in your notebook’s session configs:

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Replace

  • <storage-account-name> with the name of the ADLS Gen2 storage account.
  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <scope-name> with the Databricks secret scope name.
  • <service-credential-key-name> with the name of the key containing the client secret.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
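
With these session configurations set, you can read from the storage account using fully qualified abfss URIs. The following sketch assumes Parquet data already exists under <directory-name>:

# Read Parquet data directly from ADLS Gen2; authentication uses the session configs above.
df = spark.read.parquet("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
display(df)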

RDD API

If you use the RDD API to access ADLS Gen2, you cannot access Hadoop configuration options set using spark.conf.set(...). Instead, specify the Hadoop configuration options as Spark configs when you create the cluster. You must add the spark.hadoop. prefix to the Hadoop configuration keys to propagate them to the Hadoop configurations used by your RDD jobs.

Warning

These credentials are available to all users who access the cluster.

spark.hadoop.fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <service-credential>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token

Replace

  • <storage-account-name> with the name of the ADLS Gen2 storage account.
  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <service-credential> with the value of the client secret.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
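
With these cluster configurations in place, RDD jobs can authenticate to the storage account. A minimal Python sketch, assuming text files exist under <directory-name>:

# The spark.hadoop.* keys set at cluster creation are propagated to the Hadoop
# configuration used by RDD jobs, so this read authenticates with the service principal.
rdd = spark.sparkContext.textFile("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
print(rdd.take(10))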

Use standard Spark and Databricks APIs to read from the storage account:

val df = spark.read.parquet("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

Replace

  • <container-name> with the name of a container in the ADLS Gen2 storage account.
  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <directory-name> with an optional path in the storage account.
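
Writing works the same way. For example, the following sketch writes a small example DataFrame as JSON under an illustrative example-output path:

# Write a small DataFrame back to the storage account as JSON (the output path is illustrative).
spark.range(10).write.mode("overwrite").json(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>/example-output")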

Example notebook

This notebook demonstrates using a service principal to:

  1. Authenticate to an ADLS Gen2 storage account.
  2. Mount a filesystem in the storage account.
  3. Write a JSON file containing Internet of Things (IoT) data to the new container.
  4. List files using direct access and through the mount point.
  5. Read and display the IoT file using direct access and through the mount point.

ADLS Gen2 OAuth 2.0 with Azure service principals notebook
