Accessing Azure Data Lake Storage Gen1 from Databricks
Microsoft has announced the planned retirement of Azure Data Lake Storage Gen1 (formerly Azure Data Lake Store, also known as ADLS) and recommends all users migrate to Azure Data Lake Storage Gen2. Databricks recommends upgrading to Azure Data Lake Storage Gen2 for best performance and new features.
You can access Azure Data Lake Storage Gen1 directly using a service principal.
Create and grant permissions to service principal
If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:
Create a Microsoft Entra ID (formerly Azure Active Directory) application and service principal that can access resources. Note the following properties:
application-id
: An ID that uniquely identifies the client application.directory-id
: An ID that uniquely identifies the Microsoft Entra ID instance.service-credential
: A string that the application uses to prove its identity.
Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
Access directly with Spark APIs using a service principal and OAuth 2.0
To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
where
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")
retrieves your storage account access key that has been stored as a secret in a secret scope.
After you’ve set up your credentials, you can use standard Spark and Databricks APIs to access the resources. For example:
val df = spark.read.format("parquet").load("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")
dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")
Azure Data Lake Storage Gen1 provides directory level access control, so the service principal must have access to the directories that you want to read from as well as the Azure Data Lake Storage Gen1 resource.
Access through metastore
To access adl://
locations specified in the metastore, you must specify Hadoop credential configuration options as Spark options when you create the cluster by adding the spark.hadoop.
prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used by the metastore:
spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token
Warning
These credentials are available to all users who access the cluster.
Mount Azure Data Lake Storage Gen1 resource or folder
To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following command:
configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
"fs.adl.oauth2.client.id": "<application-id>",
"fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
"fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
val configs = Map(
"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
"fs.adl.oauth2.client.id" -> "<application-id>",
"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
"fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = configs)
where
<mount-name>
is a DBFS path that represents where the Azure Data Lake Storage Gen1 account or a folder inside it (specified insource
) will be mounted in DBFS.dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")
retrieves your storage account access key that has been stored as a secret in a secret scope.
Access files in your container as if they were local files, for example:
df = spark.read.format("text").load("/mnt/<mount-name>/....")
df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")
val df = spark.read.format("text").load("/mnt/<mount-name>/....")
val df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")
Set up service credentials for multiple accounts
You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts for use within in a single Spark session by adding account.<account-name>
to the configuration keys. For example, if you want to set up credentials for both the accounts to access adl://example1.azuredatalakestore.net
and adl://example2.azuredatalakestore.net
, you can do this as follows:
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<application-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example1>/oauth2/token")
spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<application-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example2>/oauth2/token")
This also works for the cluster Spark configuration:
spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.account.example1.oauth2.client.id <application-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token
spark.hadoop.fs.adl.account.example2.oauth2.client.id <application-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token
The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.