Azure Data Lake Storage Gen1

Note

Microsoft has released its next-generation data lake store, Azure Data Lake Storage Gen2.

Azure Data Lake Storage Gen1 (formerly Azure Data Lake Store, also known as ADLS) is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake Storage Gen1 enables you to capture data of any size, type, and ingestion speed in a single place for operational and exploratory analytics. Azure Data Lake Storage Gen1 is specifically designed to enable analytics on the stored data and is tuned for performance for data analytics scenarios.

There are two ways of accessing Azure Data Lake Storage Gen1:

  1. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
  2. Use a service principal directly.

Create and grant permissions to service principal

If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:

  1. Create an Azure AD application and service principal that can access resources. Note the following properties:
    • client-id: An ID that uniquely identifies the client application.
    • directory-id: An ID that uniquely identifies the Azure AD instance.
    • service-credential: A string that the application uses to prove its identity.
  2. Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.

Mount Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0

You can mount an Azure Data Lake Storage Gen1 resource or a folder inside it to Databricks File System. The mount is a pointer to data lake storage, so the data is never synced locally.

Note

Accessing Azure Data Lake Storage Gen1 requires Databricks Runtime 4.0 or above. Once an Azure Data Lake Storage Gen1 account is mounted, you can use Databricks Runtime 3.4 or above to access the mount point.

Important

  • All users in the Databricks workspace have access to the mounted Azure Data Lake Storage Gen1 account. The service client that you use to access the Azure Data Lake Storage Gen1 account should be granted access only to that Azure Data Lake Storage Gen1 account; it should not be granted access to other resources in Azure.
  • Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.

DBFS uses the credential you provide when you create the mount point to access the mounted Azure Data Lake Storage Gen1 account.
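
For example, if a mount point was created from a different cluster, a notebook attached to another running cluster can pick it up with a call such as the following (a minimal sketch; the mount name is illustrative):

Python
# Refresh this cluster's view of DBFS mounts so mount points created elsewhere become visible.
dbutils.fs.refreshMounts()

# The mount point can then be listed as usual.
display(dbutils.fs.ls("/mnt/<mount-name>"))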

Mount Azure Data Lake Storage Gen1 resource or folder

Note

As of Databricks Runtime 6.0, the dfs.adls. prefix for Azure Data Lake Storage Gen1 configuration keys has been deprecated in favor of the new fs.adl. prefix. Backward compatibility is maintained, which means you can still use the old prefix. However, there are two caveats when using the old prefix. The first is that even though keys set with the old prefix are correctly propagated, calling spark.conf.get with a key that uses the new prefix fails unless that key has been set explicitly. The second is that any error message referencing an Azure Data Lake Storage Gen1 configuration key always uses the new prefix. For Databricks Runtime versions prior to 6.0, you must always use the old prefix.
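
The first caveat can be illustrated with a short snippet (a sketch, assuming Databricks Runtime 6.0 or above in a notebook session):

Python
# Setting a key with the deprecated dfs.adls. prefix is still propagated to the filesystem layer.
spark.conf.set("dfs.adls.oauth2.client.id", "<client-id>")

# Reading the old-prefix key back works...
spark.conf.get("dfs.adls.oauth2.client.id")

# ...but reading the equivalent new-prefix key fails unless it has been set explicitly.
spark.conf.get("fs.adl.oauth2.client.id")  # raises an exception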

To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following command:

Scala
val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> "<client-id>",
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)
Python
configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
           "fs.adl.oauth2.client.id": "<client-id>",
           "fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
           "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

where

  • <mount-name> is a DBFS path that represents where the Azure Data Lake Storage Gen1 account, or a folder inside it (specified in source), will be mounted in DBFS.
  • dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your service credential that has been stored as a secret in a secret scope.

Access files in the mounted Azure Data Lake Storage Gen1 resource as if they were local files, for example:

Scala
val df = spark.read.text("/mnt/<mount-name>/....")
val df = spark.read.text("dbfs:/<mount-name>/....")
Python
df = spark.read.text("/mnt/%s/...." % <mount-name>)
df = spark.read.text("dbfs:/<mount-name>/....")

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")

Access directly with Spark APIs using a service principal and OAuth 2.0

You can access an Azure Data Lake Storage Gen1 storage account directly (as opposed to mounting it with DBFS) by using OAuth 2.0 with the service principal.

Access using the DataFrame API

To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<client-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your service credential that has been stored as a secret in a secret scope.

Once your credentials are set up, you can use standard Spark and Databricks APIs to read from the resource. For example:

val df = spark.read.parquet("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

Azure Data Lake Storage Gen1 provides directory-level access control, so the service principal must have access to both the Azure Data Lake Storage Gen1 resource and the directories that you want to read from.

Note

As of Databricks Runtime 6.0, the dfs.adls. prefix for Azure Data Lake Storage Gen1 configuration keys has been deprecated in favor of the new fs.adl. prefix. Backward compatibility is maintained, which means you can still use the old prefix. However, there are two caveats when using the old prefix. The first is that even though keys set with the old prefix are correctly propagated, calling spark.conf.get with a key that uses the new prefix fails unless that key has been set explicitly. The second is that any error message referencing an Azure Data Lake Storage Gen1 configuration key always uses the new prefix. For Databricks Runtime versions prior to 6.0, you must always use the old prefix.

Access with the RDD API

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Data Lake Storage Gen1, you must set the credentials using one of the following methods:

  • Specify the Hadoop credential configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used for your RDD jobs:

    spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
    spark.hadoop.fs.adl.oauth2.client.id <client-id>
    spark.hadoop.fs.adl.oauth2.credential <service-credential>
    spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.client.id", "<client-id>")
    spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
    spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
    

    where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves your service credential that has been stored as a secret in a secret scope.
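
After the credentials are available in the Hadoop configuration (by either method above), RDD reads against the adl:// path work as usual, for example (a sketch assuming the directory contains text files):

Python
# The credentials are resolved from the Hadoop configuration, not from spark.conf.
rdd = spark.sparkContext.textFile("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")
print(rdd.count())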

Note

As of Databricks Runtime 6.0, the dfs.adls. prefix for Azure Data Lake Storage Gen1 configuration keys has been deprecated in favor of the new fs.adl. prefix. Backward compatibility is maintained, which means you can still use the old prefix. However, there are two caveats when using the old prefix. The first is that even though keys set with the old prefix are correctly propagated, calling spark.conf.get with a key that uses the new prefix fails unless that key has been set explicitly. The second is that any error message referencing an Azure Data Lake Storage Gen1 configuration key always uses the new prefix. For Databricks Runtime versions prior to 6.0, you must always use the old prefix.

Warning

These credentials are available to all users who access the cluster.

Access through metastore

To access adl:// locations specified in the metastore, you must specify the Hadoop credential configuration options as Spark options when you create the cluster. Add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used by the metastore:

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <client-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token
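
With these cluster options in place, metastore operations that resolve adl:// locations need no additional session configuration. For example, a table whose location points at the Gen1 account can be defined and queried as usual (a sketch; the table name is illustrative):

Python
# Hypothetical external table whose data lives at an adl:// location registered in the metastore.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events
  USING PARQUET
  LOCATION 'adl://<storage-resource>.azuredatalakestore.net/<directory-name>'
""")
spark.sql("SELECT count(*) FROM events").show()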

Note

As of Databricks Runtime 6.0, the dfs.adls. prefix for Azure Data Lake Storage Gen1 configuration keys has been deprecated in favor of the new fs.adl. prefix. Backward compatibility is maintained, which means you can still use the old prefix. However, there are two caveats when using the old prefix. The first is that even though keys set with the old prefix are correctly propagated, calling spark.conf.get with a key that uses the new prefix fails unless that key has been set explicitly. The second is that any error message referencing an Azure Data Lake Storage Gen1 configuration key always uses the new prefix. For Databricks Runtime versions prior to 6.0, you must always use the old prefix.

Warning

These credentials are available to all users who access the cluster.

Set up service credentials for multiple accounts

Note

Requires Databricks Runtime 6.0 or above.

You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts for use within a single Spark session by adding account.<account-name> to the configuration keys. For example, to set up credentials for the accounts at adl://example1.azuredatalakestore.net and adl://example2.azuredatalakestore.net, you can do the following:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")

spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<client-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example1>/oauth2/token")

spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<client-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example2>/oauth2/token")

This also works for the cluster Spark configuration:

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential

spark.hadoop.fs.adl.account.example1.oauth2.client.id <client-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token

spark.hadoop.fs.adl.account.example2.oauth2.client.id <client-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token
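
With per-account credentials configured, reads against both accounts can run in the same Spark session, for example (a sketch; the paths are illustrative):

Python
# Each read resolves credentials from the fs.adl.account.<account-name>.* keys set above.
df1 = spark.read.parquet("adl://example1.azuredatalakestore.net/<directory-name>")
df2 = spark.read.parquet("adl://example2.azuredatalakestore.net/<directory-name>")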

The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.

ADLS Gen1 service principal notebook