Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a next-generation data lake solution for big data analytics. It builds the capabilities of Azure Data Lake Storage Gen1, such as file system semantics, file-level security, and scale, into Azure Blob Storage, with its low-cost tiered storage, high availability, and disaster recovery features.

Note

  • Azure Data Lake Storage Gen2 is in Public Preview.
  • Databricks integration with Azure Data Lake Storage Gen2 is fully supported in Databricks Runtime 5.1 and above. If you have been using Azure Data Lake Storage Gen2 with Databricks Runtime 4.2 to 5.0, you can continue to do so, but support is limited and we strongly recommend that you upgrade your clusters to Databricks Runtime 5.1.

This topic explains how to access Azure Data Lake Storage Gen2 using the Azure Blob File System (ABFS) driver built into the Databricks Runtime.

Requirements

  • Databricks Runtime 5.1 or above. If you have been using Azure Data Lake Storage Gen2 with Databricks Runtime 4.2 to 5.0, you can continue to do so, but support is limited and we strongly recommend that you upgrade your clusters to Databricks Runtime 5.1.
  • For OAuth 2.0 access, supported in Databricks Runtime 5.1 and above, you must have a service principal. If you do not already have one, follow the instructions in Create service principal with portal to create it, and assign it the “Storage Blob Data Contributor” role on your storage account. If you do not know your-directory-id (also referred to as the tenant ID in Azure Active Directory), follow the instructions in Get tenant ID. To use credentials safely in Databricks, we recommend that you follow the Secrets user guide; a short sketch of working with secrets follows this list.
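
If you store the service credential (and your storage account access key) in a secret scope as recommended above, you can verify from a notebook that the scope and key are visible before wiring them into Spark configurations. The following is a minimal sketch; <scope-name> and <key-name> are placeholders for the names you chose when creating the secret:

// Secret scopes visible to this workspace
val scopes = dbutils.secrets.listScopes()
// Keys stored in your scope
val keys = dbutils.secrets.list("<scope-name>")
// The secret value is redacted if you try to print it, but it can be passed to Spark configs
val serviceCredential = dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")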

Create an Azure Data Lake Storage Gen2 account and initialize a filesystem

When you create your Azure Data Lake Storage Gen2 storage account, you must:

  1. Note the access key for the storage account.

  2. Enable the hierarchical namespace, which provides improved filesystem performance, POSIX ACLs, and filesystem semantics that are familiar to analytics engines and frameworks.

    Important

    • When the hierarchical namespace is enabled for an Azure Data Lake Storage Gen2 account, you do not need to create any Blob containers through the Azure Portal.
    • When the hierarchical namespace is enabled, Azure Blob Storage APIs are not available. See this Known issue description. For example, you cannot use the wasb or wasbs scheme to access the blob.core.windows.net endpoint.
    • While Azure Data Lake Storage Gen2 remains in Preview, if you enable the hierarchical namespace there is no interoperability of data or operations between Azure Blob Storage and Data Lake Storage Gen2 REST APIs.

Once the hierarchical namespace is enabled for your storage account, you must initialize a filesystem before you can access it. To do so, you typically enter the following in the first cell of the notebook (with your account values):

// You can also use OAuth 2.0 with Databricks Runtime 5.1 or above to authenticate this filesystem initialization.
// Set the storage account access key, retrieved from a secret scope.
spark.conf.set("fs.azure.account.key.<account_name>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))
// Allow ABFS to create the remote filesystem during initialization if it does not already exist.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
// Listing the filesystem root initializes (creates) the filesystem.
dbutils.fs.ls("abfss://<file_system>@<account_name>.dfs.core.windows.net/")
// Turn automatic filesystem creation back off once the filesystem exists.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your storage account access key that has been stored as a secret in a secret scope.
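
If you prefer to authenticate this initialization step with OAuth 2.0 (Databricks Runtime 5.1 or above) instead of the account access key, a sketch of the equivalent session configuration follows; it combines the OAuth settings described later in this topic with the same initialization steps, and the service principal values are placeholders (see Requirements):

spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<your-service-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<file_system>@<account_name>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")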

Mount an Azure Data Lake Storage Gen2 filesystem with DBFS

You can mount an Azure Data Lake Storage Gen2 filesystem using Databricks Runtime 5.1 or above. Mounting with DBFS requires that you use OAuth 2.0 for authentication. See Requirements for details about creating a service principal to support OAuth.

The mount is a pointer to data lake storage, so the data is never synced locally.

  1. To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside it, use the following command:

    Scala
    val configs = Map(
      "fs.azure.account.auth.type" -> "OAuth",
      "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id" -> "<your-service-client-id>",
      "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"),
      "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
    
    // Optionally, you can add <your-directory-name> to the source URI of your mount point.
    dbutils.fs.mount(
      source = "abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/",
      mountPoint = "/mnt/<mount-name>",
      extraConfigs = configs)
    
    Python
    configs = {"fs.azure.account.auth.type": "OAuth",
               "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
               "fs.azure.account.oauth2.client.id": "<your-service-client-id>",
               "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"),
               "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
    
    # Optionally, you can add <your-directory-name> to the source URI of your mount point.
    dbutils.fs.mount(
      source = "abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/",
      mount_point = "/mnt/<mount-name>",
      extra_configs = configs)
    

    where

    • <mount-name> is a DBFS path that represents where the Azure Data Lake Storage Gen2 filesystem, or a folder inside it (specified in source), will be mounted in DBFS.
    • dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your service credential that has been stored as a secret in a secret scope.
  2. Access files in your Azure Data Lake Storage Gen2 filesystem as if they were files in DBFS; for example (see also the sketch after this list):

    Scala
    val df = spark.read.text("/mnt/<mount-name>/....")
    val df = spark.read.text("dbfs:/<mount-name>/....")
    
    Python
    df = spark.read.text("/mnt/%s/...." % <mount-name>)
    df = spark.read.text("dbfs:/<mount-name>/....")
    

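A quick way to confirm that the mount works is to list its contents with dbutils.fs.ls; you can also write through the mount point, and the data lands directly in your Azure Data Lake Storage Gen2 filesystem. A minimal sketch, where <mount-name> and the output directory are placeholders:

// List the mounted filesystem
dbutils.fs.ls("/mnt/<mount-name>")
// Write a DataFrame through the mount point; the files are stored in ADLS Gen2, not locally
df.write.parquet("/mnt/<mount-name>/<your-directory-name>/output")
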
Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")
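
If you are not sure of the mount name, dbutils.fs.mounts() lists all current DBFS mount points and their sources; you can run it before and after unmounting to confirm the mount is gone. A brief sketch:

dbutils.fs.mounts().foreach(m => println(s"${m.mountPoint} -> ${m.source}"))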

Access an Azure Data Lake Storage Gen2 account directly

To access an Azure Data Lake Storage Gen2 storage account directly (without mounting it to DBFS), we recommend that you set your account credentials in your notebook’s session configs.

There are two kinds of credentials that you can use:

  • Account access key:

    spark.conf.set(
      "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
      dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))
    

    where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your storage account access key that has been stored as a secret in a secret scope.

  • OAuth 2.0 (see Requirements for details about creating a service principal to support OAuth):

    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", "<your-service-client-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))
    spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
    

    where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your service credential that has been stored as a secret in a secret scope.

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Data Lake Storage Gen2, you must set the credentials using one of the following methods:

  • Specify the Hadoop configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations that are used for your RDD jobs:

    # Using an account access key
    spark.hadoop.fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net <your-storage-account-access-key>
    
    # Or, using OAuth 2 with Databricks Runtime 5.1 or above
    spark.hadoop.fs.azure.account.auth.type OAuth
    spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    spark.hadoop.fs.azure.account.oauth2.client.id <your-service-client-id>
    spark.hadoop.fs.azure.account.oauth2.client.secret <your-service-credentials>
    spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<your-directory-id>/oauth2/token
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    // Using an account access key
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
      dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-your-storage-account-access-key>")
    )
    
    // Or, using OAuth 2 with Databricks Runtime 5.1 or above
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.auth.type", "OAuth")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth.provider.type",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.id", "<your-service-client-id>")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-your-service-credential>"))
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
    

Warning

The credentials set in the Hadoop configuration are available to all users who access the cluster.

Once this is set up, you can use standard Spark and Databricks APIs to read from the storage account. For example,

val df = spark.read.parquet("abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")

dbutils.fs.ls("abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")

Frequently asked questions (FAQ)

Does ABFS support Shared Access Signature (SAS) token authentication?
ABFS does not support SAS token authentication, but the Azure Data Lake Storage Gen2 service itself does support SAS keys.
Can I use the abfs scheme to access Azure Data Lake Storage Gen2?
Yes. However, we recommend that you use the abfss scheme, which uses SSL-encrypted access, wherever possible.
When I access an Azure Data Lake Storage Gen2 account with the hierarchical namespace enabled, I get a java.io.FileNotFoundException error, and the error message includes FilesystemNotFound.

If the error message includes the following information, it is because your command is trying to access a Blob Storage container created through the Azure Portal.

StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.

When the hierarchical namespace is enabled, you do not need to create containers through the Azure Portal. If you see this issue, delete the Blob container through the Azure Portal. After a few minutes, you will be able to access the container. Alternatively, you can change your abfss URI to use a different container, as long as that container was not created through the Azure Portal.

When I use Databricks Runtime 4.2, 4.3, or 5.0, Azure Data Lake Storage Gen2 fails to list a directory that has lots of files.

With these runtimes, Azure Data Lake Storage Gen2 has a known issue that causes it to fail to list a directory that has more than 5000 files or sub-directories. The error message should look like this:

java.io.IOException: GET https://....dfs.core.windows.net/...?resource=filesystem&maxResults=5000&directory=...&timeout=90&recursive=false
StatusCode=403
StatusDescription=Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
ErrorCode=AuthenticationFailed
ErrorMessage=Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:...
Time:...