Azure Data Lake Storage Gen2

Note

Azure Data Lake Storage Gen2 support is in Public Preview.

Note

Databricks Runtime 4.2 and above provide built-in support for Azure Data Lake Storage Gen2.

This topic explains how to access Azure Data Lake Storage Gen2 using the ABFS driver built into Databricks Runtime.

Requirements

An Azure Data Lake Storage Gen2 storage account and the access key for the storage account.

Access Azure Data Lake Storage Gen2 directly

To set up credentials for an Azure Storage account with Azure Data Lake Storage Gen2, we recommend that you set the account access key in the session configuration of your notebook:

spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
  "<your-storage-account-access-key>")

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Data Lake Storage Gen2, you must set the credentials using one of the following methods:

  • Specify the Hadoop configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations that are used for your RDD jobs:

    # Using an account access key
    spark.hadoop.fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net <your-storage-account-access-key>
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    // Using an account access key
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
      "<your-storage-account-access-key>"
    )
    

Warning

The credentials set in the Hadoop configuration are available to all users who access the cluster.
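
Once the credentials are in the Hadoop configuration, the RDD API can read from the account. As a minimal sketch (the file name data.txt is a hypothetical placeholder):

// Read a text file with the RDD API; this works only after the credentials
// have been set in the Hadoop configuration as described above.
val rdd = spark.sparkContext.textFile(
  "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>/data.txt")
rdd.take(10).foreach(println)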

Once an account access key is set up, you can use standard Spark and Databricks APIs to read from the storage account. For example,

val df = spark.read.parquet("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")

dbutils.fs.ls("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")

Hierarchical Namespace

Azure Data Lake Storage Gen2 has a hierarchical namespace, which provides improved performance and a familiar file system experience. To take advantage of the hierarchical namespace, you must enable it when creating the Azure Storage account for Azure Data Lake Storage Gen2.

Important

  • When the hierarchical namespace is enabled for an Azure Data Lake Storage Gen2 account, you do not need to create any Blob containers through the Azure Portal.
  • If you enable the hierarchical namespace, there is no interoperability of data or operations between the Blob and Data Lake Storage Gen2 REST APIs during the public preview.

Once the hierarchical namespace is enabled for a storage account, set fs.azure.createRemoteFileSystemDuringInitialization to true. In a notebook, you can set this configuration by running the command:

spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")

You can also set spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization to true in the Spark configuration properties field on the cluster creation page.
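
As a minimal sketch of the end-to-end flow, assuming the hierarchical namespace is enabled and <your-new-container-name> is a hypothetical file system that does not yet exist:

// Allow the ABFS driver to create the file system if it does not exist.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")

// The first access initializes the driver and creates the file system remotely.
dbutils.fs.ls("abfss://<your-new-container-name>@<your-storage-account-name>.dfs.core.windows.net/")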

Frequently asked questions (FAQ)

Can I create a mount point for Azure Data Lake Storage Gen2?
Mount points for Azure Data Lake Storage Gen2 are not supported.
Does ABFS support Shared Access Signature (SAS) token authentication?
SAS token authentication is not supported.
Can I use the abfs scheme to access Azure Data Lake Storage Gen2?
Yes. However, we recommend that you use the abfss scheme, which uses SSL-encrypted access, wherever possible.
While accessing an Azure Data Lake Storage Gen2 account with the hierarchical namespace enabled, I get a java.io.FileNotFoundException and the error message mentions FilesystemNotFound.

If the error message includes the following information, it is because your command is trying to access a Blob container created through the Azure Portal:

StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.

When the hierarchical namespace is enabled, you do not need to create containers through the Azure Portal. If you see this issue, delete the Blob container through the Azure Portal. After a few minutes, you will be able to access the container. Alternatively, you can change your abfss URI to point to a different container, as long as that container was not created through the Azure Portal.