Azure Data Lake Storage Gen2

Note

Azure Data Lake Storage Gen2 support is in Public Preview.

Note

Databricks Runtime 4.2 and above provide built-in support for Azure Data Lake Storage Gen2.

This topic explains how to access Azure Data Lake Storage Gen2 using the ABFS driver built into Databricks Runtime.

Requirements

An Azure Data Lake Storage Gen2 storage account and the access key for the storage account.

Access Azure Data Lake Storage Gen2 directly

To set the credentials for an Azure Storage account used with Azure Data Lake Storage Gen2, we recommend that you set them in the session configuration of your notebook:

spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
  "<your-storage-account-access-key>")

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Data Lake Storage Gen2, you must set the credentials using one of the following methods:

  • Specify the Hadoop configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations that are used for your RDD jobs:

    # Using an account access key
    spark.hadoop.fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net <your-storage-account-access-key>
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    // Using an account access key
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
      "<your-storage-account-access-key>"
    )
    

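For example, once the credentials are propagated through the Hadoop configuration, the RDD API can read from the account. A minimal Python sketch (the container, account, and file names are placeholders):

# The RDD API picks up credentials from the Hadoop configuration,
# not from values set with spark.conf.set(...)
rdd = spark.sparkContext.textFile(
  "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-file-name>")
rdd.count()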
Warning

The credentials set in the Hadoop configuration are available to all users who access the cluster.

Once an account access key is set up, you can use standard Spark and Databricks APIs to read from the storage account. For example:

val df = spark.read.parquet("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")

dbutils.fs.ls("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
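The same APIs are available from Python. As a sketch, reading and writing look like this (placeholder names as above; <your-output-directory-name> is a hypothetical destination):

df = spark.read.parquet("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")

# Writing back to the account works the same way
df.write.parquet("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-output-directory-name>")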

Frequently asked questions (FAQ)

Can I create a mount point for Azure Data Lake Storage Gen2?
Mount points for Azure Data Lake Storage Gen2 are not supported.
Does ABFS support Shared Access Signature (SAS) token authentication?
SAS token authentication is not supported.
Can I use the abfs scheme to access Azure Data Lake Storage Gen2?
Yes. However, we recommend that you use the abfss scheme, which uses SSL-encrypted access, wherever possible.
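For reference, the two schemes differ only in the URI prefix (placeholder names as above):

# Unencrypted access
abfs://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>

# SSL-encrypted access (recommended)
abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>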