Azure Data Lake Store

Note

Databricks Runtime 3.1 and above provide built-in support for Azure Blob Storage and Azure Data Lake Store.

Accessing Azure Data Lake Stores Directly

To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}")
spark.conf.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")

If you do not already have service credentials, you can follow the instructions in Create service principal with portal. If you do not know your directory (tenant) ID, you can find it in the Azure portal under Azure Active Directory > Properties.

After providing credentials, you can read from Data Lake Store using standard APIs:

val df = spark.read.parquet("adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net/{YOUR DIRECTORY NAME}")
dbutils.fs.ls("adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net/{YOUR DIRECTORY NAME}")

Note that Data Lake Store provides directory-level access control, so the service principal must have access both to the directories that you want to read from and to the Data Lake Store resource itself.

Note

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that, while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Data Lake Store, you must set the credentials using one of the following methods:

  • Specify the Hadoop credential configuration options as Spark options when you create the cluster.

    You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to tell Spark to propagate them to the Hadoop configurations that are used for your RDD jobs:

    spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
    spark.hadoop.dfs.adls.oauth2.client.id {YOUR SERVICE CLIENT ID}
    spark.hadoop.dfs.adls.oauth2.credential {YOUR SERVICE CREDENTIALS}
    spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token
    
  • For Scala users, you can also set the credentials directly on spark.sparkContext.hadoopConfiguration:

    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")
    

Warning

In either case, the credentials you set here are available to all notebooks and JDBC connections.
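
With the credentials in place at the Hadoop configuration level by either method above, the RDD API can read adl:// paths directly. A minimal sketch in Scala; the directory and file names are placeholders:

// This read only succeeds if the OAuth options are already in the Hadoop configuration
// (set as cluster Spark options or on spark.sparkContext.hadoopConfiguration).
val lines = spark.sparkContext.textFile(
  "adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net/{YOUR DIRECTORY NAME}/{YOUR FILE NAME}")
lines.take(10).foreach(println)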

Mounting Azure Data Lake Stores with DBFS

In addition to accessing an Azure Data Lake Store directly, you can also mount a Data Lake Store or a folder inside it through the Databricks File System (DBFS). This gives all users in the same workspace the ability to access the Data Lake Store or folder through the mount point. DBFS uses the credentials you provide when you create the mount point to access the mounted Data Lake Store.

Warning

You should only create a mount point if you want all users in the Databricks workspace to have access to the mounted Data Lake Store. The service client that you use to access the Data Lake Store should be granted access only to that Data Lake Store; it should not be granted access to other resources in Azure.

Note

You can mount Data Lake Stores using Databricks Runtime 4.0 or higher. Once a Data Lake Store is mounted, you can use Runtime 3.4 or higher to access the mount point.

Note

Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, users must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
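
For example, in a notebook attached to the other running cluster:

dbutils.fs.refreshMounts()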

To mount a Data Lake Store or a folder inside it, you can use the following command:

  • Scala version

    val configs = Map(
      "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
      "dfs.adls.oauth2.client.id" -> "{YOUR SERVICE CLIENT ID}",
      "dfs.adls.oauth2.credential" -> "{YOUR SERVICE CREDENTIALS}",
      "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")
    dbutils.fs.mount(
      source = "adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net/{YOUR DIRECTORY NAME}",
      mountPoint = "{mountPointPath}",
      extraConfigs = configs)
    
  • Python version

    configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
               "dfs.adls.oauth2.client.id": "{YOUR SERVICE CLIENT ID}",
               "dfs.adls.oauth2.credential": "{YOUR SERVICE CREDENTIALS}",
               "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token"}
    dbutils.fs.mount(
      source = "adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net/{YOUR DIRECTORY NAME}",
      mount_point = "{mountPointPath}",
      extra_configs = configs)
    

where

  • {mountPointPath} is the DBFS path where the Data Lake Store or the folder inside it (specified in source) will be mounted. This path must be under /mnt.
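
After the mount is created, you can read through the mount point like any other DBFS path. A minimal sketch; {mountPointPath} is the same path used when creating the mount, and the folder name is a placeholder:

// Paths under the mount point resolve to the mounted Data Lake Store directory.
val df = spark.read.parquet("dbfs:{mountPointPath}/{YOUR FOLDER NAME}")
dbutils.fs.ls("dbfs:{mountPointPath}")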

Unmounting

To unmount a mount point, use the following command:

dbutils.fs.unmount("{mountPointPath}")