Azure Data Lake Storage Gen1

Azure Data Lake Store is an enterprise-wide, hyper-scale repository for big data analytic workloads. It enables you to capture data of any size, type, and ingestion speed in a single place for operational and exploratory analytics, and it is specifically designed and tuned for performance in data analytics scenarios.

This topic explains how to access Azure Data Lake Store, either by mounting it with DBFS or by accessing it directly through the Spark APIs.

Requirements

  • Azure access credentials. If you do not already have service credentials, you can follow the instructions in Create service principal with portal. If you do not know your-directory-id (also referred to as the tenant ID in Azure), you can follow the instructions in Get tenant ID. To use credentials safely in Databricks, we recommend that you follow the Secrets user guide; a minimal example follows.
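
For example, if you store the service credential in a Databricks secret scope, you can retrieve it at runtime with dbutils.secrets.get instead of pasting it into a notebook. This is a minimal Scala sketch; the scope and key names are placeholders, and the returned value can be used wherever <your-service-credentials> appears in the examples below:

// Retrieve the service credential from a secret scope (scope and key names are placeholders)
val credential = dbutils.secrets.get(scope = "<scope-name>", key = "<service-credential-key>")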

Mount Azure Data Lake Store with DBFS

In addition to accessing Azure Data Lake Store directly, you can also mount a Data Lake Store or a folder inside it through Databricks File System (DBFS). The mount is a pointer to a Data Lake Store, so the data is never synced locally.

Important

  • You should create a mount point only if you want all users in the Databricks workspace to have access to the mounted Data Lake Store. The service client that you use to access the Data Lake Store should be granted access only to that Data Lake Store; it should not be granted access to other resources in Azure.
  • Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, users must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
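
For example, to make a newly created mount point available on another running cluster, a user can run the following in a notebook attached to that cluster:

dbutils.fs.refreshMounts()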

DBFS uses the credential you provide when you create the mount point to access the mounted Azure Data Lake Store.

Mount a Data Lake Store

You can mount a Data Lake Store using Databricks Runtime 4.0 or higher. Once a Data Lake Store is mounted, you can use Databricks Runtime 3.4 or higher to access the mount point.

  1. To mount a Data Lake Store or a folder inside it, use the following command:

    Scala
    val configs = Map(
      "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
      "dfs.adls.oauth2.client.id" -> "<your-service-client-id>",
      "dfs.adls.oauth2.credential" -> "<your-service-credentials>",
      "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
    
    dbutils.fs.mount(
      source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>",
      mountPoint = "/mnt/<mount-name>",
      extraConfigs = configs)
    
    Python
    configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
               "dfs.adls.oauth2.client.id": "<your-service-client-id>",
               "dfs.adls.oauth2.credential": "<your-service-credentials>",
               "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
    
    dbutils.fs.mount(
      source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>",
      mount_point = "/mnt/<mount-name>",
      extra_configs = configs)
    

    where /mnt/<mount-name> is the DBFS path at which the Data Lake Store, or the folder inside it specified in source, is mounted.

  2. Access files in your Data Lake Store as if they were local files, for example:

    Scala
    val df = spark.read.text("/mnt/<mount-name>/....")
    val df = spark.read.text("dbfs:/<mount-name>/....")
    
    Python
    df = spark.read.text("/mnt/%s/...." % <mount-name>)
    df = spark.read.text("dbfs:/<mount-name>/....")
    

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")

Access Azure Data Lake Store directly

This section explains how to access Azure Data Lake Store using the Spark DataFrame and RDD APIs.

Access Azure Data Lake Store using the DataFrame API

To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "<your-service-client-id>")
spark.conf.set("dfs.adls.oauth2.credential", "<your-service-credentials>")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")

After providing credentials, you can read from Data Lake Store using Spark and Databricks APIs:

val df = spark.read.parquet("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")

dbutils.fs.ls("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")

Azure Data Lake Store provides directory-level access control, so the service principal must have access to both the Data Lake Store resource and the directories that you want to read from.

Access Azure Data Lake Store using the RDD API

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Data Lake Store, you must set the credentials using one of the following methods:

  • Specify the Hadoop credential configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used for your RDD jobs:

    spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
    spark.hadoop.dfs.adls.oauth2.client.id <your-service-client-id>
    spark.hadoop.dfs.adls.oauth2.credential <your-service-credentials>
    spark.hadoop.dfs.adls.oauth2.refresh.url "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.client.id", "<your-service-client-id>")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.credential", "<your-service-credentials>")
    spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
    

Warning

These credentials are available to all users who access the cluster.
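
Once the credentials are propagated to the Hadoop configuration by one of the methods above, you can read from Data Lake Store with the RDD API. The following is a minimal Scala sketch; the path placeholders follow the same conventions as the examples above:

// Read text files from Data Lake Store as an RDD
val rdd = spark.sparkContext.textFile("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
rdd.take(10).foreach(println)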

Access Azure Data Lake Store through the metastore

To access adl:// locations specified in the metastore, you must specify Hadoop credential configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used by the metastore:

spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
spark.hadoop.dfs.adls.oauth2.client.id <your-service-client-id>
spark.hadoop.dfs.adls.oauth2.credential <your-service-credentials>
spark.hadoop.dfs.adls.oauth2.refresh.url "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"

Warning

These credentials are available to all users who access the cluster.
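
For example, once a cluster is configured this way, queries against a metastore table whose location points at Data Lake Store resolve the adl:// path with these credentials. The following Scala sketch registers and reads such a table; the table name is hypothetical, and the location placeholders follow the conventions above:

// Register an external table over existing data in Data Lake Store (hypothetical table name)
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_adls_table
  USING parquet
  LOCATION 'adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>'
""")

// Read the table through the metastore
spark.table("my_adls_table").show()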