Azure Data Lake Store Example (Scala)

This example notebook closely follows the Databricks documentation for how to set up Azure Data Lake Store as a data source in Databricks.

0 - Setup

To get set up, do these tasks first:

  • Get service credentials: Client ID <aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee> and Client Credential <NzQzY2QzYTAtM2I3Zi00NzFmLWI3MGMtMzc4MzRjZmk=>. Follow the instructions in Create service principal with portal.
  • Get directory ID <ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj>: This is also referred to as tenant ID. Follow the instructions in Get tenant ID.
  • If you haven't set up the service app, follow this tutorial. Grant access at the root directory, or at the desired folder level, to the service principal or to everyone.

There are two options to read and write Azure Data Lake data from Azure Databricks:

  1. DBFS mount points
  2. Spark configs

1 - DBFS mount points

DBFS mount points let you mount Azure Data Lake Store for all users in the workspace. Once mounted, the data can be accessed directly via a DBFS path from all clusters, without providing credentials every time. The example below shows how to set up a mount point for Azure Data Lake Store.

// OAuth 2.0 settings for the Azure Data Lake Store service principal (placeholders shown).
val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id" -> "<aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee>",
  "dfs.adls.oauth2.credential" -> "<NzQzY2QzYTAtM2I3Zi00NzFmLWI3MGMtMzc4MzRjZmk=>",
  "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/<ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj>/oauth2/token")

// Mount the Azure Data Lake Store account to DBFS.
dbutils.fs.mount(
  source = "adl://kpadls.azuredatalakestore.net/",
  mountPoint = "/mnt/kp-adls",
  extraConfigs = configs)
%fs ls /mnt/kp-adls
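
Once the mount exists, the data can be read like any other DBFS path, and the mount can be removed when it is no longer needed. The sketch below assumes a hypothetical Parquet dataset under the mount; the sub-path is a placeholder, not data from this example.

// Sketch: read Parquet data through the mount point (the sub-path is a placeholder).
val df = spark.read.parquet("/mnt/kp-adls/testing/some-dataset")
display(df)

// Remove the mount point when it is no longer needed.
dbutils.fs.unmount("/mnt/kp-adls")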

2 - Spark Configs

With Spark configs, the Azure Data Lake Store settings can be specified per notebook. To keep things simple, the example below includes the credentials in plaintext. However, we strongly discourage you from storing secrets in plaintext; instead, we recommend storing the credentials as Databricks Secrets (a sketch of that approach follows the plaintext example).

Note: spark.conf values are visible only to the Dataset and DataFrame APIs. If you need to access them from an RDD, refer to the documentation.
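
If RDD access is needed, one workaround is to mirror the same keys onto the SparkContext's Hadoop configuration, which RDD input formats read from. This is only a sketch, not necessarily the documented recipe referred to in the note above; the values are the same placeholders used in the spark.conf example below.

// Sketch: expose the OAuth settings to RDD jobs via the Hadoop configuration.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hadoopConf.set("dfs.adls.oauth2.client.id", "<aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee>")
hadoopConf.set("dfs.adls.oauth2.credential", "<NzQzY2QzYTAtM2I3Zi00NzFmLWI3MGMtMzc4MzRjZmk=>")
hadoopConf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj>/oauth2/token")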

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "<aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee>")
spark.conf.set("dfs.adls.oauth2.credential", "<NzQzY2QzYTAtM2I3Zi00NzFmLWI3MGMtMzc4MzRjZmk=>")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj>/oauth2/token")
%fs ls adl://kpadls.azuredatalakestore.net/testing/
// Copy a sample dataset from DBFS into the Azure Data Lake Store account.
spark.read.parquet("dbfs:/mnt/my-datasets/datasets/iot/events")
  .write
  .mode("overwrite")
  .parquet("adl://kpadls.azuredatalakestore.net/testing/tmp/kp/v1")
%fs ls adl://kpadls.azuredatalakestore.net/testing/tmp/kp/v1
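
As a quick sanity check, the copied data can be read back from Azure Data Lake Store (a sketch; the path matches the write above).

// Sketch: read back the data written above and count the rows.
val copied = spark.read.parquet("adl://kpadls.azuredatalakestore.net/testing/tmp/kp/v1")
copied.count()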