Azure storage services

Azure Blob Storage

Data can be read from Azure Blob Storage using the Hadoop FileSystem interface. Beginning with Spark 2.2, the libraries and configuration needed to read from Blob Storage are included in the Databricks Runtime. Data in public storage accounts can be read without any additional settings. To read data from a private storage account, you need to set the account key in your code:

sc.hadoopConfiguration.set("fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME HERE}.blob.core.windows.net", "{YOUR STORAGE ACCOUNT ACCESS KEY HERE}")

You can then use standard Spark and Databricks APIs to read from the storage account:

val df = spark.read.parquet("wasbs://my-container@my-storage-account.blob.core.windows.net/parquet-table")
dbutils.fs.ls("wasbs://my-container@my-storage-account.blob.core.windows.net/some-directory")
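
For example, once the account key is set you can read a table and write results back to the same storage account. The container, account, and paths below are placeholders; substitute your own:

// Placeholder container, account, and paths; substitute your own
val events = spark.read.parquet("wasbs://my-container@my-storage-account.blob.core.windows.net/parquet-table")
events.write.mode("overwrite").parquet("wasbs://my-container@my-storage-account.blob.core.windows.net/parquet-table-copy")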

Azure Data Lake Store

Databricks Runtime versions based on Spark 2.2 or later are also configured to read from Azure Data Lake Store; the necessary libraries and Hadoop configuration come pre-installed. To read from your Data Lake Store account, configure Spark to use your service credentials with the following snippet:

sc.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
sc.hadoopConfiguration.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}")
sc.hadoopConfiguration.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}")
sc.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", "https://login.windows.net/{YOUR DIRECTORY ID}/oauth/token")

If you do not already have service credentials, you can follow these instructions: Create service principal with portal.

If you do not know your directory ID (it is the same as your Azure Active Directory tenant ID), you can follow these instructions to look it up.
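
If you prefer not to paste the credential directly into a notebook, you can store it in a Databricks secret scope and look it up at runtime. This is a minimal sketch assuming a secret scope named "azure" with a key named "adls-credential", both of which you would create beforehand:

// Assumes a pre-created secret scope "azure" containing the key "adls-credential"
val credential = dbutils.secrets.get("azure", "adls-credential")
sc.hadoopConfiguration.set("dfs.adls.oauth2.credential", credential)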

After providing credentials, you can read from Data Lake Store using standard APIs:

val df = spark.read.parquet("adl://my-datalake-account.azuredatalakestore.net/")
dbutils.fs.ls("adl://my-datalake-account.azuredatalakestore.net/")

Note that Data Lake Store provides directory-level access control, so the service principal must have access both to the directories you want to read and to the Data Lake Store resource itself.
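
The same caveat applies to writes: as long as the service principal has write access to the target directory, you can persist results with the standard DataFrame writer. The output path below is illustrative:

// Illustrative output path; the service principal needs write access to this directory
df.write.mode("overwrite").parquet("adl://my-datalake-account.azuredatalakestore.net/output/parquet-table")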

Installing Azure libraries

For Databricks Runtime versions earlier than Spark 2.2, you will need to install Azure data access libraries and configure Spark to use them.

Hadoop ships a JAR for Azure connectivity that you can use to connect to Azure Blob Storage. To use it, first create a new library in Databricks by searching Maven Central for “hadoop-azure” and selecting the latest 2.*.* version that matches the version of Hadoop your cluster is running.
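
For example, a coordinate along these lines would work; the version shown is illustrative and should be matched to the Hadoop version your cluster runs:

org.apache.hadoop:hadoop-azure:2.7.3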

Once you’ve added that library, attach it to any cluster running Hadoop 2. Keep in mind that the library pulls in many dependencies, which may conflict with Apache Spark’s dependencies, but you can always add exclusions in the library’s advanced options.
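
For instance, if a transitive logging dependency conflicts with the cluster’s, you might list a groupId:artifactId pair such as the following in the library’s exclusions (illustrative only):

org.slf4j:slf4j-log4j12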

Once the library is attached to your cluster, you’ll need to set the Hadoop configuration and access credentials for your storage account. In a notebook cell, enter the following:

sc.hadoopConfiguration.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc.hadoopConfiguration.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasb")
sc.hadoopConfiguration.set("fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME HERE}.blob.core.windows.net", "{YOUR STORAGE ACCOUNT ACCESS KEY HERE}")

Once that’s done you can reference Azure storage, but you’ll want to use the “wasbs” scheme for secure transfer, since the data travels across the public internet. Also keep in mind that performance will probably not be great, because the data is crossing data centers.
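
As a quick sanity check that the configuration took effect, you can read a small file over wasbs; the file path below is a placeholder:

// Placeholder path; any small text file in the container will do
val sample = sc.textFile("wasbs://my-container@my-storage-account.blob.core.windows.net/some-file.txt")
sample.take(5).foreach(println)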

Remember that if you query or pull data out of that storage account, you will be paying data transfer (egress) fees for the data leaving the Azure region.