Azure storage services

Note

Databricks Runtime 3.1 and above provide built-in support for accessing Azure Blob Storage and Azure Data Lake Store. Please use the latest Databricks Runtime version to take advantage of new improvements and bug fixes.

Azure Blob Storage

Data can be read from Azure Blob Storage using the Hadoop FileSystem interface. Public storage accounts can be read without any additional settings, as in the example below. To read data from a private storage account, you need to set an account key or a Shared Access Signature (SAS) in your notebook using one of the options that follow.
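
For a public storage account, a read like the following works without any configuration (a minimal sketch; the container, account, and directory names are placeholders):

// Read Parquet data from a public container; no account key or SAS is required.
val publicDf = spark.read.parquet("wasbs://{PUBLIC CONTAINER NAME}@{PUBLIC STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}")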

  • Setting up an account key:

    spark.conf.set(
      "fs.azure.account.key.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net",
      "{YOUR STORAGE ACCOUNT ACCESS KEY}")
    
  • Setting up a SAS for a given container:

    spark.conf.set(
      "fs.azure.sas.{YOUR CONTAINER NAME}.{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net",
      "{YOUR SAS FOR THE GIVEN CONTAINER}")
    

Once an account key or a SAS is set up in your notebook, you can use standard Spark and Databricks APIs to read from the storage account:

val df = spark.read.parquet("wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}")
dbutils.fs.ls("wasbs://{YOUR CONTAINER NAME}@{YOUR STORAGE ACCOUNT NAME}.blob.core.windows.net/{YOUR DIRECTORY NAME}")

Azure Data Lake Store

To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}")
spark.conf.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.windows.net/{YOUR DIRECTORY ID}/oauth2/token")

If you do not already have service credentials, you can create them by following the instructions in Create service principal with portal. If you do not know your directory ID, you can follow these instructions to find it.

After providing credentials, you can read from Data Lake Store using standard APIs:

val df = spark.read.parquet("adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net/{YOUR DIRECTORY NAME}")
dbutils.fs.ls("adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net/{YOUR DIRECTORY NAME}")

Note that Data Lake Store provides directory-level access control, so the service principal must have access to both the Data Lake Store resource and the directories that you want to read from.