Azure Blob Storage

Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. Common uses of Blob storage include:

  • Serving images or documents directly to a browser
  • Storing files for distributed access
  • Streaming video and audio
  • Storing data for backup and restore, disaster recovery, and archiving
  • Storing data for analysis by an on-premises or Azure-hosted service

Note

Databricks Runtime 3.1 and above provide built-in support for Azure Blob storage and Azure Data Lake Store.

This topic explains how to access Azure Blob storage either by mounting the storage through DBFS or by accessing it directly through the Spark APIs.

Requirements

Data can be read from public storage accounts without any additional settings. To read data from a private storage account, you must configure a Shared Key or a Shared Access Signature (SAS). To use credentials safely in Databricks, we recommend that you follow the Secrets user guide.
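
For example, a minimal Python sketch of reading a storage account access key from a secret scope and passing it to one of the configurations described later in this topic; the secret scope and key names here are hypothetical placeholders:

# Hypothetical secret scope and key names; create them with the Databricks Secrets CLI or API
storage_account_key = dbutils.secrets.get(scope = "<your-secret-scope>", key = "<your-storage-account-key-name>")

spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
  storage_account_key)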

Mount Azure Blob storage containers with DBFS

You can mount a Blob storage container or a folder inside a container through the Databricks File System (DBFS). The mount gives all users in the same workspace access to the Blob storage container or the folder inside the container through the mount point.

DBFS uses the credential that you provide when you create the mount point to access the mounted Blob storage container. If a Blob storage container is mounted using a storage account access key, DBFS uses temporary SAS tokens derived from the storage account key when it accesses this mount point.

Mount an Azure Blob storage container

You can mount Blob storage containers using Databricks Runtime 4.0 or higher. Once a Blob storage container is mounted, you can use Runtime 3.4 or higher to access the mount point.

Important

  • You should create a mount point only if you want all users in the Databricks workspace to have access to the mounted Blob storage container.
  • Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, users must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.

  1. To mount a Blob storage container or a folder inside a container, use the following command:

    Scala
    dbutils.fs.mount(
      source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>",
      mountPoint = "/mnt/<mount-name>",
      extraConfigs = Map("<conf-key>" -> "<conf-value>"))
    
    Python
    dbutils.fs.mount(
      source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>",
      mount_point = "/mnt/<mount-name>",
      extra_configs = {"<conf-key>": "<conf-value>"})
    

    where

    • /mnt/<mount-name> is the DBFS path where the Blob storage container or the folder inside the container (specified in source) will be mounted.
    • <conf-key> and <conf-value> specify the credentials used to access the mount point. <conf-key> can be either fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net or fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net. (A complete example that reads the credential from a secret follows these steps.)
  2. Access files in your container as if they were local files, for example:

    Scala
    // Read a file through the mount point path
    val df = spark.read.text("/mnt/<mount-name>/...")
    // Or, equivalently, through the dbfs: scheme
    val df = spark.read.text("dbfs:/mnt/<mount-name>/...")
    
    Python
    # Read a file through the mount point path
    df = spark.read.text("/mnt/<mount-name>/...")
    # Or, equivalently, through the dbfs: scheme
    df = spark.read.text("dbfs:/mnt/<mount-name>/...")
    
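As a complete illustration of step 1, the following Python sketch mounts a container using a SAS token read from a secret scope; the secret scope, key name, and mount name are hypothetical placeholders:

# Hypothetical secret scope and key; the secret holds the complete SAS query string for the container
sas_token = dbutils.secrets.get(scope = "<your-secret-scope>", key = "<your-sas-key-name>")

dbutils.fs.mount(
  source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net": sas_token})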

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")
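
If you are not sure whether a path is currently mounted, you can list the existing mount points with dbutils.fs.mounts() before unmounting. A minimal Python sketch (the mount name is a placeholder):

# dbutils.fs.mounts() lists the current DBFS mounts; unmount only if the path is present
if any(m.mountPoint == "/mnt/<mount-name>" for m in dbutils.fs.mounts()):
  dbutils.fs.unmount("/mnt/<mount-name>")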

Access Azure Blob storage directly

This section explains how to access Azure Blob storage using the Spark DataFrame and RDD APIs.

Access Azure Blob storage using the DataFrame API

To read data from Azure Blob storage using the Spark and Databricks APIs, first set up credentials with one of the following options:

  • Set up an account access key:

    spark.conf.set(
      "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
      "<your-storage-account-access-key>")
    
  • Set up a SAS for a given container:

    spark.conf.set(
      "fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net",
      "<complete-query-string-of-your-sas-for-the-container>")
    

Once an account access key or a SAS is set up in your notebook, you can use standard Spark and Databricks APIs to read from the storage account:

val df = spark.read.parquet("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")

dbutils.fs.ls("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
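
Writing goes through the same configuration, so you can also persist a DataFrame back to the container. A minimal Python sketch, assuming the account key or SAS has already been set with spark.conf.set as shown above (the output directory name is a placeholder):

# Assumes the fs.azure credential for this account or container is already configured in this notebook
df = spark.read.parquet("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
df.write.mode("overwrite").parquet("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-output-directory-name>")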

Access Azure Blob storage using the RDD API

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. If you are using the RDD API to read from Azure Blob storage, you must set the credentials using one of the following methods:

  • Specify the Hadoop credential configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to tell Spark to propagate them to the Hadoop configurations that are used for your RDD jobs:

    # Using an account access key
    spark.hadoop.fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net <your-storage-account-access-key>
    
    # Using a SAS token
    spark.hadoop.fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net <complete-query-string-of-your-sas-for-the-container>
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    // Using an account access key
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
      "<your-storage-account-access-key>"
    )
    
    // Using a SAS token
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net",
      "<complete-query-string-of-your-sas-for-the-container>"
    )
    
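With the credentials propagated in either of these ways, RDD reads work against the same wasbs:// URI. A minimal Python sketch, assuming the Hadoop credential options were set as cluster Spark options as described above:

# Assumes the fs.azure credentials were supplied as cluster Spark options with the spark.hadoop. prefix
rdd = spark.sparkContext.textFile(
  "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
print(rdd.count())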

Warning

These credentials are available to all users who access the cluster.