Azure Blob storage

Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob storage to expose data publicly to the world, or to store application data privately. Common uses of Blob storage include:

  • Serving images or documents directly to a browser
  • Storing files for distributed access
  • Streaming video and audio
  • Storing data for backup and restore, disaster recovery, and archiving
  • Storing data for analysis by an on-premises or Azure-hosted service

This article explains how to access Azure Blob storage by mounting storage using the Databricks File System (DBFS) or directly using APIs.

Requirements

You can read data from public storage accounts without any additional settings. To read data from a private storage account, you must configure a Shared Key or a Shared Access Signature (SAS).

To use credentials safely in Databricks, we recommend that you follow the Secret management user guide, as shown in Mount an Azure Blob storage container.

Mount Azure Blob storage containers to DBFS

You can mount a Blob storage container or a folder inside a container to DBFS. The mount is a pointer to a Blob storage container, so the data is never synced locally.

Important

  • Azure Blob storage supports three blob types: block, append, and page. You can only mount block blobs to DBFS.
  • All users have read and write access to the objects in Blob storage containers mounted to DBFS.
  • Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that cluster to make the newly created mount point available for use (see the sketch below).
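
For example, a minimal sketch (Python) run in a notebook attached to the other cluster:

# python
# Refresh this cluster's view of DBFS mounts so that mount points
# created from other clusters become visible here.
dbutils.fs.refreshMounts()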

DBFS uses the credential that you provide when you create the mount point to access the mounted Blob storage container. If a Blob storage container is mounted using a storage account access key, DBFS uses temporary SAS tokens derived from the storage account key when it accesses this mount point.

Mount an Azure Blob storage container

  1. To mount a Blob storage container or a folder inside a container, use one of the following commands (Python or Scala):

    # python
    dbutils.fs.mount(
      source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
      mount_point = "/mnt/<mount-name>",
      extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})

    // scala
    dbutils.fs.mount(
      source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
      mountPoint = "/mnt/<mount-name>",
      extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))

    where

    • <storage-account-name> is the name of your Azure Blob storage account.
    • <container-name> is the name of a container in your Azure Blob storage account.
    • <mount-name> is a DBFS path representing where the Blob storage container or a folder inside the container (specified in source) will be mounted in DBFS.
    • <conf-key> is either fs.azure.account.key.<storage-account-name>.blob.core.windows.net (to authenticate with a storage account access key) or fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net (to authenticate with a SAS for the container). A filled-in sketch follows step 2.
    • dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") gets the key that has been stored as a secret in a secret scope.
  2. Access files in your container as if they were local files, for example:

    # python
    df = spark.read.text("/mnt/<mount-name>/...")
    df = spark.read.text("dbfs:/<mount-name>/...")
    
    // scala
    val df = spark.read.text("/mnt/<mount-name>/...")
    val df = spark.read.text("dbfs:/<mount-name>/...")
    
    -- SQL
    CREATE DATABASE <db-name>
    LOCATION "/mnt/<mount-name>"
    

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")
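
If you are not sure which mount points exist on the cluster, a minimal Python sketch like the following lists them and unmounts the one created above; "/mnt/<mount-name>" is the same placeholder as before:

# python
# List all current DBFS mount points, then unmount the example mount
# if it is present.
for mount in dbutils.fs.mounts():
  print(mount.mountPoint, "->", mount.source)

if any(m.mountPoint == "/mnt/<mount-name>" for m in dbutils.fs.mounts()):
  dbutils.fs.unmount("/mnt/<mount-name>")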

Access Azure Blob storage directly

This section explains how to access Azure Blob storage using the Spark DataFrame API, the RDD API, and the Hive client.

Access Azure Blob storage using the DataFrame API

You need to configure credentials before you can access data in Azure Blob storage, either as session credentials or cluster credentials.

Run the following in a notebook to configure session credentials:

  • Set up an account access key:

    spark.conf.set(
      "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
      "<storage-account-access-key>")
    
  • Set up a SAS for a container:

    spark.conf.set(
      "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
      "<complete-query-string-of-sas-for-the-container>")
    

To configure cluster credentials, set Spark configuration properties when you create the cluster:

  • Configure an account access key:

    fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
    
  • Configure a SAS for a container:

    fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net <complete-query-string-of-sas-for-the-container>
    

Warning

These credentials are available to all users who access the cluster.

Once an account access key or a SAS is set up in your notebook or cluster configuration, you can use standard Spark and Databricks APIs to read from the storage account:

// scala
val df = spark.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")

dbutils.fs.ls("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")
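
The equivalent read from Python, reusing the same placeholders, is a one-line sketch:

# python
# Read Parquet data directly over wasbs using the credentials configured above.
df = spark.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")
df.show()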

Access Azure Blob storage using the RDD API

Hadoop configuration options are not accessible via SparkContext. If you use the RDD API to read from Azure Blob storage, set the Hadoop credential configuration properties as Spark configuration options when you create the cluster. Add the spark.hadoop. prefix to the corresponding Hadoop configuration keys so that they are propagated to the Hadoop configurations used by your RDD jobs:

  • Configure an account access key:

    spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
    
  • Configure a SAS for a container:

    spark.hadoop.fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net <complete-query-string-of-sas-for-the-container>
    

Warning

These credentials are available to all users who access the cluster.
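
With these cluster settings in place, a minimal Python sketch of an RDD read, reusing the placeholders above:

# python
# Read raw lines through the RDD API; the spark.hadoop.* properties set at
# cluster creation are propagated to the underlying Hadoop configuration.
rdd = spark.sparkContext.textFile(
  "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")
print(rdd.count())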

Access Azure Blob storage from the Hive client

Credentials set in a notebook’s session configuration are not accessible to the Hive client. To propagate the credentials to the Hive client, you must set Hadoop credential configuration properties as Spark configuration options when you create the cluster:

  • Configure an account access key:

    spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
    
  • Configure a SAS for a container:

    spark.hadoop.fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net <complete-query-string-of-sas-for-the-container>
    

Warning

These credentials are available to all users who access the cluster.

Once an account access key or a SAS is set up in your cluster configuration, you can use standard Hive queries with Azure Blob storage:

-- SQL
CREATE DATABASE <db-name>
LOCATION "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/";

The following notebook demonstrates mounting Azure Blob storage and accessing data through Spark APIs, Databricks APIs, and Hive.

Azure Blob storage notebook
