Azure Storage Example (Python)

This notebook shows you how to create and query a table or DataFrame loaded from data stored in Azure Blob storage.

Step 1: Set the data location and type

There are two ways to access Azure Blob storage: account keys and shared access signatures (SAS). This notebook uses an account key; a SAS-based alternative is sketched after the setup code below.

To get started, we need to set the location and type of the file.

# Storage account credentials and file details (replace the placeholders with your own values)
storage_account_name = "STORAGE_ACCOUNT_NAME"
storage_account_access_key = "YOUR_ACCESS_KEY"

# A wasbs path has the form wasbs://<container>@<storage-account>.blob.core.windows.net/<path>
file_location = "wasbs://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/EXAMPLE_PATH"
file_type = "csv"

# Make the account key available to the WASB driver for this Spark session
spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)
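
If you prefer not to expose the account key, the WASB driver can instead be configured with a per-container SAS token. A minimal sketch, reusing storage_account_name from the cell above and assuming hypothetical CONTAINER_NAME and YOUR_SAS_TOKEN placeholders:

# Alternative: authenticate with a shared access signature (SAS) instead of the account key.
# CONTAINER_NAME and YOUR_SAS_TOKEN are placeholders for your own values.
container_name = "CONTAINER_NAME"
sas_token = "YOUR_SAS_TOKEN"

spark.conf.set(
  "fs.azure.sas." + container_name + "." + storage_account_name + ".blob.core.windows.net",
  sas_token)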

Step 2: Read the data

Now that we have specified our file metadata, we can create a DataFrame. Notice that we use an option to specify that we want to infer the schema from the file. We can also explicitly set this to a particular schema if we have one already.

First, let's create a DataFrame in Python.

# Read the file into a DataFrame, letting Spark infer the schema from the data
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
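
As noted above, schema inference can be replaced with an explicit schema when you already know the structure of the data. A minimal sketch, assuming hypothetical column names and types:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema; substitute your own column names and types
example_schema = StructType([
    StructField("EXAMPLE_COLUMN", StringType(), True),
    StructField("EXAMPLE_AGG", DoubleType(), True),
])

df = spark.read.format(file_type).schema(example_schema).load(file_location)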

Step 3: Query the data

Now that we have created our DataFrame, we can query it. For instance, we can select particular columns and display them.

display(df.select("EXAMPLE_COLUMN"))
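
Beyond selecting columns, the same DataFrame can also be filtered before display. A minimal sketch, assuming a hypothetical column and value:

from pyspark.sql import functions as F

# Keep only rows where the hypothetical EXAMPLE_COLUMN matches a value, then display them
display(df.filter(F.col("EXAMPLE_COLUMN") == "EXAMPLE_VALUE").select("EXAMPLE_COLUMN"))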

Step 4: (Optional) Create a view or table

If you want to query this data as a table, you can simply register it as a view or a table.

df.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")

We can query this view using Spark SQL. For instance, we can perform a simple aggregation. Notice how the %sql magic command lets us query the view with SQL directly from the notebook.

%sql

SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG) FROM YOUR_TEMP_VIEW_NAME GROUP BY EXAMPLE_GROUP
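
The same aggregation can be expressed with the DataFrame API if you prefer to stay in Python. A minimal sketch, using the same hypothetical columns:

from pyspark.sql import functions as F

# Group by the hypothetical EXAMPLE_GROUP column and sum EXAMPLE_AGG, mirroring the SQL above
display(df.groupBy("EXAMPLE_GROUP").agg(F.sum("EXAMPLE_AGG").alias("SUM_EXAMPLE_AGG")))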

Because the data is registered as a temporary view, it is available only within this notebook. If you'd like other users to be able to query the data, you can instead create a table from the DataFrame.

# Write the DataFrame as Parquet and register it as a table in the metastore
df.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")

This table persists across cluster restarts and allows other users, working in different notebooks, to query the data.
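
For example, any other notebook attached to the workspace could read the table back by name. A minimal sketch, assuming the table name used above:

# Load the saved table from the metastore and display it
permanent_df = spark.table("MY_PERMANENT_TABLE_NAME")
display(permanent_df)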