Azure Storage Example (Python)

This notebook shows you how to create and query a table or DataFrame loaded from data stored in Azure Blob storage.

Step 1: Set the data location and type

There are two ways to access Azure Blob storage: account keys and shared access signatures (SAS). This notebook uses an account key; a SAS-based alternative is sketched after the setup code below.

To get started, we need to set the location and type of the file.

# Storage account credentials and file details (replace the placeholders with your own values)
storage_account_name = "STORAGE_ACCOUNT_NAME"
storage_account_access_key = "YOUR_ACCESS_KEY"

# A wasbs path has the form wasbs://<container>@<storage-account>.blob.core.windows.net/<path>
file_location = "wasbs://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/EXAMPLE_PATH"
file_type = "csv"

# Make the account key available to the WASB driver for this Spark session
spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)
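
If you prefer not to expose the account key, the WASB driver can instead be configured with a per-container SAS token. A minimal sketch, reusing storage_account_name from the cell above and assuming hypothetical CONTAINER_NAME and YOUR_SAS_TOKEN placeholders:

# Alternative: authenticate with a shared access signature (SAS) instead of the account key.
# CONTAINER_NAME and YOUR_SAS_TOKEN are placeholders for your own values.
container_name = "CONTAINER_NAME"
sas_token = "YOUR_SAS_TOKEN"

spark.conf.set(
  "fs.azure.sas." + container_name + "." + storage_account_name + ".blob.core.windows.net",
  sas_token)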

Step 2: Read the data

Now that we have specified our file metadata, we can create a DataFrame. Notice that we use an option to specify that we want to infer the schema from the file. We can also explicitly set this to a particular schema if we have one already.

First, let's create a DataFrame in Python.

# Read the file into a DataFrame, letting Spark infer the schema from the data
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
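
As noted above, schema inference can be replaced with an explicit schema when you already know the structure of the data. A minimal sketch, assuming hypothetical column names and types:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema; substitute your own column names and types
example_schema = StructType([
    StructField("EXAMPLE_COLUMN", StringType(), True),
    StructField("EXAMPLE_AGG", DoubleType(), True),
])

df = spark.read.format(file_type).schema(example_schema).load(file_location)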

Step 3: Query the data

Now that we have created our DataFrame, we can query it. For instance, we can select particular columns and display them.

display(df.select("EXAMPLE_COLUMN"))
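
Beyond selecting columns, the same DataFrame can also be filtered before display. A minimal sketch, assuming a hypothetical column and value:

from pyspark.sql import functions as F

# Keep only rows where the hypothetical EXAMPLE_COLUMN matches a value, then display them
display(df.filter(F.col("EXAMPLE_COLUMN") == "EXAMPLE_VALUE").select("EXAMPLE_COLUMN"))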

Step 4: (Optional) Create a view or table

If you want to query this data as a table, you can simply register it as a view or a table.

df.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")

We can query this view using Spark SQL. For instance, we can perform a simple aggregation. Notice how the %sql magic command lets us query the view with SQL directly from the notebook.

%sql

SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG) FROM YOUR_TEMP_VIEW_NAME GROUP BY EXAMPLE_GROUP
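
The same aggregation can be expressed with the DataFrame API if you prefer to stay in Python. A minimal sketch, using the same hypothetical columns:

from pyspark.sql import functions as F

# Group by the hypothetical EXAMPLE_GROUP column and sum EXAMPLE_AGG, mirroring the SQL above
display(df.groupBy("EXAMPLE_GROUP").agg(F.sum("EXAMPLE_AGG").alias("SUM_EXAMPLE_AGG")))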

Because the data is registered as a temporary view, it is available only within this notebook. If you'd like other users to be able to query the data, you can instead create a table from the DataFrame.

# Write the DataFrame as Parquet and register it as a table in the metastore
df.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")

This table persists across cluster restarts and allows other users, working in different notebooks, to query the data.
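
For example, any other notebook attached to the workspace could read the table back by name. A minimal sketch, assuming the table name used above:

# Load the saved table from the metastore and display it
permanent_df = spark.table("MY_PERMANENT_TABLE_NAME")
display(permanent_df)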