S3 (Python)

This notebook will show you how to create and query a table or DataFrame on AWS S3.

Step 1: Data location and type

There are two ways in Databricks to read from S3. You can either read data using an IAM Role or read data using Access Keys.

We recommend using IAM Roles in Databricks so that you can control which cluster can access which buckets. Access keys can show up in logs and table metadata and are therefore fundamentally insecure. If you do use keys, you'll have to escape any / characters in your keys with %2F.
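For example, here's a minimal sketch of escaping a secret key with Python's standard library (the credentials and bucket below are placeholders, not real values):

from urllib.parse import quote

ACCESS_KEY = "YOUR_ACCESS_KEY"   # placeholder
SECRET_KEY = "YOUR/SECRET/KEY"   # placeholder containing a "/"

ENCODED_SECRET_KEY = quote(SECRET_KEY, safe="")  # "/" becomes "%2F"
path = f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@YOUR_BUCKET/YOUR_PREFIX"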

Now to get started, we'll need to set the location and type of the file. We'll do this using widgets. Widgets allow us to parameterize the execution of this entire notebook. First we'll create them, then we'll be able to reference them throughout the notebook.

dbutils.widgets.text("file_location", "s3a://example/location", "Upload Location")
dbutils.widgets.dropdown("file_type", "csv", ["csv", "parquet", "json"])

Step 2: Reading the data

Now that we've specified our file metadata, we can create a DataFrame. You'll notice that we use an option to specify that we'd like to infer the schema from the file. We could also explicitly supply a schema if we already have one.

First, let's create a DataFrame in Python.

df = (spark.read.format(dbutils.widgets.get("file_type"))
      .option("inferSchema", "true")
      .load(dbutils.widgets.get("file_location")))
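If you already know the schema, here's a minimal sketch of supplying it explicitly instead of inferring it (the column names and types below are hypothetical; replace them with your file's actual columns):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema for illustration only.
schema = StructType([
    StructField("EXAMPLE_COLUMN", StringType(), True),
    StructField("EXAMPLE_AGG", IntegerType(), True),
])

df = (spark.read.format(dbutils.widgets.get("file_type"))
      .schema(schema)
      .load(dbutils.widgets.get("file_location")))

Supplying a schema avoids the extra pass over the data that inference requires, which matters for large files.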

Step 3: Querying the data

Now that we've created our DataFrame, we can query it. For instance, you can select particular columns and display them within Databricks.

display(df.select("EXAMPLE_COLUMN"))

Step 4: (Optional) Create a view or table

If you'd like to be able to query this data as a table, it's simple to register it as a view or a table.

df.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")

We can query this using Spark SQL. For instance, we can perform a simple aggregation. Notice how we can use %sql to query the view from SQL.

%sql

SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG) FROM YOUR_TEMP_VIEW_NAME GROUP BY EXAMPLE_GROUP
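If you prefer to stay in Python, the same aggregation can be run with spark.sql, which returns a DataFrame:

display(spark.sql("""
  SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG)
  FROM YOUR_TEMP_VIEW_NAME
  GROUP BY EXAMPLE_GROUP
"""))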

With this registered as a temp view, it will only be available to this particular notebook. If you'd like other users to be able to query this data, you can also create a table from the DataFrame.

df.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")

This table will persist across cluster restarts and allow users in different notebooks to query this data.
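As a quick sketch, reading the saved table back by name, e.g. from another notebook:

permanent_df = spark.table("MY_PERMANENT_TABLE_NAME")
display(permanent_df)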