DBFS(Python)

Overview

This notebook will show you how to create and query a table or DataFrame that you uploaded to DBFS. DBFS is the Databricks File System, which lets you store data for querying inside Databricks. This notebook assumes that you already have a file in DBFS that you would like to read from.

Step 1: File location and type

This notebook is written in Python, so the default cell type is Python. However, you can use other languages with the %LANGUAGE magic syntax; Python, Scala, SQL, and R are all supported.

First we'll need to set the location and type of the file; you set the file location when you uploaded the file. We'll do this using widgets, which allow us to parameterize the execution of this entire notebook. First we'll create them, then we'll be able to reference them throughout the notebook.

dbutils.widgets.text("file_location", "/uploads/data", "Upload Location")
dbutils.widgets.dropdown("file_type", "csv", ["csv", "parquet", "json"])
# This can be csv, parquet, json, or any other Spark data source.
# See: https://docs.databricks.com/spark/latest/data-sources/index.html
# for more information.

Step 2: Reading the data

Now that we've specified our file metadata, we can create a DataFrame. You'll notice that we use an option to tell Spark to infer the schema from the file. We can also supply an explicit schema if we already have one.
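
For example, if you already know the schema, a sketch like the following avoids the extra pass over the data that inference requires. The column names here are hypothetical placeholders; replace them with the columns in your file.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# A hypothetical schema for illustration; substitute your own column names and types.
example_schema = StructType([
    StructField("EXAMPLE_GROUP", StringType(), True),
    StructField("EXAMPLE_AGG", DoubleType(), True),
])

# Pass the schema explicitly instead of inferring it.
df_with_schema = (spark.read.format(dbutils.widgets.get("file_type"))
    .schema(example_schema)
    .load(dbutils.widgets.get("file_location")))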

Let's create a DataFrame in Python; notice how we programmatically reference the widget values we defined above.

df = (spark.read.format(dbutils.widgets.get("file_type"))
    .option("inferSchema", "true")
    .load(dbutils.widgets.get("file_location")))
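
At this point it's worth a quick sanity check on what Spark inferred. Assuming the DataFrame was created as above, a minimal sketch:

# Inspect the inferred schema and preview a few rows before querying further.
df.printSchema()
display(df.limit(10))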

Step 3: Querying the data

Now that we've created our DataFrame, we can query it. For instance, you can select particular columns and display them within Databricks.

display(df.select("EXAMPLE_COLUMN"))
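
You can also filter with the DataFrame API before displaying. The sketch below uses the same placeholder column name; substitute your own column and predicate.

from pyspark.sql import functions as F

# A hypothetical filter on a placeholder column.
filtered = df.where(F.col("EXAMPLE_COLUMN").isNotNull())
display(filtered)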

Step 4: (Optional) Create a view or table

If you'd like to query this data as a table, it is simple to register it as a view or a table.

df.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")

We can query this view using Spark SQL, for instance to perform a simple aggregation. Notice how we use %sql to query the view from a SQL cell.

%sql

SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG) FROM YOUR_TEMP_VIEW_NAME GROUP BY EXAMPLE_GROUP
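
The same query can also be run from a Python cell with spark.sql, which returns a DataFrame; a minimal sketch, assuming the view registered above:

# Equivalent of the SQL cell above, run from Python.
agg_df = spark.sql("""
    SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG)
    FROM YOUR_TEMP_VIEW_NAME
    GROUP BY EXAMPLE_GROUP
""")
display(agg_df)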

With this registered as a temp view, it will only be available to this particular notebook. If you'd like other users to be able to query it, you can instead create a table from the DataFrame.

df.write.format("delta").saveAsTable("MY_PERMANENT_TABLE_NAME")

This table will persist across cluster restarts and allow users in different notebooks to query this data.
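
From any notebook attached to the workspace, a sketch like this reads the saved table back, assuming the table name used above:

# Read the saved table back as a DataFrame; this works from other notebooks too.
permanent_df = spark.table("MY_PERMANENT_TABLE_NAME")
display(permanent_df)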