Tutorial: Query data with notebooks
This tutorial walks you through using the Databricks notebooks user interface to create a cluster and a notebook, create a table from a dataset, query the table, and display the query results.
Tip
You can also use the Databricks Terraform provider to create this article’s resources. See Create clusters, notebooks, and jobs with Terraform.
Requirements
You are logged into a Databricks workspace.
You have permission to create a cluster.
Note
If you do not have cluster control privileges, you can still complete most of the steps below as long as you have access to a cluster.
From the left sidebar on the landing page, you access fundamental workspace entities: the Workspace, Catalog, Workflows, and Compute. The workspace is the special root folder that stores your Databricks assets, such as notebooks and libraries.
For guidance about how to navigate a Databricks notebook, see Databricks notebook interface and controls.
Step 1: Create a cluster
A cluster is a collection of Databricks computation resources. To create a cluster:
In the sidebar, click
Compute.
On the Compute page, click Create Compute.
On the New Compute page, select 12.2 LTS (Scala 2.12, Spark 3.3.2) or higher from the Databricks Runtime version dropdown.
Click Create Cluster.
Step 2: Create a notebook
A notebook is a collection of cells that run computations on an Apache Spark cluster. For more information on using notebooks, see Introduction to Databricks notebooks. To create a notebook in the workspace:
In the sidebar, click
Workspace.
In your Home
folder, click the blue
Add button > Notebook.
Replace the default name of your notebook with your own title and select SQL in the language drop-down. This selection determines the default language of the notebook.
Attach the notebook to the cluster you created. Click the cluster selector in the notebook toolbar and select your cluster from the dropdown menu. If you don’t see your cluster, click More… and select the cluster from the dropdown menu in the dialog.
Step 3: Create a table
Create a table using data from a sample CSV data file available in Sample datasets, a collection of datasets mounted to What is the Databricks File System (DBFS)?, a distributed file system installed on Databricks clusters. You have two options for creating the table.
Option 1: Create a Spark table from the CSV data
Use this option if you want to get going quickly, and you only need standard levels of performance. Copy and paste this code snippet into a notebook cell:
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
Option 2: Write the CSV data to Delta Lake format and create a Delta table
Delta Lake offers a powerful transactional storage layer that enables fast reads and other benefits. Delta Lake format consists of Parquet files plus a transaction log. Use this option to get the best performance on future operations on the table.
Read the CSV data into a DataFrame and write out in Delta Lake format. This command uses a Python language magic command, which allows you to interleave commands in languages other than the notebook default language (SQL). Copy and paste this code snippet into a notebook cell:
%python diamonds = (spark.read .format("csv") .option("header", "true") .option("inferSchema", "true") .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv") ) diamonds.write.format("delta").mode("overwrite").save("/mnt/delta/diamonds")
Create a Delta table at the stored location. Copy and paste this code snippet into a notebook cell:
DROP TABLE IF EXISTS diamonds; CREATE TABLE diamonds USING DELTA LOCATION '/mnt/delta/diamonds/'
Run cells by pressing SHIFT + ENTER. The notebook automatically attaches to the cluster you created in Step 2 and runs the command in the cell.
Step 4: Query the table
Run a SQL statement to query the table for the average diamond price by color.
To add a cell to the notebook, mouse over the cell bottom and click the
icon.
Copy this snippet and paste it in the cell.
SELECT color, avg(price) AS price FROM diamonds GROUP BY color ORDER BY COLOR
Press SHIFT + ENTER. The notebook displays a table of diamond color and average price.
Step 5: Display the data
Display a chart of the average diamond price by color.
Next to the Table tab, click + and then click Visualization.
The visualization editor displays.
In the Visualization Type drop-down, verify that Bar is selected.
Clear the Horizontal chart checkbox.
Change the aggregation type for the y columns from Sum to Average.
Click Save.
Next steps
To learn more about the primary tools you use and tasks you can perform with Databricks Data Science & Engineering workspace, see: