Estimated time to complete: 35 minutes
In this first task, we cover the fundamental information you need to know about how the Databricks Unified Data Analytics Platform works and the types of problems it solves.
Next is a tour of the Databricks workspace, which is the online web app that you and your colleagues will work in to access Databricks functionality. Before you get started, please have your credentials ready so that you can log into your Databricks workspace and follow along.
Now that you’ve toured the workspace, let’s review one of its components in greater detail: Databricks notebooks.
Databricks notebooks are the primary space where data practitioners perform their daily work. In this video, we review notebook basics: how to access them, how to use them, and how to manage them.
Now it’s your turn to practice performing basic tasks in the Databricks workspace. Use your credentials to log into your Databricks workspace and follow this guide to:
- Create a cluster
- Create a notebook
- Create a table
- Query the table
- Display data
- Schedule a job
As a supplement to this guide, check out the Quickstart Tutorial notebook, available on your Databricks workspace landing page, for a 5-minute hands-on introduction to Databricks. Simply log into your Databricks workspace and click Explore the Quickstart Tutorial.
A cluster is a collection of Databricks computation resources. To create a cluster:
In the sidebar, click Compute.
On the Compute page, click Create Cluster.
On the Create Cluster page, enter the cluster name Quickstart and select 7.3 LTS (Scala 2.12, Spark 3.0.1) in the Databricks Runtime Version drop-down.
Click Create Cluster.
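For readers who prefer to see the same settings as data, the cluster described above could also be written as a request body for the Databricks Clusters REST API. This is only a sketch under stated assumptions: the runtime key `7.3.x-scala2.12`, the node type `i3.xlarge`, and the worker count are illustrative values that do not come from this guide, and `build_cluster_spec` is a hypothetical helper. The code only builds and prints the JSON; it does not contact a workspace.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id, num_workers):
    """Return a dict shaped like a Clusters API create request (hypothetical helper)."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }


spec = build_cluster_spec(
    name="Quickstart",                 # cluster name used in this guide
    spark_version="7.3.x-scala2.12",   # assumed API key for 7.3 LTS (Scala 2.12)
    node_type_id="i3.xlarge",          # assumption: depends on your cloud provider
    num_workers=1,                     # assumption: smallest multi-node setup
)

# Print the body that would be sent; no network call is made here.
print(json.dumps(spec, indent=2))
```

The UI remains the simplest route for this exercise; a spec like this is mainly useful when cluster creation needs to be repeatable.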
A notebook is a collection of cells that run computations on a Databricks Runtime cluster. To create a notebook in the workspace:
In the sidebar, click Workspace.
In the Workspace folder, select Create > Notebook.
On the Create Notebook dialog, enter a name and select SQL in the Language drop-down. This selection determines the default language of the notebook.
Click Create. The notebook opens with an empty cell at the top.
Create a Spark table from a sample CSV data file available in Databricks datasets, a collection of datasets mounted to the Databricks File System (DBFS). DBFS is a distributed file system installed on Databricks clusters.
Copy and paste this code snippet into a notebook cell:
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds
USING CSV
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
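The `header "true"` option tells the CSV reader to treat the first row as column names rather than data. The effect can be illustrated outside Databricks with Python's standard `csv` module. This is a rough analogy, not how Spark is implemented, and the sample rows and helper names below are made up for illustration.

```python
import csv
import io

# Made-up sample shaped like a CSV file with a header row (not the real dataset).
SAMPLE_CSV = """carat,color,price
0.23,E,326
0.21,E,326
0.29,I,334
"""


def read_with_header(text):
    """Use the first row as column names, similar in spirit to header "true"."""
    return list(csv.DictReader(io.StringIO(text)))


def read_without_header(text):
    """Treat every row, including the first, as data."""
    return list(csv.reader(io.StringIO(text)))


rows = read_with_header(SAMPLE_CSV)
print(rows[0]["color"], rows[0]["price"])   # fields are addressed by column name
print(read_without_header(SAMPLE_CSV)[0])   # the header row shows up as data
```

In this Python sketch every value is read as a string until it is converted explicitly.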
Run a SQL statement to query the table for the average diamond price by color.
To add a cell to the notebook, mouse over the bottom of the cell and click the add-cell (+) icon.
Copy this snippet and paste it in the cell.
SELECT color, avg(price) AS price
FROM diamonds
GROUP BY color
ORDER BY color
Press SHIFT + ENTER. The notebook displays a table of diamond color and average price.
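To see what the query computes without a cluster, the same statement can be run against a tiny in-memory SQLite table. This is a sketch for intuition only: SQLite stands in for Spark SQL, the rows are made up rather than taken from the diamonds dataset, and `average_price_by_color` is a hypothetical helper.

```python
import sqlite3

# Made-up rows for illustration; these are not values from the real dataset.
SAMPLE_ROWS = [
    ("D", 300), ("D", 500),
    ("E", 400), ("E", 600), ("E", 800),
    ("J", 1000),
]


def average_price_by_color(rows):
    """Load rows into an in-memory table and run the guide's aggregation query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE diamonds (color TEXT, price INTEGER)")
        conn.executemany("INSERT INTO diamonds VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT color, avg(price) AS price "
            "FROM diamonds GROUP BY color ORDER BY color"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# One output row per color, sorted by color, with the mean price of that group.
for color, price in average_price_by_color(SAMPLE_ROWS):
    print(color, price)
```

GROUP BY collapses all rows that share a color into one row, and avg(price) is evaluated once per group.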
Display a chart of the average diamond price by color.
Click the Bar chart icon.
Click Plot Options.
Drag color into the Keys box.
Drag price into the Values box.
In the Aggregation drop-down, select AVG.
Click Apply to display the bar chart.
- In the sidebar, click Jobs.
- Enter a name in the text field to replace the placeholder text Untitled.
- Next to Task, click Select Notebook.
- Navigate to your user folder and select the notebook you were just working in.
- Next to Schedule, click Edit.
- Add the details for how often you want this job to run.
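The job settings chosen in the steps above could also be expressed as a request body for the Databricks Jobs REST API, where the schedule is given as a Quartz cron expression. The following is a sketch under stated assumptions: the job name, notebook path, cluster ID, and schedule are hypothetical placeholders, and `build_job_spec` is an illustrative helper. It only builds and prints the JSON and does not create a job.

```python
import json


def build_job_spec(name, notebook_path, cluster_id, cron, timezone):
    """Return a dict shaped like a Jobs API create request (hypothetical helper)."""
    return {
        "name": name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
    }


job = build_job_spec(
    name="Quickstart job",                                   # hypothetical job name
    notebook_path="/Users/someone@example.com/Quickstart",   # hypothetical path
    cluster_id="0000-000000-example0",                       # hypothetical cluster ID
    cron="0 0 6 * * ?",                                      # Quartz syntax: daily at 06:00
    timezone="UTC",
)

# Print the body that would be sent; no network call is made here.
print(json.dumps(job, indent=2))
```

Quartz cron expressions start with a seconds field, so they have one more leading field than the five-field Unix cron format.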