Getting Started

Welcome to Databricks

This page shows you around the Databricks UI and gets you started running code. You can copy and paste the code cells below into Databricks or you can import the Introduction to Apache Spark notebook series to get started.

Tip

To get help at any time, click the question mark button at the top right-hand corner.

Creating a Cluster

Apache Spark clusters sit at the heart of Databricks. To execute code (including Spark code) or import data, you first need to create a cluster. Luckily, doing so is simple!

To create a cluster, click the Clusters icon in the left-side menu. The Clusters page lets you manage your clusters.

On the Clusters page, click Create Cluster in the upper-left corner. Then enter a name for the cluster and choose the configuration you want for it.

Once you’ve created a cluster, you can start executing code. To read more, see the Clusters documentation.

Creating a Notebook

To begin, click the Workspace icon in the main menu on the left side. Then click the down arrow on the right side of Workspace, select a folder, and choose Create > Notebook.

The Create Notebook dialog will appear:

  1. Enter a unique name for your notebook.
  2. For language, click the drop-down and choose the language you want to use.
  3. For cluster, click the drop-down and choose the cluster you created in the previous section.

Using a Notebook

Now that you’ve created a notebook, it’s time to start using it. First, attach it to a cluster; the option to do this appears right under the name of the notebook. Once the notebook is attached to a cluster, you can run some example commands.

To execute any of the commands in the sections below, type it into a cell and press Shift+Enter.

Predefined Variables

In Databricks, notebooks come with the most commonly needed Apache Spark variables already defined.

Tip

Do not create a SparkSession, SparkContext or SQLContext yourself in Databricks. Creating multiple contexts is not supported and can cause inconsistent behavior. Use the existing contexts provided with the notebook.

Description                   Variable Name
Spark Context                 sc
SQL Context / Hive Context    sqlContext
SparkSession (2.0 Only)       spark
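
As a quick check of the table above, the following sketch reports which of the predefined variables exist in a given namespace. The helper name predefined_report is hypothetical and is not part of Databricks or Spark; it only inspects a dictionary of names, so it should also run outside a notebook.

```python
# Variable names taken from the table of predefined variables above.
PREDEFINED_NAMES = ["sc", "sqlContext", "spark"]


def predefined_report(namespace):
    """Map each predefined variable name to the type name of its value.

    `namespace` is any dict of variable names to values, such as the
    result of globals(). Names that are not defined map to None.
    """
    report = {}
    for name in PREDEFINED_NAMES:
        if name in namespace:
            report[name] = type(namespace[name]).__name__
        else:
            report[name] = None
    return report
```

In a notebook cell, calling predefined_report(globals()) would be expected to return a type name for each variable the attached cluster provides, and None for any that is missing (for example spark, if the cluster does not provide a SparkSession).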

Running some Code


Start by inspecting the predefined variables: type a variable name into a cell and run it.

# A SparkSession is already created for you.
# Do not create another or unspecified behavior may occur.
spark

# A SQLContext is also already created for you.
# Do not create another or unspecified behavior may occur.
# As you can see below, the sqlContext provided is a HiveContext.
sqlContext

# A SparkContext is already created for you.
# Do not create another or unspecified behavior may occur.
sc

Now that we’ve seen the predefined variables, let’s run some real code!

1+1 # => 2

You should be able to see the answer to this immediately!

Create a DataFrame

Now that we’ve executed some simple code, let’s create a DataFrame. The code below does this in Python.

myDataFrame = sc.parallelize([('a', 1), ('b', 2), ('c', 3)]).toDF()
display(myDataFrame)

Initially the result is displayed as a table. Click the chart button and choose the bar chart to see a visualization of the values and their associated counts. To learn more, see the Visualizations documentation.
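
Because the tuples in the example carry no column names, toDF() falls back to the default names _1 and _2. Here is a minimal sketch of naming the columns instead, assuming the notebook's existing spark session is passed in. The helper make_letters_df and the column names letter and value are hypothetical choices for illustration.

```python
# Same rows as in the DataFrame example above.
ROWS = [("a", 1), ("b", 2), ("c", 3)]
# Hypothetical column names, chosen for illustration only.
COLUMNS = ["letter", "value"]


def make_letters_df(spark):
    """Build a DataFrame with named columns.

    `spark` is expected to be the SparkSession that the notebook already
    provides; this function does not create a session of its own.
    """
    return spark.createDataFrame(ROWS, schema=COLUMNS)
```

In a notebook attached to a cluster, display(make_letters_df(spark)) should then show letter and value as the column headers, which can make the chart options easier to read.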

Where to Go Next

We’ve now covered the basics of Databricks, including creating a cluster and using notebooks. To learn more about the platform, watch the videos below:

Getting Started Videos

  1. Introducing Databricks https://vimeo.com/130273206
  2. Cluster Manager and Jobs https://vimeo.com/156886719
  3. Collaboration https://vimeo.com/156886720
  4. Data Exploration https://vimeo.com/137874931
  5. Data Visualization https://vimeo.com/156886721