Notebooks

Overview

Notebooks are one interface for interacting with Databricks. If you have enabled the Databricks Operational Security Package, you can use access control (see Managing Access Control) to manage sharing of notebooks and folders in the workspace.

Creating a Notebook

Creating a notebook in Databricks is simple. First, click the Workspace button (Workspace Icon) or the home icon in the left sidebar. From there, click the Menu Icon to the right of any folder name and choose Create > Notebook. As seen below, you are not required to create the notebook in the root workspace; you can create it within other folders as well.

Create Notebook

A dialog will appear where you can enter a name and choose the notebook’s primary language. Notebooks support Python, Scala, SQL, and R as their primary language.

Importing Notebooks

Importing a notebook is easy. Whether you’re importing from a URL or from a file, you follow the same basic steps.

In your Databricks workspace, click the Workspace button on the left, select the caret at the top of any folder hierarchy, and then choose Import from the dropdown menu.

../../_images/import-notebook-databricks.gif

Using Notebooks

Now that you’ve created a notebook, it’s time to start using it. You’ll first need to attach your notebook to a cluster and can do so by clicking “Detached” under the notebook’s name at the top left. From the dropdown, select the cluster you’d like to attach to or create a new cluster.

Now that you’ve attached your notebook to a cluster, you can run some Spark code!

To execute a command, type your code in a cell and press Shift+Enter to run it.
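For example, here is a minimal snippet you could try, assuming a Python notebook attached to a running cluster with Spark 2.0 or later (spark is the pre-defined SparkSession described below, and display is the built-in Databricks table-rendering function):

# Create a small DataFrame using the pre-defined SparkSession.
df = spark.range(10)
# Render it as a table below the cell.
display(df)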

To add cells, press the + icon or access the notebook cell menu at the far right by clicking the v icon.

Tip

Keyboard shortcuts make it much easier to use notebooks and execute code. These are available at the top right under the ? menu.

Predefined Variables

In Databricks, notebooks come with the most useful Apache Spark variables already defined. Do not create a SparkSession, SparkContext, or SQLContext yourself in Databricks; doing so will lead to inconsistent behavior. Use the existing contexts provided within the notebook (and cluster). The pre-defined variables are listed below.

Description                    Variable Name
Spark Context                  sc
SQL Context / Hive Context     sqlContext
SparkSession (Spark 2.0 only)  spark

Running Code

To run code in a notebook, type the code you would like to execute in a cell and either click the “>” at the top right of the cell or press Shift+Enter. This executes the cell. For example, try executing the Python code below.

# A SparkSession is already created for you.
# Do not create another or unspecified behavior may occur.
spark
# A SQLContext is also already created for you.
# Do not create another or unspecified behavior may occur.
# As you can see below, the sqlContext provided is a HiveContext.
sqlContext
# A Spark Context is already created for you.
# Do not create another or unspecified behavior may occur.
sc

Now that we’ve seen the pre-defined variables, let’s go ahead and run some real code!

1+1 # => 2

Mixing Languages in a Notebook

While a notebook has a default language, in Databricks you can mix languages by using the language magic command.

For example, you can execute code in any of the other supported languages by specifying one of the following magic commands at the beginning of a cell (see the example after this list).

  • %python - This allows you to execute Python code in a notebook (even if that notebook’s default language is not Python).
  • %sql - This allows you to execute SQL code in a notebook (even if that notebook’s default language is not SQL).
  • %r - This allows you to execute R code in a notebook (even if that notebook’s default language is not R).
  • %scala - This allows you to execute Scala code in a notebook (even if that notebook’s default language is not Scala).
  • %sh - This allows you to execute shell code in your notebook. Add the -e option to fail the cell (and subsequently a job or a run all command) if the shell command does not succeed. By default, %sh alone will not fail a job even if the shell command does not completely succeed; only %sh -e will fail if the shell command has a non-zero exit status.
  • %fs - This allows you to use Databricks Utilities - dbutils filesystem commands. Read more on the Databricks File System - DBFS and Databricks Utilities - dbutils pages.
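For instance, in a notebook whose default language is Python, a cell that begins with %sql runs as SQL. A minimal sketch (the column alias is purely illustrative):

%sql
-- This cell runs as SQL even though the notebook's default language is Python.
SELECT 1 AS example_value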

Markdown and HTML in Notebooks

Another option is to include rendered markdown in your notebooks via the %md magic command.

For example, the code below will render as a Markdown title.

%md # Hello This is a Title

You can link to other notebooks or folders in markdown cells using relative paths. Specify the href attribute of an anchor tag as the relative path, starting with a $ and then following the same pattern as in Linux/Unix file systems:

%md
<a href="$./myNotebook">Link to notebook in same folder as current notebook</a>
<a href="$../myFolder">Link to folder in parent folder of current notebook</a>
<a href="$./myFolder2/myNotebook2">Link to nested notebook</a>

Lastly, you can also include raw HTML in your notebooks by using the displayHTML function. See the HTML, D3 & SVG notebook for an example of how to do this.
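As a minimal sketch (assuming a Python cell; the HTML content is purely illustrative):

# displayHTML renders the given HTML string below the cell.
displayHTML("<h1>Hello from displayHTML</h1><p>This paragraph is rendered as raw HTML.</p>")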

Line and Command Numbers in Notebooks

To show line numbers or command numbers in your notebook, click View -> ‘Show line numbers’ or View -> ‘Show command numbers’. Once they’re shown, you can hide them again from the same menu. You can also enable line numbers with the keyboard shortcut Control + L.

Show line or command numbers via the view menu
Line and command numbers enabled in notebook

If you enable line or command numbers, your preference is saved and they will be shown in all of your other notebooks in that browser.

Command numbers above cells link to that specific command. If you click the command number for a cell, your URL is updated to anchor to that command. To link to a specific command in your notebook, right-click the command number and choose “copy link address”.

Python and Scala error highlighting

Python and Scala notebooks highlight errors in your code: the line of code that throws the error is highlighted in the cell. Additionally, if the error output is a stacktrace, the cell in which the error is thrown is displayed in the stacktrace as a link. You can click this link to jump to the offending code.

../../_images/notebook-python-error-highlighting.png ../../_images/notebook-scala-error-highlighting.png

Notebook Find and Replace

You can access the find and replace tool through the file dropdown.

../../_images/find-replace-in-dropdown.png

You can replace matches on an individual basis by clicking Replace.

The current match is highlighted in orange and all other matches are highlighted in yellow.

You can switch between matches by clicking the Prev and Next buttons, or by pressing Shift+Enter and Enter to go to the previous and next matches, respectively.

../../_images/find-replace-example.png

Close the find and replace tool by clicking the x button or pressing esc.

Downloading Results

Once you’ve run your code, you may want to download those results to your local machine. The simplest way is to click the download all results button at the bottom of a cell that contains tabular output. You’ll see an option to download the preview of the results or the full results.

You can try this out by running

%sql SELECT 1

and downloading the results.

Running a Notebook from Another Notebook

You can run a notebook from another notebook by using %run. This is roughly equivalent to a :load command in a Scala REPL on your local machine or an import statement in Python. All variables defined in the other notebook will become available in your current notebook.

For example, suppose you have Notebook A and Notebook B.

Notebook A contains a single cell with the following Python code:

x = 5

Running the code below in Notebook B will work even though x was never defined in Notebook B.

%run /Users/path/to/notebookA
print(x) # => 5

If you would like to specify a relative path, preface it with ./ or ../. For example, if Notebook A and Notebook B are in the same directory, you can alternatively run Notebook A using a relative path.

%run ./notebookA
print(x) # => 5
%run ../someDirectory/notebookA # up a directory and into another
print(x) # => 5

Exporting and Publishing Notebooks

You can also export notebooks from Databricks via the file menu. If you’re on Community Edition, you can also publish a notebook so that you can share a URL. Any subsequent “publish” actions will update the notebook at that URL.

Notebook Notifications

Notebook notifications alert you to certain events, such as which command is currently running during run all and which commands are in error state. When your notebook is showing multiple error notifications, the first one will have a link that allows you to clear all notifications at once.

Notebook notifications are enabled by default. You can disable them under User Settings -> Notebook Settings.

Notebook Isolation

Variable and Class Isolation

In Databricks notebooks, variables and classes that are not defined in a Scala package cell are only available in the current notebook. For example, two notebooks attached to the same cluster can define different variables and classes with the same name. It is worth noting that since user notebooks all execute locally on the same cluster VMs, there is no guaranteed user isolation within a cluster.

To define a class that is visible to all notebooks using the same cluster, you can define this class in a package cell. Then, you can access this class by using its fully qualified name, which is the same as accessing a class in an attached Scala/Java library.

Spark Session Isolation

Note

Spark Session Isolation is available in Spark 2.0.2-db1 and higher versions.

For a cluster running Apache Spark 2.0.0 or a higher version, every notebook has a pre-defined variable called spark representing a SparkSession. A SparkSession is the entry point for using Spark APIs as well as setting runtime configurations. For Spark 2.0.0 and Spark 2.0.1-db1, notebooks attached to a cluster share the same SparkSession. Starting with Spark 2.0.2-db1, the creator of a cluster has the option to enable Spark Session Isolation, which places every notebook attached to the cluster in its own session, i.e. every notebook uses its own SparkSession. To enable Spark Session Isolation, the creator of a cluster can set spark.databricks.session.share to false in the Spark Config field on the cluster creation page.
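The Spark Config field takes one space-separated key-value pair per line, so (as a sketch) the entry to disable session sharing would look like:

spark.databricks.session.share false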

Note

By default, users of Spark 2.0.2-db1 will share a single spark session (i.e. spark.databricks.session.share is true by default). Spark 2.1.0 and higher versions will have session isolation enabled by default.

By setting spark.databricks.session.share to false, every attached notebook is in its own session, which means:

  • Runtime configurations set using spark.conf.set or SQL’s set command affect only the current notebook. Note that metastore connection configurations are not runtime configurations, and all notebooks attached to a cluster share them.
  • Setting the current database only affects the current notebook.
  • Temporary views created by dataset.createTempView, dataset.createOrReplaceTempView, or SQL’s CREATE TEMPORARY VIEW command are only visible in the current notebook (see the sketch after this list).
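As an illustration of the last point, here is a sketch assuming two Python notebooks attached to the same cluster with Spark Session Isolation enabled; the view name my_temp_view is purely illustrative:

# Notebook 1: register a temporary view in this notebook's own SparkSession.
spark.range(5).createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view").show()  # works in Notebook 1

# Notebook 2 (same cluster, session isolation enabled):
# spark.sql("SELECT * FROM my_temp_view")
# would fail with a "Table or view not found" error, because the view
# exists only in Notebook 1's session.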

Starting with Apache Spark 2.1, you can use global temporary views to share temporary views across notebooks.
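A global temporary view is registered in the reserved global_temp database and can then be queried from any notebook attached to the same cluster. A minimal sketch, with shared_view as an illustrative name:

# Register a global temporary view.
spark.range(5).createGlobalTempView("shared_view")

# Any notebook attached to the same cluster can query it by qualifying
# the view name with the global_temp database:
spark.sql("SELECT * FROM global_temp.shared_view").show()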

It is worth noting that cells that trigger commands in other languages (i.e. cells using %scala, %python, %r, and %sql) and cells that include other notebooks (i.e. cells using %run) are part of the current notebook. Thus, these cells are in the same session as other regular notebook cells. In contrast, Notebook Workflows run a notebook with an isolated SparkSession, which means temporary views defined in such a notebook are not visible to other notebooks.

Note that since all notebooks attached to the same cluster execute on the same cluster VMs, there is no guaranteed user isolation within a cluster, even with Spark Session Isolation enabled.

Version Control

Databricks has basic version control for notebooks. To access version control, click the Revision History menu at the top right of every notebook. You can save revisions with comments, and those revisions are stored permanently. For more explicit versioning, Databricks also integrates with GitHub Version Control to store these revisions in GitHub. See that section for more details.

Deleting a Notebook

Since notebooks live inside the workspace (and in folders in the workspace), they follow the same rules as folders. See Accessing the Workspace Menu for more information about how to access the workspace menu and delete notebooks or other items in the workspace.