Work with notebooks and other files in Databricks Repos

This article shows you how to work with notebooks and other files in Databricks Repos. In addition to syncing notebooks with a remote Git repository, you can sync any type of file, such as:

  • .py files

  • data files in .csv, .json, or any other format

  • .yaml configuration files

You can import and read these files within a Databricks repo. You can also view and edit plain text files in the UI.

To enable support for non-notebook files in Repos, see the section Enable Files in Repos.

Create a notebook or folder in the UI

To create a new notebook or folder in a repo, click the down arrow next to the repo name, and select Create > Notebook or Create > Folder from the menu.

Repo create menu

To move a notebook or folder in your workspace into a repo, navigate to the notebook or folder and select Move from the drop-down menu:

Move object

In the dialog, select the repo to which you want to move the object:

Move repo

You can import a SQL or Python file as a single-cell Databricks notebook, as shown in the example below.

  • Add the comment line -- Databricks notebook source at the top of a SQL file.

  • Add the comment line # Databricks notebook source at the top of a Python file.
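For example, a Python file that begins with the marker comment imports as a single-cell notebook; everything below the marker becomes the body of that cell (the print statement here is just an illustrative placeholder):

# Databricks notebook source
# Everything below the marker line becomes the body of a single notebook cell.
print("Imported as a one-cell Databricks notebook")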

Work with files in the UI

This section covers how to add non-notebook files to a repo and view and edit files.

Requirements

Databricks Runtime 8.4 or above.

Create a new file

The most common way to create a file in a repo is to clone a Git repository. You can also create a new file directly from the Databricks repo. Click the down arrow next to the repo name, and select Create > File from the menu.

repos create file

Import a file

To import a file, click the down arrow next to the repo name, and select Import.

repos import file

The import dialog appears. You can drag files into the dialog or click browse to select files.

repos import dialog

  • Only notebooks can be imported from a URL.

  • When you import a .zip file, Databricks automatically unzips the file and imports each file and notebook that is included in the .zip file.

Edit a file

To edit a file in a repo, click the filename in the Repos browser. The file opens and you can edit it. Changes are saved automatically.

When you open a Markdown (.md) file, the rendered view is displayed by default. To edit the file, click in the file editor. To return to preview mode, click anywhere outside of the file editor.

Refactor code

A best practice for code development is to modularize code so it can be easily reused. You can create custom Python files in a repo and make the code in those files available to a notebook using the import statement. For an example, see the example notebook.

To refactor notebook code into reusable files:

  1. From the Repos UI, create a new branch.

  2. Create a new source code file for your code.

  3. Add Python import statements to the notebook to make the code in your new file available to the notebook, as shown in the sketch after these steps.

  4. Commit and push your changes to your Git provider.
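As a minimal sketch of this workflow, suppose you move a shared helper into a hypothetical file utils.py at the repo root (the file and function names are illustrative):

# utils.py -- a regular Python file in the repo, not a notebook
def normalize_name(name):
    # Trim whitespace and lowercase a string.
    return name.strip().lower()

A notebook in the same repo can then import and call the helper directly:

from utils import normalize_name

normalize_name("  Databricks Repos  ")  # returns "databricks repos"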

Create and edit files and directories programmatically

Databricks Runtime 11.2 or above.

In a Databricks Repo, you can programmatically create directories and create and append to files. This is useful for creating or modifying an environment specification file, writing output from notebooks, or writing output from execution of libraries, such as TensorBoard.

Note

To disable this feature, set the cluster environment variable WSFS_ENABLE_WRITE_SUPPORT=false. For more information, see Environment variables.

Create a new directory

import os

os.mkdir('dir1')

Create a new file and write to it

with open('dir1/new_file.txt', "w") as f:
    f.write("new content")

Append to a file

with open('dir1/new_file.txt', "a") as f:
    f.write(" continued")

Delete a file

os.remove('dir1/new_file.txt')

Delete a directory

os.rmdir('dir1')

Programmatically read files from a repo

Databricks Runtime 8.4 or above.

You can programmatically read small data files in a repo, such as .csv or .json files, directly from a notebook. Programmatically creating or editing files is only supported in Databricks Runtime 11.2 and above.

import pandas as pd
df = pd.read_csv("./data/winequality-red.csv")
df
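The same approach works for other small files, such as a configuration file. This is a sketch assuming a hypothetical config.json at the repo root:

import json

# Hypothetical small configuration file checked into the repo root
with open("./config.json") as f:
    config = json.load(f)

print(config)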

You can use Spark to access files in a repo. Spark requires absolute file paths for file data. The absolute file path for a file in a repo is file:/Workspace/Repos/<user_folder>/<repo_name>/file.

You can copy the absolute or relative path to a file in a repo from the drop-down menu next to the file:

file drop down menu

The example below shows how to use os.getcwd() to build the full path to the file.

import os
spark.read.format("csv").load(f"file:{os.getcwd()}/my_data.csv")
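Equivalently, you can pass the absolute path form described above directly, replacing the placeholders with your own user folder and repo name (my_data.csv is an illustrative file name):

spark.read.format("csv").load("file:/Workspace/Repos/<user_folder>/<repo_name>/my_data.csv")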

Example notebook for working with non-notebook files in Repos

This notebook shows examples of working with arbitrary files in Databricks Repos.

Arbitrary Files in Repos example notebook


Work with Python and R modules

Requirements

Databricks Runtime 8.4 or above.

Import Python and R modules

The current working directory of your repo and notebook is automatically added to the Python path. When you work in the repo root, you can import modules from the root directory and all subdirectories.

To import modules from another repo, you must add that repo to sys.path. For example:

import sys
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")

# to use a relative path
import sys
import os
sys.path.append(os.path.abspath('..'))

You import functions from a module in a repo just as you would from a module saved as a cluster library or notebook-scoped library:

Python:

from sample import power
power.powerOfTwo(3)

R:

source("sample.R")
power.powerOfTwo(3)

Import Databricks Python notebooks

To distinguish between a regular Python file and a Databricks Python-language notebook exported in source-code format, Databricks adds the line # Databricks notebook source at the top of the notebook source code file.

When you import the notebook, Databricks recognizes it and imports it as a notebook, not as a Python module.

If you want to import the notebook as a Python module, you must edit the notebook in a code editor and remove the line # Databricks notebook source. Removing that line converts the notebook to a regular Python file.

Import precedence rules

When you use an import statement in a notebook in a repo, the library in the repo takes precedence over a library or wheel with the same name that is installed on the cluster.
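For example, if a hypothetical module my_lib exists both as a .py file in the repo and as a wheel installed on the cluster, the repo copy is the one imported; checking the module's __file__ attribute is one way to confirm which copy was loaded:

import my_lib  # hypothetical name that exists both in the repo and as a cluster library

# Expect a path under /Workspace/Repos/... rather than the cluster's site-packages
print(my_lib.__file__)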

Autoreload for Python modules

While developing Python code, if you are editing multiple files, you can use the following commands in any cell to force a reload of all modules.

%load_ext autoreload
%autoreload 2

Use Databricks web terminal for testing

You can use Databricks web terminal to test modifications to your Python or R code without having to import the file to a notebook and execute the notebook.

  1. Open web terminal.

  2. Change to the Repo directory: cd /Workspace/Repos/<path_to_repo>/.

  3. Run the Python or R file: python file_name.py or Rscript file_name.r.

Migration tips

If you are using %run commands to make Python or R functions defined in a notebook available to another notebook, or are installing custom .whl files on a cluster, consider including those custom modules in a Databricks repo. In this way, you can keep your notebooks and other code modules in sync, ensuring that your notebook always uses the correct version.

Migrate from %run commands

%run commands let you include one notebook within another and are often used to make supporting Python or R code available to a notebook. In this example, a notebook named power.py includes the code below.

# This code is in a notebook named "power.py".
def n_to_mth(n, m):
  print(n, "to the", m, "th power is", n**m)

You can then make functions defined in power.py available to a different notebook with a %run command:

# This notebook uses a %run command to access the code in "power.py".
%run ./power
n_to_mth(3, 4)

Using Files in Repos, you can directly import the module that contains the Python code and run the function.

from power import n_to_mth
n_to_mth(3, 4)

Migrate from installing custom Python .whl files

You can install custom .whl files onto a cluster and then import them into a notebook attached to that cluster. For code that is frequently updated, this process is cumbersome and error-prone. Files in Repos lets you keep these Python files in the same repo with the notebooks that use the code, ensuring that your notebook always uses the correct version.

For more information about packaging Python projects, see this tutorial.

Enable support for non-notebook files

To work with non-notebook files in Databricks Repos, you must be running Databricks Runtime 8.4 or above. If you are running Databricks Runtime 11.0 or above, support for arbitrary files is enabled by default.

If support for Files in Repos is not enabled, you still see non-notebook files in a Databricks repo, but you cannot work with them.

Enable Files in Repos

An admin can enable this feature as follows:

  1. Go to the Admin Console.

  2. Click the Workspace Settings tab.

  3. In the Repos section, click the Files in Repos toggle.

After the feature has been enabled, you must restart your cluster and refresh your browser before you can work with non-notebook files in Repos.

Additionally, the first time you access a repo after Files in Repos is enabled, you must open the Git dialog. The dialog indicates that you must perform a pull operation to sync non-notebook files in the repo. Select Agree and Pull to sync files. If there are any merge conflicts, another dialog appears giving you the option of discarding your conflicting changes or pushing your changes to a new branch.

Confirm Files in Repos is enabled

You can use the command %sh pwd in a notebook inside a repo to check if Files in Repos is enabled.

  • If Files in Repos is not enabled, the response is /databricks/driver.

  • If Files in Repos is enabled, the response is /Workspace/Repos/<path to notebook directory>.

Access files in Repos from a cluster that is using Databricks Container Services

You can access files in Repos on a cluster with Databricks Container Services (DCS) in Databricks Runtime 10.4 LTS and 9.1 LTS. Copy the Dockerfiles for these runtime versions from their public GitHub repos.

See Customize containers with Databricks Container Services.