Work with notebooks and project files in Databricks Repos

This article walks you through working with notebooks and other files in Databricks Repos that are integrated with a remote Git repository.

In Databricks you can:

  • Clone a remote Git repository.

  • Work in notebooks or files.

  • Create and edit notebooks and other files.

  • Sync with a remote repository.

  • Create new branches for development work.

For other tasks, you work in your Git provider:

  • Creating a pull request (PR)

  • Resolving conflicts

  • Merging or deleting branches

  • Rebasing a branch

Clone a remote Git repository

After you clone a remote Git repository, you can work on its notebooks and other files in Databricks.

  1. Click Repos in the sidebar.

  2. Click Add Repo.

  3. In the Add Repo dialog, click Clone remote Git repo and enter the repository URL. Select your Git provider from the drop-down menu, optionally change the name to use for the Databricks repo, and click Create. The contents of the remote repository are cloned to the Databricks repo.


Create a notebook or folder

To create a new notebook or folder in a repo, click the down arrow next to the repo name, and select Create > Notebook or Create > Folder from the menu.


To move a notebook or folder in your workspace into a repo, navigate to the notebook or folder and select Move from the drop-down menu.

In the dialog, select the repo to which you want to move the object.

You can import a SQL or Python file as a single-cell Databricks notebook, as shown in the sketch after this list.

  • Add the comment line -- Databricks notebook source at the top of a SQL file.

  • Add the comment line # Databricks notebook source at the top of a Python file.
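
For example, a Python file saved with the following contents is imported as a single-cell Python notebook rather than as a plain file. The file body is an illustrative sketch; only the first comment line affects the import behavior.

# Databricks notebook source
import pandas as pd

def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with rows that contain missing values removed."""
    return df.dropna()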

Work with non-notebook files in a Databricks repo

This section covers how to add files to a repo and view and edit files.

Preview

This feature is in Public Preview.

Requirements

Databricks Runtime 8.4 or above.

Create a new file

The most common way to create a file in a repo is to clone a Git repository. You can also create a new file directly from the Databricks repo. Click the down arrow next to the repo name, and select Create > File from the menu.


Import a file

To import a file, click the down arrow next to the repo name, and select Import.


The import dialog appears. You can drag files into the dialog or click browse to select files.

  • Only notebooks can be imported from a URL.

  • When you import a .zip file, Databricks automatically unzips the file and imports each file and notebook that is included in the .zip file.

Edit a file

To edit a file in a repo, click the filename in the Repos browser. The file opens and you can edit it. Changes are saved automatically.

When you open a Markdown (.md) file, the rendered view is displayed by default. To edit the file, click in the file editor. To return to preview mode, click anywhere outside of the file editor.

Refactor code

A best practice for code development is to modularize code so it can be easily reused. You can create custom Python files in a repo and make the code in those files available to a notebook using the import statement. For an example, see the example notebook.

To refactor notebook code into reusable files (a minimal sketch follows these steps):

  1. From the Repos UI, create a new branch.

  2. Create a new source code file for your code.

  3. Add Python import statements to the notebook to make the code in your new file available to the notebook.

  4. Commit and push your changes to your Git provider.
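
The sketch below illustrates steps 2 and 3 with a hypothetical module, transforms.py, created at the root of the repo.

# transforms.py (hypothetical file created in step 2)
def normalize_column(df, column):
    """Scale the values in the given column to the range [0, 1]."""
    col_min, col_max = df[column].min(), df[column].max()
    df[column] = (df[column] - col_min) / (col_max - col_min)
    return df

In a notebook cell in the same repo (step 3), import the function directly:

import pandas as pd
from transforms import normalize_column

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})
df = normalize_column(df, "price")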

Access files in a repo programmatically

You can programmatically read small data files in a repo, such as .csv or .json files, directly from a notebook. You cannot programmatically create or edit files from a notebook.

import pandas as pd
df = pd.read_csv("./data/winequality-red.csv")
df

You can use Spark to access files in a repo. Spark requires absolute file paths for file data. The absolute file path for a file in a repo is file:/Workspace/Repos/<user_folder>/<repo_name>/file.
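
For example, the sketch below reads a CSV file named my_data.csv from the root of a repo; substitute your own user folder and repo name.

# Read a repo file with Spark using its absolute workspace path
df = spark.read.format("csv").load("file:/Workspace/Repos/<user_folder>/<repo_name>/my_data.csv")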

You can copy the absolute or relative path to a file in a repo from the drop-down menu next to the file.

The example below uses os.getcwd() to get the full path.

import os
# Prefix the working directory (the repo folder) with file: so Spark reads from the workspace filesystem
spark.read.format("csv").load(f"file:{os.getcwd()}/my_data.csv")

Example notebook

This notebook shows examples of working with arbitrary files in Databricks Repos.

Arbitrary Files in Repos example notebook


Work with Python and R modules

Preview

This feature is in Public Preview.

Requirements

Databricks Runtime 8.4 or above.

Import Python and R modules

The current working directory of your repo and notebook is automatically added to the Python path. When you work in the repo root, you can import modules from the root directory and all subdirectories.
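
For example, assuming the repo contains a module at utils/helpers.py (an illustrative path) that defines a function clean_data, a notebook at the repo root can import it directly:

# utils is a subdirectory of the repo root, so it is already on the Python path
from utils.helpers import clean_data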

To import modules from another repo, you must add that repo to sys.path. For example:

import os
import sys

# Add another repo to the Python path using an absolute path
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")

# Or use a relative path
sys.path.append(os.path.abspath(".."))

You import functions from a module in a repo just as you would from a module saved as a cluster library or notebook-scoped library:

Python:

from sample import power
power.powerOfTwo(3)

R:

source("sample.R")
power.powerOfTwo(3)

Import Databricks Python notebooks

To distinguish between a regular Python file and a Databricks Python-language notebook exported in source-code format, Databricks adds the line # Databricks notebook source at the top of the notebook source code file.

When you import the notebook, Databricks recognizes it and imports it as a notebook, not as a Python module.

If you want to import the notebook as a Python module, you must edit the notebook in a code editor and remove the line # Databricks notebook source. Removing that line converts the notebook to a regular Python file.

Import precedence rules

When you use an import statement in a notebook in a repo, the library in the repo takes precedence over a library or wheel with the same name that is installed on the cluster.
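
For example, if your repo contains a file named utils.py (an illustrative name) and a library named utils is also installed on the cluster, importing it from a notebook in the repo resolves to the repo file:

import utils

# The module path is expected to point into the repo, for example
# /Workspace/Repos/<user-name>/<repo-name>/utils.py, not to the cluster library.
print(utils.__file__)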

Autoreload for Python modules

While developing Python code, if you are editing multiple files, you can use the following commands in any cell to force a reload of all modules.

%load_ext autoreload
%autoreload 2

Use Databricks web terminal for testing

You can use Databricks web terminal to test modifications to your Python or R code without having to import the file to a notebook and execute the notebook.

  1. Open web terminal.

  2. Change to the repo directory: cd /Workspace/Repos/<path_to_repo>/.

  3. Run the Python or R file: python file_name.py or Rscript file_name.r.

Run jobs using notebooks in a remote repository

You can run a Databricks job using notebooks located in a remote Git repository. This is especially useful for managing CI/CD for production runs. See Create a job.