Repos for Git integration

Note

Support for arbitrary files in Databricks Repos is now in Public Preview. For details, see Work with non-notebook files in a Databricks repo and Import Python and R modules.

To support best practices for data science and engineering code development, Databricks Repos provides repository-level integration with Git providers. You can develop code in a Databricks notebook and sync it with a remote Git repository. Databricks Repos lets you use Git functionality such as cloning a remote repo, managing branches, pushing and pulling changes, and visually comparing differences upon commit.

Databricks Repos also provides an API that you can integrate with your CI/CD pipeline. For example, you can programmatically update a Databricks repo so that it always has the most recent code version.

Databricks Repos provides security features such as allow lists to control access to Git repositories and detection of clear text secrets in source code.

When audit logging is enabled, audit events are logged when you interact with a Databricks repo. For example, an audit event is logged when you create, update, or delete a Databricks repo, when you list all Databricks repos associated with a workspace, and when you sync changes between your Databricks repo and the Git remote.

For more information about best practices for code development using Databricks repos, see Best practices for integrating repos with CI/CD workflows.

Requirements

Databricks supports these Git providers:

  • GitHub
  • Bitbucket
  • GitLab
  • Azure DevOps

The Git server must be accessible from Databricks. Databricks does not support private Git servers, such as Git servers behind a VPN.

Support for arbitrary files in Databricks Repos is available in Databricks Runtime 8.4 and above.

Configure your Git integration with Databricks

  1. Click the Settings icon in your Databricks workspace and select User Settings from the menu.

  2. On the User Settings page, go to the Git Integration tab.

  3. Follow the instructions for integration with GitHub, Bitbucket Cloud, GitLab, or Azure DevOps.

    For Azure DevOps, Git integration does not support Azure Active Directory tokens. You must use an Azure DevOps personal access token.

  4. If your organization has SAML SSO enabled in GitHub, ensure that you have authorized your personal access token for SSO.

Enable support for arbitrary files in Databricks Repos

Preview

This feature is in Public Preview.

In addition to syncing notebooks with a remote Git repository, Files in Repos lets you sync any type of file, such as .py files, data files in .csv or .json format, or .yaml configuration files. You can import and read these files within a Databricks repo. You can also view and edit plain text files in the UI.

If support for this feature is not enabled, you will still see non-notebook files in your repo, but you will not be able to work with them.

Requirements

To work with non-notebook files in Databricks Repos, you must be running Databricks Runtime 8.4 or above.

Enable Files in Repos

An admin can enable this feature as follows:

  1. Go to the Admin Console.
  2. Click the Workspace Settings tab.
  3. In the Advanced section, click the Files in Repos toggle.
  4. Click Confirm.
  5. Refresh your browser.

Additionally, the first time you access a repo after Files in Repos is enabled, you must open the Git dialog. The dialog indicates that you must perform a pull operation to sync non-notebook files in the repo. Select Agree and Pull to sync the files. If there are merge conflicts, another dialog gives you the option of discarding your conflicting changes or pushing your changes to a new branch.

Clone a remote Git repository

You can clone a remote Git repository and work on your notebooks or files in Databricks. You can create notebooks, edit notebooks and other files, and sync with the remote repository. You can also create new branches for your development work. For some tasks you must work in your Git provider, such as creating a PR, resolving conflicts, merging or deleting branches, or rebasing a branch.

  1. Click Repos in the sidebar.

  2. Click Add Repo.

  3. In the Add Repo dialog, click Clone remote Git repo and enter the repository URL. Select your Git provider from the drop-down menu, optionally change the name to use for the Databricks repo, and click Create. The contents of the remote repository are cloned to the Databricks repo.

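You can also clone a repository programmatically through the Repos API. The following is a minimal sketch rather than a definitive implementation; it assumes the Python requests library, and the workspace URL, token, organization, and path values are placeholders you would substitute.

import requests

# Placeholders; substitute your workspace URL and personal access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-personal-access-token>"

# POST /api/2.0/repos clones a remote repository into the given workspace path.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/repos",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "url": "https://github.com/<org>/<repo-name>.git",
        "provider": "gitHub",
        "path": "/Repos/<user-folder>/<repo-name>",
    },
)
resp.raise_for_status()
print(resp.json())  # The response includes the repo ID used by other endpoints.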

Work with notebooks in a Databricks repo

To create a new notebook or folder in a repo, click the down arrow next to the repo name, and select Create > Notebook or Create > Folder from the menu.


To move a notebook or folder in your workspace into a repo, navigate to the notebook or folder and select Move from the drop-down menu.


In the dialog, select the repo to which you want to move the object.


You can import a SQL or Python file as a single-cell Databricks notebook.

  • Add the comment line -- Databricks notebook source at the top of a SQL file.
  • Add the comment line # Databricks notebook source at the top of a Python file.
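
For example, a hypothetical Python file named hello.py containing the following imports as a single-cell Python notebook:

# Databricks notebook source
print("Hello from a file imported as a one-cell notebook.")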

Work with non-notebook files in a Databricks repo

This section covers how to add files to a repo and view and edit files.

Preview

This feature is in Public Preview.

Requirements

Databricks Runtime 8.4 or above.

Create a new file

The most common way to create a file in a repo is to clone a Git repository. You can also create a new file directly from the Databricks repo. Click the down arrow next to the repo name, and select Create > File from the menu.


Upload a file

To upload a file from your local system, click the down arrow next to the repo name, and select Upload File(s). You can drag files into the dialog or click browse to select files.


Edit a file

To edit a file in a repo, click the filename in the Repos browser. The file opens and you can edit it. Changes are saved automatically.

Access files in a repo programmatically

You can programmatically read small data files in a repo, such as .csv or .json files, directly from a notebook. You cannot programmatically create or edit files from a notebook.

import pandas as pd

# Read a CSV file in the repo using a path relative to the notebook.
df = pd.read_csv("./data/winequality-red.csv")
df

You can use Spark to access files in a repo. Spark requires absolute file paths for file data. The absolute file path for a file in a repo is file:/Workspace/Repos/<user_folder>/<repo_name>/file.
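
For example, a sketch of an absolute-path read; the user folder, repo name, and file name are placeholders you would substitute:

# Read a CSV file in a repo using the absolute file: path.
df = spark.read.format("csv").load(
    "file:/Workspace/Repos/<user_folder>/<repo_name>/my_data.csv"
)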

You can copy the absolute or relative path to a file in a repo from the drop-down menu next to the file.


The example below uses os.getcwd() to get the full path:

import os

# Build an absolute path from the notebook's current working directory.
spark.read.format("csv").load(f"file:{os.getcwd()}/my_data.csv")

Example notebook

This notebook shows examples of working with arbitrary files in repos.


Work with Python and R modules

Preview

This feature is in Public Preview.

Requirements

Databricks Runtime 8.4 or above.

Import Python and R modules

The repo root directory and the current working directory of your notebook are automatically added to the Python path. When you work in the repo root, you can import modules from the root directory and all subdirectories.

To import modules from another repo, you must add that repo to sys.path. For example:

import sys
import os

# Append the absolute workspace path of the other repo.
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")

# Alternatively, append a path relative to the current working directory.
sys.path.append(os.path.abspath(".."))

You import functions from a module in a repo just as you would from a module saved as a cluster library or notebook-scoped library.

In Python:

from sample import power
power.powerOfTwo(3)

In R:

source("sample.R")
power.powerOfTwo(3)

Autoreload for Python modules

While developing Python code, if you are editing multiple files, you can run the %autoreload 2 magic command in any cell to force a reload of all modules, as in the sketch below.
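
A minimal sketch, assuming an IPython-based notebook environment in which the autoreload extension must be loaded before the magic is used:

# Load the autoreload extension, then reload all modules before running code.
%load_ext autoreload
%autoreload 2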

Sync with a remote Git repository

To sync with Git, use the Git dialog. The Git dialog lets you pull changes from your remote Git repository and commit and push your changes. You can also change the branch you are working on or create a new branch.

Important

Git operations that pull in upstream changes clear the notebook state. For more information, see Incoming changes clear the notebook state.

Open the Git dialog

You can access the Git dialog from a notebook or from the repos browser.

  • From a notebook, click the button at the top left of the notebook that identifies the current Git branch.

  • From the Repos browser, click the button to the right of the repo name.


    You can also click the down arrow next to the repo name, and select Git… from the menu.


Pull changes from the remote Git repository

To pull changes from the remote Git repository, click Pull in the Git dialog. Notebooks and other files are updated automatically to the latest version in your remote repository.

A message appears if there are merge conflicts. Databricks recommends that you resolve the merge conflict using your Git provider interface.

Commit and push changes to the remote Git repository

When you have added new notebooks or files, or made changes to existing notebooks or files, the Git dialog highlights the changes.


Add a required Summary of the changes, and click Commit & Push to push these changes to the remote Git repository.

If you don’t have permission to commit to the default branch, such as main, create a new branch and use your Git provider interface to create a pull request (PR) to merge it into the default branch.

Note

  • Results are not included with a notebook commit. All results are cleared before the commit is made.
  • If there are merge conflicts, Databricks recommends that you create a new branch, commit and push your changes to that branch, work in that branch, and resolve the merge conflict using your Git provider interface.

Create a new branch

You can create a new branch based on an existing branch from the Git dialog.


Control access to Databricks Repos

Manage permissions

When you create a repo, you have Can Manage permission. This lets you perform Git operations or modify the remote repository. You can clone public remote repositories without Git credentials (personal access token and username). To modify a public remote repository, or to clone or modify a private remote repository, you must have a Git provider username and personal access token with read and write permissions for the remote repository.

Use allow lists

An admin can limit which remote repos users can commit and push to.

  1. Go to the Admin Console.
  2. Click the Workspace Settings tab.
  3. In the Advanced section, click the Enable Repos Git URL Allow List toggle.
  4. Click Confirm.
  5. In the field next to Repos Git URL Allow List: Empty list, enter a comma-separated list of URL prefixes.
  6. Click Save.

Users can only commit and push to Git repositories whose URLs start with one of the prefixes you specify (for example, https://github.com/my-org/). The default setting is “Empty list”, which disables access to all repositories. To allow access to all repositories, disable Enable Repos Git URL Allow List.

Note

  • The list you save overwrites the existing set of saved URL prefixes.
  • It may take about 15 minutes for changes to take effect.

Secrets detection

Repos scans code for access key IDs that begin with the prefix AKIA and warns the user before committing.

Repos API

The Repos API update endpoint allows you to update a repo to the latest version of a specific Git branch or to a tag. This enables you to update the repo before you run a job against a notebook in the repo. For details, see Repos API 2.0.
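
For example, the following minimal sketch calls the update endpoint with the Python requests library; the workspace URL, token, repo ID, and branch name are placeholders you would substitute:

import requests

# Placeholders; substitute your workspace URL, token, and repo ID.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-personal-access-token>"
REPO_ID = "<repo-id>"

# PATCH /api/2.0/repos/{id} checks out the latest commit on the given branch.
resp = requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "main"},
)
resp.raise_for_status()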

Best practices for integrating repos with CI/CD workflows

This section includes best practices for integrating Databricks repos with your CI/CD workflow.


Admin workflow

Databricks Repos has user-level folders and non-user top-level folders. User-level folders are created automatically when users first clone a remote repository. You can think of repos in user folders as “local checkouts” that are individual for each user and where users make changes to their code.

Set up top-level repo folders

Admins can create non-user top-level folders. The most common use case for these top-level folders is to create Dev, Staging, and Production folders that contain repos at the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the Production folder would contain repos set to the main branch.

Typically permissions on these top-level folders are read-only for all non-admin users within the workspace.

Set up Git automation to update repos on merge

To ensure that repos are always at the latest version, you can set up Git automation to call the Repos API. In your Git provider, set up automation that, after every successful merge of a PR into the main branch, calls the Repos API endpoint on the appropriate repo in the Production folder to bring that repo to the latest version.

For example, on GitHub this can be achieved with GitHub Actions.

User workflow

To start a workflow, clone your remote repository into a user folder. A best practice is to create a new feature branch, or select a previously created branch, for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, create a pull request and follow the review and merge processes in Git.

Production job workflow

You can point jobs directly to notebooks in repos. When a job kicks off a run, it uses the current version of the code in the repo.

If the automation is set up as described in Admin workflow, every successful merge calls the Repos API to update the repo. As a result, jobs that are configured to run code from a repo always use the latest version available when the job run was created.

Migration tips

Preview

This feature is in Public Preview.

If you are using %run commands to make Python or R functions defined in a notebook available to another notebook, or are installing custom .whl files on a cluster, consider including those custom modules in a Databricks repo. In this way, you can keep your notebooks and other code modules in sync, ensuring that your notebook always uses the correct version.

Migrate from %run commands

%run commands let you include one notebook within another and are often used to make supporting Python or R code available to a notebook. In this example, a notebook named power.py includes the code below.

# This code is in a notebook named "power.py".
def n_to_mth(n, m):
    print(n, "to the", m, "th power is", n**m)

You can then make functions defined in power.py available to a different notebook with a %run command:

# This notebook uses a %run command to access the code in "power.py".
%run ./power
n_to_mth(3, 4)

Using Files in Repos, you can directly import the module that contains the Python code and run the function.

from power import n_to_mth
n_to_mth(3, 4)

Migrate from installing custom Python .whl files

You can install custom .whl files onto a cluster and then import them into a notebook attached to that cluster. For code that is frequently updated, this process is cumbersome and error-prone. Files in Repos lets you keep these Python files in the same repo with the notebooks that use the code, ensuring that your notebook always uses the correct version.

For more information about packaging Python projects, see this tutorial.

Limitations and FAQ

Incoming changes clear the notebook state

Git operations that alter the notebook source code result in the loss of the notebook state, including cell results, comments, revision history, and widgets. For example, Git pull can change the source code of a notebook. In this case, Databricks repos must overwrite the existing notebook to import the changes. Git commit and push or creating a new branch do not affect the notebook source code, so the notebook state is preserved in these operations.

What happens if a job starts running on a notebook while a Git operation is in progress?

At any point while a Git operation is in progress, some notebooks in the Repo may have been updated while others have not. This can cause unpredictable behavior.

For example, suppose notebook A calls notebook Z using a %run command. If a job running during a Git operation starts the most recent version of notebook A, but notebook Z has not yet been updated, the %run command in notebook A might start the older version of notebook Z. During the Git operation, the notebook states are not predictable and the job might fail or run notebook A and notebook Z from different commits.

How can I run non-Databricks notebook files in a repo? For example, a .py file?

With Files in Repos enabled, you can import the file as a module and run its functions from a notebook, as described in Import Python and R modules.

Can I create top-level folders that are not user folders?

Yes, admins can create top-level folders to a single depth. Repos does not support additional folder levels.

How and where are the GitHub tokens stored in Databricks? Who would have access from Databricks?

  • The authentication tokens are stored in the Databricks control plane, and a Databricks employee can only gain access through a temporary credential that is audited.
  • Databricks logs the creation and deletion of these tokens, but not their usage. Databricks logs Git operations, which can be used to audit usage of the tokens by the Databricks application.
  • GitHub Enterprise audits token usage. Other Git services may also offer Git server auditing.

Does Repos support Git submodules?

No. You can clone a repo that contains Git submodules, but the submodule is not cloned.

Does Repos support SSH?

No, only HTTPS.

Does Repos support .gitignore files?

Yes. If you add a file to your repo and do not want it to be tracked by Git, create a .gitignore file or use one cloned from your remote repository and add the filename, including the extension.

.gitignore works only for files that are not already tracked by Git. If you add a file that is already tracked by Git to a .gitignore file, the file is still tracked by Git.

Can I pull the latest version of a repository from Git before running a job without relying on an external orchestration tool?

No. Typically you can integrate this as a pre-commit hook on the Git server so that every push to a branch (such as main or prod) updates the Production repo.

Can I pull in .ipynb files?

Yes. However, the file renders as raw JSON, not in notebook format.

Are there limits on the size of a repo or the number of files?

Databricks does not enforce a limit on the size of a repo. Working branches are limited to 200 MB. Individual files are limited to 10 MB.

Databricks recommends that the total number of notebooks and files in a repo not exceed 1000.

You may receive an error message if these limits are exceeded. You may also receive a timeout error on the initial clone of the repo, but the operation might complete in the background.

Does Repos support branch merging?

No. Databricks recommends that you create a pull request and merge through your Git provider.

Are the contents of Databricks repos encrypted?

The contents of repos are encrypted by Databricks using a platform-managed key. Encryption using Customer-managed keys for managed services is not supported.

Can I delete a branch from a Databricks repo?

No. To delete a branch, you must work in your Git provider.

Where is Databricks repo content stored?

The contents of a repo are temporarily cloned onto disk in the control plane. Databricks notebook files are stored in the control plane database just like notebooks in the main workspace. Non-notebook files may be stored on disk for up to 30 days.

How can I disable Repos in my workspace?

Follow these steps to disable Repos for Git in your workspace.

  1. Go to the Admin Console.
  2. Click the Workspace Settings tab.
  3. In the Advanced section, click the Repos toggle.
  4. Click Confirm.
  5. Refresh your browser.

Files in Repos limitations

Preview

This feature is in Public Preview.

  • Files in Repos is not compatible with Spark Streaming. To use Spark Streaming, you must disable Files in Repos on the cluster by setting the Spark configuration spark.databricks.enableWsfs to false.
  • Native file reads are supported in Python and R notebooks. Native file reads are not supported in Scala notebooks, but you can use Scala notebooks with DBFS as you do today.
  • The diff view in the Git dialog is not available for files.
  • Only text-encoded files are rendered in the UI. To view files in Databricks, the files must not be larger than 10 MB.
  • You cannot create or edit a file from your notebook.

Troubleshooting

Error message: Invalid credentials

Try the following:

  • Confirm that the settings in the Git integration tab (User Settings > Git Integration) are correct.

    • You must enter both your Git provider username and token. Legacy Git integrations did not require a username, so you may need to add a username to work with repos.
  • Confirm that you have selected the correct Git provider in the Add Repo dialog.

  • Ensure your personal access token or app password has the correct repo access.

  • If SSO is enabled on your Git provider, authorize your tokens for SSO.

  • Test your token with command line Git. Both of these options should work:

    git clone https://<username>:<personal-access-token>@github.com/<org>/<repo-name>.git
    
    git clone -c http.sslVerify=false -c http.extraHeader='Authorization: Bearer <personal-access-token>' https://agile.act.org/
    

Error message: Secure connection could not be established because of SSL problems

<link>: Secure connection to <link> could not be established because of SSL problems

This error occurs if your Git server is not accessible from Databricks. Private Git servers are not supported.

Timeout errors

Expensive operations such as cloning a large repo or checking out a large branch may hit timeout errors, but the operation might complete in the background. You can also try again later if the workspace was under heavy load at the time.

404 errors

If you get a 404 error when you try to open a non-notebook file, wait a few minutes and then try again. There is a delay of a few minutes between when the feature is enabled in the workspace and when the webapp picks up the configuration flag.

Resource not found errors after pulling non-notebook files into a Databricks repo

This error can occur if you are not using Databricks Runtime 8.4 or above. A cluster running Databricks Runtime 8.4 or above is required to work with non-notebook files in a repo.

Errors suggesting re-cloning

There was a problem with deleting folders. The repo could be in an inconsistent state and re-cloning is recommended.

This error indicates that a problem occurred while deleting folders from the repo. This could leave the repo in an inconsistent state, where folders that should have been deleted still exist. If this error occurs, Databricks recommends deleting and re-cloning the repo to reset its state.

Unable to set repo to most recent state. This may be due to force pushes overriding commit history on the remote repo. Repo may be out of sync and re-cloning is recommended.

This error indicates that the local and remote Git state have diverged. This can happen when a force push on the remote overrides recent commits that still exist on the local repo. Databricks does not support a hard reset within Repos and recommends deleting and re-cloning the repo if this error occurs.

My admin enabled Files in Repos, but expected files do not appear after cloning a remote repository or pulling files into an existing one

  • You must refresh your browser and restart your cluster to pick up the new configuration.
  • Your cluster must be running Databricks Runtime 8.4 or above.