Run Git operations on Databricks Repos

This article describes how to clone a Git repo and perform other common Git operations with Databricks Repos. Databricks Repos is a visual Git client that integrates with the Databricks user interface and provides access to connected repos, which are represented as Git folders in your workspace.

Important

If you clone a Git repo using the CLI through a cluster’s web terminal, the files won’t display in the Databricks UI.

If you are unable to clone the repo and you’re using Azure DevOps with Microsoft Entra ID (formerly Azure Active Directory) authentication, see Issue with control access policy (CAP).

Add a repo and connect remotely later

You can create a new Repo in Databricks without a remote and add the remote Git repository URL later.

  1. To create a new Databricks Repo not linked to a remote Git repository, click the Add Repo button. Deselect Create repo by cloning a Git repository, enter a name for the Repo, and then click Create Repo.

    Add repo without connecting remotely.
  2. When you are ready to add the Git repository URL, click the down arrow next to the Databricks Repo name in the workspace to open the Repo menu, and select Git… to open the Git dialog.

    Repos menu: Add a Git repo URL.
  3. In the Git repo URL field, enter the URL for the remote repository and select your Git provider from the drop-down menu. Click Save.

    Git dialog settings tab.

Clone a repo connected to a remote repo

  1. In the sidebar, select Workspace > Repos.

  2. Click Add Repo.

    Add repo UI.
  3. In the Add Repo dialog, select Create repo by cloning a Git repository and enter the repository URL.

  4. Select your Git provider from the drop-down menu, optionally change the name to use for the Databricks repo, and click Create Repo. The contents of the remote repository are cloned to the Databricks repo.

    Clone from Repo UI.

At this stage, you have the option to clone only a subset of your repository’s directories using sparse checkout. This is useful if your repository is larger than the limits that Databricks supports.
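If you prefer to script the clone instead of using the Add Repo dialog, the following is a minimal sketch that builds, but does not send, a request to create a repo. It assumes the Repos REST endpoint POST /api/2.0/repos with the fields url, provider, path, and sparse_checkout. The workspace URL, token placeholder, repository URL, and the helper name build_create_repo_request are hypothetical and only for illustration.

```python
import json
import urllib.request


def build_create_repo_request(host, token, git_url, provider, path, cone_patterns=None):
    """Build (but do not send) a request that clones a Git repo into the workspace."""
    payload = {"url": git_url, "provider": provider, "path": path}
    if cone_patterns:
        # Optional: restrict the clone to the listed directories (sparse checkout).
        payload["sparse_checkout"] = {"patterns": list(cone_patterns)}
    return urllib.request.Request(
        url=f"{host}/api/2.0/repos",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, token, and repository, for illustration only.
request = build_create_repo_request(
    host="https://example.cloud.databricks.com",
    token="<personal-access-token>",
    git_url="https://github.com/example-org/example-repo.git",
    provider="gitHub",
    path="/Repos/someone@example.com/example-repo",
    cone_patterns=["parent/child/grandchild"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send the request: urllib.request.urlopen(request)
```

Leaving cone_patterns empty omits the sparse_checkout field, so the request describes a full clone.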

Access the Git dialog

You can access the Git dialog from a notebook or from the Databricks Repos browser.

  • From a notebook, click the button next to the name of the notebook that identifies the current Git branch.

    Git dialog button on notebook.
  • From the Databricks Repos browser, click the button to the right of the repo name. You can also right-click the repo name and select Git… from the menu.

    Git dialog button and Git menu in repo browser.

Pull changes from the remote Git repository

To pull changes from the remote Git repository, click Pull in the Git dialog. Notebooks and other files are updated automatically to the latest version in your remote Git repository.

Important

Git operations that pull in upstream changes clear the notebook state. For more information, see Incoming changes clear the notebook state.

Merge branches

The merge function in Databricks Repos merges one branch into another using git merge.

  • If there’s a merge conflict, resolve it in the Repos UI, as described later in this article.

  • If there’s no conflict, the merge is pushed to the remote Git repo using git push.

Rebase a branch on another branch

Rebasing alters the commit history of a branch. Like git merge, git rebase integrates changes from one branch into another. Rebase does the following:

  1. Saves the commits on your current branch to a temporary area.

  2. Resets the current branch to the chosen branch.

  3. Reapplies each individual commit previously saved on the current branch, resulting in a linear history that combines changes from both branches.

For an in-depth explanation of rebasing, see git rebase.
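The three steps above can be sketched as a toy model on lists of commit labels. This is only an illustration of the idea, not how Git or Databricks Repos implements rebase. The function name rebase, the fork_point parameter, and the commit labels are made up for the example.

```python
def rebase(current, onto, fork_point):
    """Toy model of rebase on lists of commit labels (oldest first)."""
    # 1. Save the commits that exist only on the current branch.
    saved = current[fork_point:]
    # 2. Reset the current branch to the chosen branch.
    rebased = list(onto)
    # 3. Reapply each saved commit. A reapplied commit is a new commit,
    #    shown here with a trailing apostrophe.
    for commit in saved:
        rebased.append(commit + "'")
    return rebased


main = ["A", "B", "E", "F"]
feature = ["A", "B", "C", "D"]  # forked from main after commit B

print(rebase(feature, main, fork_point=2))
```

In this model the original commits C and D are replaced by new commits, so the rebased branch no longer extends its old history. That is why updating a remote copy of a rebased branch needs a forced push.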

Warning

Using rebase can cause versioning issues for collaborators working in the same repo.

A common workflow is to rebase a feature branch on the main branch.

To rebase a branch on another branch:

  1. From the Branch menu in the Repos UI, select the branch you want to rebase.

  2. Select Rebase from the kebab menu.

    Git rebase function on the kebab menu.
  3. Select the branch you want to rebase on.

    The rebase operation integrates changes from the branch you choose here into the current branch.

Databricks Repos runs git commit and git push --force to update the remote Git repo.

Resolve merge conflicts

If an operation such as pull, rebase, or merge causes a merge conflict, the Repos UI shows a list of files with conflicts and options for resolving the conflicts.

You have two primary options:

  • Use the Repos UI to resolve the conflict.

  • Abort the Git operation, manually discard the changes in the conflicting file, and try the Git operation again.

Commit and push changes to the remote Git repository

When you have added new notebooks or files, or made changes to existing notebooks or files, the Repos UI highlights the changes.

Git dialog with changes highlighted.

Add a required commit message for the changes, and click Commit & Push to push these changes to the remote Git repository.

If you don’t have permission to commit to the default branch, such as main, create a new branch and use your Git provider interface to create a pull request (PR) to merge it into the default branch.

Note

  • Results are not included with a notebook commit. All results are cleared before the commit is made.

  • See instructions for resolving merge conflicts earlier in this article.

Switch to a different branch

You can switch to (checkout) a different branch using the branch dropdown in the Git dialog:

Git dialog: switch to a different branch.

Create a new branch

You can create a new branch based on an existing branch from the Git dialog:

Git dialog new branch.

Git reset

In Databricks Repos, you can perform a Git reset within the Databricks UI. Git reset in Databricks Repos is equivalent to git reset --hard combined with git push --force.

Git reset replaces the branch contents and history with the most recent state of another branch. You can use this when edits are in conflict with the upstream branch, and you don’t mind losing those edits when you reset to the upstream branch. Read more about git reset --hard.

Reset to an upstream (remote) branch

With git reset in this scenario:

  • You reset your selected branch (for example, feature_a) to a different branch (for example, main).

  • You also reset the upstream (remote) branch feature_a to main.

Important

When you reset, you lose all uncommitted and committed changes in both the local and remote versions of the branch.
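To make the effect concrete, here is a toy model that treats each branch as a list of commit labels. It is a sketch of the outcome described above, not of how Databricks Repos implements the operation. The function name reset_branch and the commit labels are hypothetical.

```python
def reset_branch(local, remote, branch, target):
    """Toy model: point `branch` at `target` locally, then overwrite the remote."""
    # Like `git reset --hard`: local history and uncommitted edits are replaced.
    local[branch] = list(remote[target])
    # Like `git push --force`: the remote copy of the branch is overwritten too.
    remote[branch] = list(local[branch])


remote = {"main": ["A", "B"], "feature_a": ["A", "B", "C"]}
local = {"main": ["A", "B"], "feature_a": ["A", "B", "C", "D"]}

reset_branch(local, remote, branch="feature_a", target="main")
print(local["feature_a"], remote["feature_a"])
```

After the call, commits C and D no longer appear on feature_a in either place, which is the data loss the warning above describes.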

To reset a branch to a remote branch:

  1. In the Repos UI from the Branch menu, choose the branch you want to reset.

    Branch selector in the Repos UI.
  2. Select Reset from the kebab menu.

    Git reset operation on the kebab menu.
  3. Select the branch to reset.

    Git reset --hard dialog.

Configure sparse checkout mode

Sparse checkout is a client-side setting that lets you clone and work with only a subset of the remote repository’s directories in Databricks. This is especially useful if your repository’s size exceeds the limits that Databricks supports.

You can use the Sparse Checkout mode when adding (cloning) a new repo.

  1. In the Add Repo dialog, open Advanced.

  2. Select Sparse checkout mode.

    Sparse checkout option in the Add Repo dialog.
  3. In the Cone patterns box, specify the cone checkout patterns you want. Separate multiple patterns by line breaks.

At this time, you can’t disable sparse checkout for a repo in Databricks.

How cone patterns work

To understand how cone patterns work in sparse checkout mode, see the following diagram representing the remote repository structure.

Remote repository structure without sparse checkout.

If you select Sparse checkout mode but do not specify a cone pattern, the default cone pattern is applied. This includes only the files in the root directory and no subdirectories, resulting in the following repo structure:

Sparse checkout: default cone pattern.

Setting the sparse checkout cone pattern as parent/child/grandchild results in all contents of the grandchild directory being recursively included. The files immediately in the /parent, /parent/child and root directory are also included. See the directory structure in the following diagram:

Sparse checkout: Specify parent-grandchild-child folder cone pattern.

You can add multiple patterns separated by line breaks.
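The inclusion rules described in this section can be sketched as a small path check. This is a simplified model of the behavior described here, not Git’s own implementation. The function name is_checked_out and the example file paths are hypothetical.

```python
from pathlib import PurePosixPath


def is_checked_out(file_path, cone_patterns):
    """Return True if file_path would be included under the given cone patterns."""
    parent_dir = PurePosixPath(file_path).parent
    # Files directly in the repository root are always included.
    if parent_dir == PurePosixPath("."):
        return True
    for pattern in cone_patterns:
        cleaned = pattern.strip().strip("/")
        if not cleaned:
            continue  # skip blank lines in the pattern list
        cone = PurePosixPath(cleaned)
        # Everything inside the cone directory is included recursively.
        if parent_dir == cone or cone in parent_dir.parents:
            return True
        # Files directly inside an ancestor of the cone directory are included.
        if parent_dir in cone.parents:
            return True
    return False


patterns = ["parent/child/grandchild"]
for path in [
    "README.md",
    "parent/notes.txt",
    "parent/child/grandchild/deep/model.py",
    "parent/child/other/data.csv",
    "sibling/config.yml",
]:
    print(path, is_checked_out(path, patterns))
```

With an empty pattern list, only files in the root directory pass the check, which corresponds to the default cone pattern.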

Note

Exclusion behaviors (!) are not supported in Git cone pattern syntax.

Modify sparse checkout settings

Once a repo is created, the sparse checkout cone pattern can be edited from Settings > Advanced > Cone patterns.

Note the following behavior:

  • Removing a folder from the cone pattern removes it from Databricks if there are no uncommitted changes.

  • Adding a folder via editing the sparse checkout cone pattern adds it to Databricks without requiring an additional pull.

  • Sparse checkout patterns cannot be changed to remove a folder when there are uncommitted changes in that folder.

    For example, a user edits a file in a folder and does not commit changes. She then tries to change the sparse checkout pattern to not include this folder. In this case, the pattern is accepted, but the actual folder is not deleted. She needs to revert the pattern to include that folder, commit changes, and then reapply the new pattern.

Note

You can’t disable sparse checkout for a repo that was created with Sparse Checkout mode enabled.

Make and push changes with sparse checkout

You can edit existing files and commit and push them from the Repos interface. When creating new folders of files, include them in the cone pattern you specified for that repo.

Including a new folder outside of the cone pattern results in an error during the commit and push operation. To fix it, edit the cone pattern to include the new folder you are trying to commit and push.

Control .ipynb notebook output commits

To use this feature, you need to enable commit .ipynb notebook outputs. See Allow committing .ipynb notebook output.

When you commit an .ipynb file, Databricks can create a config file to help you control how you commit outputs: .databricks/commit_outputs.

  1. If you have a .ipynb notebook file but no config file in your repo, open the Git Status modal.

  2. In the notification, click Create commit_outputs file.

    Notebook commit UI: Create commit_outputs file button.

Alternatively, if no config file is present, you can create one from the File menu. The File menu has a status and control that lets you automatically update the config file to include or exclude outputs for a specific notebook.

  1. On the File menu, select Commit notebooks outputs.

    Notebook editor: Commit notebooks outputs status and control.
  2. In the dialog box, confirm your choice to commit notebook outputs.

    Commit notebooks outputs dialog box.

Patterns for a repo config file

The commit outputs config file uses patterns similar to gitignore patterns and does the following:

  • Positive patterns enable outputs inclusion for matching notebooks.

  • Negative patterns disable outputs inclusion for matching notebooks.

  • Patterns are evaluated in order for all notebooks.

  • Invalid paths or paths not resolving to .ipynb notebooks are ignored.

Positive pattern: To include outputs from a notebook at the path folder/innerfolder/notebook.ipynb, use any of the following patterns:

**/*
folder/**
folder/innerfolder/note*

Negative pattern: To exclude outputs for a notebook, make sure that none of the positive patterns match it, or add a negative pattern at the appropriate position in the configuration file. Negative (exclude) patterns start with !:

!folder/innerfolder/*.ipynb
!folder/**/*.ipynb
!**/notebook.ipynb
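The following sketch shows one way such patterns could be evaluated. It rests on assumptions that this article does not confirm: that the last matching pattern wins, that outputs are excluded when nothing matches, and that a simplified subset of gitignore syntax (*, ?, and **) is enough. The function names pattern_to_regex and commits_outputs are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a simplified gitignore-style glob into a compiled regex."""
    if "/" not in pattern:
        pattern = "**/" + pattern  # a bare name may match at any depth
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # one character within a path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def commits_outputs(notebook_path, patterns):
    """Return True if the last matching pattern is positive (assumed semantics)."""
    if not notebook_path.endswith(".ipynb"):
        return False  # only .ipynb notebooks are considered
    include = False  # assumed default: outputs are not committed
    for raw in patterns:
        negative = raw.startswith("!")
        glob = raw[1:] if negative else raw
        if pattern_to_regex(glob).fullmatch(notebook_path):
            include = not negative
    return include


config = ["folder/**", "!folder/innerfolder/*.ipynb"]
print(commits_outputs("folder/report.ipynb", config))
print(commits_outputs("folder/innerfolder/notebook.ipynb", config))
```

Under these assumptions, order matters: the negative pattern only takes effect because it comes after the broader positive pattern.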

Use the Repos API

Manage Git provider personal access tokens (PATs) with the Repos API.
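As a rough sketch, registering a Git provider token could look like the request built below. It assumes the Git credentials endpoint POST /api/2.0/git-credentials with the fields git_provider, git_username, and personal_access_token. Check the API reference for your workspace before relying on it. The workspace URL, user name, token placeholders, and the helper name build_git_credential_request are hypothetical.

```python
import json
import urllib.request


def build_git_credential_request(host, token, git_provider, git_username, git_pat):
    """Build (but do not send) a request that stores a Git provider PAT."""
    payload = {
        "git_provider": git_provider,
        "git_username": git_username,
        "personal_access_token": git_pat,
    }
    return urllib.request.Request(
        url=f"{host}/api/2.0/git-credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # Databricks token, not the Git PAT
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
credential_request = build_git_credential_request(
    host="https://example.cloud.databricks.com",
    token="<databricks-token>",
    git_provider="gitHub",
    git_username="example-user",
    git_pat="<git-provider-pat>",
)
print(credential_request.get_method(), credential_request.full_url)
```

In practice, read both tokens from a secret store or environment variables rather than writing them into source files.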