Git integration with Databricks Git folders

Important

Databricks has replaced the “Repos” feature with integrated Git folder functionality within Databricks workspaces. For more details on this change, read What happened to Databricks Repos?.

Databricks Git folders is a visual Git client and API in Databricks. It supports common Git operations such as cloning a repository, committing and pushing, pulling, branch management, and visual comparison of diffs when committing.

Within Git folders you can develop code in notebooks or other files and follow data science and engineering code development best practices using Git for version control, collaboration, and CI/CD.

Note

Git folders (Repos) are primarily designed for authoring and collaborative workflows.

What can you do with Databricks Git folders?

Databricks Git folders provides source control for data and AI projects by integrating with Git providers.

In Databricks Git folders, you can use Git functionality to:

  • Clone, push to, and pull from a remote Git repository.

  • Create and manage branches for development work, including merging, rebasing, and resolving conflicts.

  • Create notebooks (including IPYNB notebooks) and edit them and other files.

  • Visually compare differences upon commit and resolve merge conflicts.

For step-by-step instructions, see Run Git operations on Databricks Git folders (Repos).

Note

Databricks Git folders also has an API that you can integrate with your CI/CD pipeline. For example, you can programmatically update a Databricks repo so that it always has the most recent version of the code. For information about best practices for code development using Databricks Git folders, see CI/CD techniques with Git and Databricks Git folders (Repos).

For information on the kinds of notebooks supported in Databricks, see Export and import Databricks notebooks.

Supported Git providers

Databricks Git folders are backed by an integrated Git repository. The repository can be hosted by any of the cloud and enterprise Git providers listed in the following section.

Note

What is a “Git provider”?

A “Git provider” is the specific (named) service that hosts a source control model based on Git. Git-based source control platforms are hosted in two ways: as a cloud service hosted by the developing company, or as an on-premises service installed and managed by your own company on its own hardware. Many Git providers such as GitHub, Microsoft, GitLab, and Atlassian provide both cloud-based SaaS and on-premises (sometimes called “self-managed”) Git services.

When choosing your Git provider during configuration, you must be aware of the differences between cloud (SaaS) and on-premises Git providers. On-premises solutions are typically hosted behind a company VPN and might not be accessible from the internet. Usually, the on-premises Git providers have a name ending in “Server” or “Self-Managed”, but if you are uncertain, contact your company admins or review the Git provider’s documentation.

Note

If you are using “GitHub” as a provider and are still uncertain if you are using the cloud or on-premises version, see About GitHub Enterprise Server in the GitHub docs.

Cloud Git providers supported by Databricks

  • GitHub, GitHub AE, and GitHub Enterprise Cloud

  • Atlassian BitBucket Cloud

  • GitLab and GitLab EE

  • Microsoft Azure DevOps (Azure Repos)

  • AWS CodeCommit

On-premises Git providers supported by Databricks

  • GitHub Enterprise Server

  • Atlassian BitBucket Server and Data Center

  • GitLab Self-Managed

  • Microsoft Azure DevOps Server: A workspace admin must explicitly allowlist the URL domain prefixes for your Microsoft Azure DevOps Server if the URL does not match dev.azure.com/* or visualstudio.com/*. For more details, see Restrict usage to URLs in an allow list

If you are integrating an on-premises Git repo that is not accessible from the internet, a proxy for Git authentication requests must also be installed within your company’s VPN. For more details, see Set up private Git connectivity for Databricks Git folders (Repos).

To learn how to use access tokens with your Git provider, see Configure Git credentials & connect a remote repo to Databricks.

Resources for Git integration

Use the Databricks CLI 2.0 for Git integration with Databricks:

Read the following reference docs: