Learn how to set up Databricks Repos for version control. Once you set up Databricks Repos, you can use it to perform common Git operations such as clone, checkout, commit, push, pull, and branch management. You can also see diffs for your changes as you develop with notebooks and files in Databricks.
Databricks Repos uses a personal access token (PAT) or an equivalent credential to authenticate with your Git provider when it performs operations such as clone, push, and pull. To use Repos, you first need to add your Git PAT and Git provider username to Databricks. See Configure Git credentials & connect a remote repo to Databricks.
You can clone public remote repositories without Git credentials (a personal access token and a username). To modify a public remote repository, or to clone or modify a private remote repository, you must have a Git provider username and personal access token with read and write permissions for the remote repository.
Repos is enabled by default, but a workspace admin can turn the functionality on or off.
Databricks Repos supports just one Git credential per user, per workspace.
Select the down arrow next to the account name at the top right of your screen, and then select User Settings.
Select the Linked accounts tab.
If you’re adding credentials for the first time, follow the on-screen instructions. If you have previously entered credentials, click Config > Edit.
In the Git provider drop-down, select the provider name.
In the box provided, add your Git user name or email.
In the Token box, add a personal access token (PAT) or other credentials from your Git provider. For details, see Configure Git credentials & connect a remote repo to Databricks.
Databricks recommends that you set an expiration date for all personal access tokens.
For Azure DevOps, Git integration does not support Azure Active Directory tokens. You must use an Azure DevOps personal access token. See Connect to Azure DevOps project using a DevOps token.
If your organization has SAML SSO enabled in GitHub, authorize your personal access token for SSO.
Paste your Git provider PAT into the Token field.
Enter your username into the Git provider username field and click Save.
You can also save a Git PAT and username to Databricks by using the Databricks Repos API.
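As a minimal sketch of the API route, the following builds (but does not send) a request that would store a Git credential. It assumes the Git Credentials endpoint `/api/2.0/git-credentials` and the JSON field names `git_provider`, `git_username`, and `personal_access_token`. The workspace URL, the provider value `gitHub`, and the token strings are placeholders, so check the current API reference for your workspace before relying on it.

```python
import json
import urllib.request


def build_git_credential_request(workspace_url, databricks_token,
                                 git_provider, git_username, git_pat):
    """Build a POST request that would store a Git credential in Databricks.

    Assumption: the endpoint path and JSON field names below follow the
    Git Credentials REST API; verify them against your API reference.
    """
    payload = {
        "git_provider": git_provider,
        "git_username": git_username,
        "personal_access_token": git_pat,
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/git-credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + databricks_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only; never hard-code real tokens.
request = build_git_credential_request(
    "https://example.cloud.databricks.com",
    "<databricks-token>",
    "gitHub",
    "octocat",
    "<git-provider-pat>",
)
# Sending is left to the caller, for example:
# urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload first. Because Databricks Repos supports one Git credential per user, per workspace, a real script would likely need to handle the case where a credential already exists.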
If you are unable to clone the repo and you’re using Azure DevOps with Azure Active Directory authentication, see Issue with control access policy (CAP).
Databricks Repos needs network connectivity to your Git provider to function. Ordinarily, this is over the internet and works out of the box. However, you might have set up additional restrictions on your Git provider to control access. For example, you might have an IP allow list in place, or you might host your own private Git server through a service like GitHub Enterprise (GHE), Bitbucket Server (BBS), or GitLab Self-managed, and that server might not be accessible via the internet.
If your Git server is internet-accessible but has an IP allow list in place, for example GitHub allow lists, you must add Databricks control plane NAT IPs to the Git server’s IP allow list. For a list of control plane NAT IP addresses by region, see Databricks clouds and regions. Use the IP for the region that your Databricks workspace is in.
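Before updating the Git server's allow list, it can help to check which control plane NAT IPs it already covers. The following is a minimal sketch using Python's standard `ipaddress` module. The IP addresses are hypothetical values from documentation-reserved ranges, not real Databricks addresses, and the helper name `missing_from_allow_list` is made up for this example.

```python
import ipaddress


def missing_from_allow_list(required_ips, allow_list_cidrs):
    """Return the required IPs not covered by any CIDR range in the allow list."""
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in allow_list_cidrs]
    return [
        ip for ip in required_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Hypothetical values from documentation-reserved ranges (RFC 5737);
# substitute the control plane NAT IPs published for your workspace's region.
control_plane_nat_ips = ["203.0.113.10", "203.0.113.11"]
git_server_allow_list = ["203.0.113.10/32", "198.51.100.0/24"]

missing = missing_from_allow_list(control_plane_nat_ips, git_server_allow_list)
print(missing)  # prints ['203.0.113.11']
```

Any address the function returns still needs to be added to the Git server's allow list.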
If you are privately hosting a Git server, contact your Databricks representative for onboarding instructions for access.
Databricks Repos has many security features. The following sections walk you through their setup and use:
Use of encrypted Git credentials.
An allow list
Workspace access control
You can use AWS Key Management Service to encrypt a Git personal access token (PAT) or other Git credential. Using a key from an encryption service is referred to as a customer-managed key (CMK) or bring your own key (BYOK).
For more information, see Customer-managed keys for managed services.
A workspace admin can limit which remote repositories users can clone from and commit & push to. This helps prevent exfiltration of your code; for example, users cannot push code to an arbitrary repository if you have turned on the allow list restrictions. You can also prevent users from using unlicensed code by restricting clone operations to a list of allowed repositories.
To set up an allow list:
Go to the Admin Settings page.
Click the Workspace settings tab.
In the Repos section, choose an option from Repos Git Allow List:
Disabled (no restrictions): There are no checks against the allow list.
Restrict clone, commit & push to allowed Git repositories: Clone, commit, and push operations are allowed only for repository URLs in the allow list.
Only restrict commit & push to allowed Git repositories: Commit and push operations are allowed only for repository URLs in the allow list. Clone and pull operations are not restricted.
In the field next to Repos Git URL Allow List: Empty list, enter a comma-separated list of URL prefixes.
To allow access to all repositories, choose Disabled (no restrictions).
The list you save overwrites the existing set of saved URL prefixes.
It can take up to 15 minutes for the changes to take effect.
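The three allow list options can be modeled with a short sketch. This is an illustration of the behavior described above, not the actual Databricks implementation: the mode labels `disabled`, `restrict_all`, and `restrict_write` are hypothetical names, and plain string prefix matching is an assumption about how URL prefixes are compared.

```python
def parse_allow_list(raw):
    """Split a comma-separated string of URL prefixes, dropping blanks."""
    return [prefix.strip() for prefix in raw.split(",") if prefix.strip()]


def is_operation_allowed(operation, repo_url, mode, prefixes):
    """Model the three allow list options.

    mode is a hypothetical label: "disabled", "restrict_all"
    (clone, commit & push), or "restrict_write" (commit & push only).
    """
    if mode not in {"disabled", "restrict_all", "restrict_write"}:
        raise ValueError("unknown mode: " + mode)
    if mode == "disabled":
        return True
    restricted = {"commit", "push"}
    if mode == "restrict_all":
        # The option text lists clone, commit, and push; pull is not
        # listed, so this sketch leaves it unrestricted.
        restricted.add("clone")
    if operation not in restricted:
        return True
    # Assumption: a URL is allowed when it starts with any saved prefix.
    return any(repo_url.startswith(prefix) for prefix in prefixes)


prefixes = parse_allow_list("https://github.com/my-org/, https://gitlab.com/my-group/")
inside = "https://github.com/my-org/project.git"
outside = "https://github.com/other/project.git"

print(is_operation_allowed("push", inside, "restrict_write", prefixes))    # True
print(is_operation_allowed("push", outside, "restrict_write", prefixes))   # False
print(is_operation_allowed("clone", outside, "restrict_write", prefixes))  # True
print(is_operation_allowed("clone", outside, "restrict_all", prefixes))    # False
```

Ending each prefix with a trailing slash, as in the example, avoids accidentally matching a differently named organization that shares the same leading characters.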
Set permissions for a repo to control access. Permissions for a repo apply to all content in that repo. Menu options are Can Manage, Can Edit, Can Run and Can View.
When you create a repo, you have Can Manage permission on it. This lets you modify content in the Repo, perform Git operations or modify the remote repository. Develop in your own isolated Repo and collaborate on a shared code base via Git branches and PRs. Don’t give other users Can Edit or Can Manage access to your development Repo.
When audit logging is enabled, audit events are logged when you interact with a Databricks repo. For example, an audit event is logged when you create, update, or delete a Databricks repo, when you list all Databricks Repos associated with a workspace, and when you sync changes between your Databricks repo and the remote Git repo.
Databricks Repos scans code for access key IDs that begin with the prefix AKIA and warns the user before committing.
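To run a similar check locally before you commit, a regular expression is usually enough. The sketch below is an approximation and not the scanner Databricks uses. It assumes an access key ID is `AKIA` followed by 16 uppercase letters or digits, and the sample key is the well-known placeholder from AWS documentation.

```python
import re

# Assumption: an AWS access key ID is "AKIA" followed by 16 uppercase
# letters or digits (20 characters total).
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_access_key_ids(text):
    """Return (line_number, match) pairs for strings that look like access key IDs."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            findings.append((line_number, match.group()))
    return findings


sample = 'region = "us-west-2"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
findings = find_access_key_ids(sample)
print(findings)  # prints [(2, 'AKIAIOSFODNN7EXAMPLE')]
```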
By default, the admin setting for Repos doesn’t allow .ipynb notebook output to be committed. Workspace admins can change this setting:
Go to Admin settings > Workspace settings.
Under Repos > Allow Repos to Export IPYNB outputs, select Allow: IPYNB outputs can be toggled on.
When outputs are included, the visualization and dashboard configs are preserved with the .ipynb file format.
For information about configuring and committing .ipynb notebook outputs, see Control `.ipynb` notebook output commits.
For information about supported notebook types, see Export and import Databricks notebooks.
You can add settings for each notebook to your repo in a .databricks/commit_outputs file that you create manually.
Specify the notebooks whose outputs you want to include by using patterns similar to gitignore patterns.
The file contains positive and negative file path patterns. File path patterns include the notebook file extension, such as .ipynb.
Positive patterns enable outputs inclusion for matching notebooks.
Negative patterns disable outputs inclusion for matching notebooks.
Patterns are evaluated in order for all notebooks. Invalid paths or paths not resolving to .ipynb notebooks are ignored.
To include outputs from a notebook path folder/innerfolder/notebook.ipynb, use the following patterns:
**/*
folder/**
folder/innerfolder/note*
To exclude outputs for a notebook, check that none of the positive patterns match, or add a negative pattern in the correct spot in the configuration file. Negative (exclude) patterns start with !:
!folder/innerfolder/*.ipynb
!folder/**/*.ipynb
!**/notebook.ipynb
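The way positive and negative patterns interact can be sketched in a few lines. This is an illustration under stated assumptions, not the matching logic Databricks uses: it assumes that the last matching pattern wins (as in gitignore), and it uses Python's `fnmatch`, which only approximates gitignore globbing because its `*` also matches `/`. The function name `outputs_included` is hypothetical.

```python
from fnmatch import fnmatchcase


def outputs_included(notebook_path, patterns):
    """Decide whether outputs would be committed for notebook_path.

    Assumptions: patterns are applied in order and the last matching
    pattern wins; fnmatch only approximates gitignore globbing.
    """
    if not notebook_path.endswith(".ipynb"):
        return False  # only .ipynb notebooks are considered in this sketch
    included = False
    for raw in patterns:
        pattern = raw.strip()
        if not pattern:
            continue
        negative = pattern.startswith("!")
        if negative:
            pattern = pattern[1:]
        if fnmatchcase(notebook_path, pattern):
            included = not negative
    return included


path = "folder/innerfolder/notebook.ipynb"
print(outputs_included(path, ["folder/**"]))                          # True
print(outputs_included(path, ["**/*", "!**/notebook.ipynb"]))         # False
print(outputs_included(path, ["!folder/**/*.ipynb",
                              "folder/innerfolder/note*"]))           # True
```

Under the last-match assumption, placing a negative pattern after a broad positive one excludes the notebook, while placing it before has no effect. This is one reading of why the position of a negative pattern in the file matters.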
To delete a repository from your workspace:
Right-click the repository, and then select Move to trash.
In the dialog box, type the name of the repo you want to delete. Then, click Confirm & move to trash.