This feature is in Public Preview.
You can sync your work in Databricks with a remote Git repository. This makes it easier to implement development best practices. Databricks supports integrations with GitHub, Bitbucket, and GitLab.
Repos are folders whose contents are co-versioned together by syncing them to a remote Git repository. Repos can contain only Databricks notebooks and sub-folders. The linked Git repository can contain other files, but they won’t appear in the Databricks workspace.
- Repos is supported on standard AWS deployments. HIPAA-compliant deployments are not supported.
- GitHub, Bitbucket, and GitLab are supported as Git providers, provided your Git server is accessible from the Databricks control plane. Private Git servers, such as Git servers behind a VPN, are not supported.
- Go to the Admin Console.
- Select the Advanced tab.
- Click the Enable button next to Repos.
- Click Confirm. You may need to refresh your browser to see the new icon.
When your workspace is enabled for Repos, you’ll see the Repos icon in your workspace’s sidebar.
- Click the profile icon in your Databricks workspace and select User Settings from the menu.
- On the User Settings page, go to the Git Integration tab.
- Follow the instructions for integration with GitHub, Bitbucket Cloud, or GitLab.
- If your organization has SAML SSO enabled in GitHub, ensure that you have authorized your personal access token for SSO.
After you have created a repo, you can develop notebooks in the repo and sync with your remote Git repository.
To create a new notebook or folder in a repo, click the down arrow next to the repo name, and select Create > Notebook or Create > Folder from the menu.
To move a notebook or folder in your workspace into a repo, navigate to the notebook or folder and select Move from the drop-down menu:
In the dialog, select the repo to which you want to move the object:
You can import SQL and Python files as single-cell Databricks notebooks.
- Add the comment line `-- Databricks notebook source` at the top of a SQL file.
- Add the comment line `## Databricks notebook source` at the top of a Python file.
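As a sketch of what such an importable file looks like, the snippet below writes a minimal Python file whose first line carries the notebook source marker (the filename and directory are hypothetical):

```python
import tempfile
from pathlib import Path

# The first line marks the file as a Databricks notebook source; on
# import, the file becomes a single-cell Python notebook.
source = "## Databricks notebook source\nprint('hello from a repo notebook')\n"

tmpdir = tempfile.mkdtemp()
path = Path(tmpdir, "example_notebook.py")  # hypothetical filename
path.write_text(source)
```

The marker must be the very first line of the file; for SQL files, use `-- Databricks notebook source` instead.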
To sync with Git, use the Git dialog. The Git dialog lets you pull changes from your remote Git repository and push and commit changes. You can also change the branch you are working on or create a new branch.
Performing Git operations using the dialog clears notebook comments and revision history. For more information, see Limitations and FAQ.
You can access the Git dialog from a notebook or from the repos browser.
From a notebook, click the button at the top left of the notebook that identifies the current Git branch.
From the repos browser, you can click the button next to the repo name:
You can also click the down arrow next to the repo name, and select Git… from the menu.
To pull changes from the remote Git repository, click Pull in the Git dialog. Notebooks are updated automatically to the latest version in your remote repository.
A message appears if there are merge conflicts. Databricks recommends that you resolve the merge conflict using your Git provider interface.
When you have added new notebooks or made changes to existing notebooks, the Git dialog indicates the files that have changed.
Add a required Summary of the changes, and click Commit & Push to push these changes to the remote Git repository.
If you don’t have permission to commit to the master branch, create a new branch and use your Git provider interface to create a pull request (PR) to merge it into the master branch.
If there are merge conflicts, Databricks recommends that you create a new branch, commit and push your changes to that branch, work in your own branch, and resolve the merge conflict using your Git provider interface.
When you create a repo, you have Can Manage permission. This lets you perform Git operations or modify the remote repository. You can clone public remote repositories without Git credentials (personal access token and username). To modify a public remote repository, or to clone or modify a private remote repository, you must have a Git provider username and personal access token with read and write permissions for the remote repository.
This feature is in Private Preview. To try it, reach out to your Databricks contact.
The Repos API update endpoint allows you to update a repo to the latest version of a specific Git branch. This lets you update the repo before you run a job against a notebook in it.
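As a sketch of how an orchestration script might call this endpoint, the helper below builds (but does not send) the HTTP request. Because the feature is in preview, the endpoint path and payload shape shown here are assumptions, and the workspace host, token, and repo ID are placeholders:

```python
import json
import urllib.request

def build_repo_update_request(host, token, repo_id, branch):
    """Build a request that moves a repo to the head of `branch`.
    Endpoint path and payload are assumptions based on the preview
    Repos API; confirm against your workspace's API documentation."""
    url = f"https://{host}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # personal access token
            "Content-Type": "application/json",
        },
    )

req = build_repo_update_request("example.cloud.databricks.com", "<token>", 123, "main")
# Sending the request (urllib.request.urlopen(req)) would update the repo.
```

A job orchestrator or CI step would send this request immediately before triggering the job run, so the run sees the branch head.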
This section includes best practices for integrating Databricks repos with your CI/CD workflow. The following figure shows an overview of the steps.
Repos have user-level folders and non-user top-level folders. User-level folders are automatically created when users first clone a remote repository. You can think of repos in user folders as “local checkouts” that are individual for each user and where users make changes to their code.
Admins can create non-user top-level folders. The most common use case for these top-level folders is to create Dev, Staging, and Production folders that contain repos for the appropriate versions or branches for development, staging, and production. For example, if your company uses the Main branch for production, the Production folder would contain repos configured to be at the Main branch.
Typically permissions on these top-level folders are read-only for all non-admin users within the workspace.
To ensure that repos are always at the latest version, you can set up Git automation to call the Repos API. In your Git provider, set up automation that, after every successful merge of a PR into the Main branch, calls the Repos API endpoint on the appropriate repo in the Production folder to bring that repo to the latest version.
For example, on GitHub this can be achieved with GitHub Actions.
To start a workflow, clone your remote repository into a user folder. A best practice is to create a new feature branch, or select a previously created branch, for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, create a pull request and follow the review and merge processes in Git.
You can point jobs directly to notebooks in repos. When a job kicks off a run, it uses the current version of the code in the repo.
If the automation is set up as described in Admin workflow, every successful merge calls the Repos API to update the repo. As a result, jobs that are configured to run code from a repo always use the latest version available when the job run was created.
This happens because the notebook is re-imported, which deletes and re-creates all of its cells in the workspace.
- Libraries and MLflow experiments are not supported. You can use notebook experiments in repos.
- Non-notebook files such as .txt, .csv, .md, or .yaml files are not supported.
- The remote Git repository may contain other files, but they will not appear in Databricks.
- Databricks exports notebook source as `.py` files for easier readability and diffing in your Git provider. However, those files contain additional metadata that identifies them as Databricks notebook source files. Arbitrary `.py` files are not available or referenceable.
- In Databricks Runtime 7.1 and above and Databricks Runtime 7.1 ML and above, `%pip install` support allows you to access private repositories to load Python libraries into notebooks.
You can use any of the following:
- Bundle and deploy as a library on the cluster.
- Pip install the Git repository directly. This requires a credential in secrets manager.
- Use `%run` with inline code in a notebook.
- Use a custom container image. See Customize containers with Databricks Container Services.
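For the "pip install the Git repository directly" option, one common approach is a `git+https` requirement spec that embeds a personal access token. The helper below only constructs that spec string; the host, org, repo, and branch names are hypothetical, and in a notebook you would fetch the token from a secret scope (for example with `dbutils.secrets.get`) rather than hard-coding it:

```python
def private_pip_spec(host: str, org: str, repo: str, token: str, ref: str = "main") -> str:
    """Build a pip requirement spec for a private Git repository.

    In a Databricks notebook you would obtain `token` from a secret
    scope, e.g. token = dbutils.secrets.get("my-scope", "github-token")
    (scope and key names are hypothetical), then run:
        %pip install <spec>
    """
    return f"git+https://{token}@{host}/{org}/{repo}.git@{ref}"

spec = private_pip_spec("github.com", "my-org", "my-lib", "<personal-access-token>")
```

Keeping the token in a secret scope avoids committing credentials to the repo itself.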
Yes, admins can create top-level folders to a single depth. Repos does not support additional folder levels.
- The authentication tokens are stored in the Databricks control plane, and a Databricks employee can only gain access through a temporary credential that is audited.
- Databricks logs the creation and deletion of these tokens, but not their usage. Databricks has logging that tracks Git operations that could be used to audit the usage of the tokens by the Databricks application.
- GitHub Enterprise audits token usage. Other Git services may also have Git server auditing.
Can I pull the latest version of a repository from Git before running a job without relying on an external orchestration tool?
No. Typically you can integrate this as a pre-commit hook on the Git server so that every push to a branch (main/prod) updates the Production repo.
Repo size is limited to 100 MB. Working branches are limited to 30 MB.
Databricks recommends no more than 200 notebooks in a repo. You may receive timeout errors with a large number of notebook files. You may also receive a timeout error on the initial clone of the repo, but the operation might complete in the background.
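To check a local clone against the ~200-notebook guidance before pushing, you can count the files that carry the notebook source marker. This is a sketch that works on any local checkout; the demo directory and filenames are hypothetical:

```python
import tempfile
from pathlib import Path

MARKERS = ("## Databricks notebook source", "-- Databricks notebook source")

def count_notebook_sources(root):
    """Count .py/.sql files under `root` whose first line is the
    Databricks notebook source marker."""
    count = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in (".py", ".sql"):
            first = path.read_text(errors="ignore").splitlines()[:1]
            if first and first[0].strip() in MARKERS:
                count += 1
    return count

# Demo on a throwaway directory: two notebook sources, one plain file.
root = Path(tempfile.mkdtemp())
(root / "nb1.py").write_text("## Databricks notebook source\nprint(1)\n")
(root / "nb2.sql").write_text("-- Databricks notebook source\nSELECT 1\n")
(root / "helper.py").write_text("print('not a notebook')\n")
```

If the count approaches 200, consider splitting the repository.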
No. Databricks recommends that you create a pull request and merge through your Git provider.
The contents of repos are encrypted by Databricks using a default key. Encryption using Customer managed keys for notebooks is not supported.
Try the following:
Confirm that the settings in the Git integration tab (User Settings > Git Integration) are correct.
- You must enter both your Git provider username and token. Legacy Git integrations did not require a username, so you may need to add a username to work with repos.
Confirm that you have selected the correct Git provider in the Add Repo dialog.
Ensure your personal access token or app password has the correct repo access.
If SSO is enabled on your Git provider, authorize your tokens for SSO.
Test your token with command line Git. Both of these options should work:
git clone https://<username>:<personal-access-token>@github.com/<org>/<repo-name>.git
git clone -c http.sslVerify=false -c http.extraHeader='Authorization: Bearer <personal-access-token>' https://agile.act.org/
<link>: Secure connection to <link> could not be established because of SSL problems
This error occurs if your Git server is not accessible from the Databricks control plane. Private Git servers are not supported.
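A quick way to rule out basic connectivity problems is a TCP reachability check against your Git server's HTTPS port, run from a network comparable to the one the control plane uses. This is only a rough sketch (it does not validate TLS certificates, which the error above concerns), and the hostname is a placeholder:

```python
import socket

def is_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Rough check that a Git server accepts TCP connections.
    A server behind a VPN or firewall will fail this check from
    outside that network, which matches the unsupported scenario."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical host): is_reachable("git.example.com")
```

If this returns False from a public network, the server is likely private and the Databricks control plane will not be able to reach it.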