Limits & FAQ for Git integration with Databricks Repos
Databricks Repos and Git integration have limits specified in the following sections. For general information, see Databricks limits.
File and repo size limits
Databricks doesn’t enforce a limit on the size of a repo. However:
Working branches are limited to 200 MB.
Individual files are limited to 200 MB.
Files larger than 10 MB can’t be viewed in the Databricks UI.
Databricks recommends that in a repo:
The total number of all files not exceed 10,000.
The total number of notebooks not exceed 5,000.
You might receive an error message if your repo exceeds these limits. You might also receive a timeout error when you clone the repo, but the operation might complete in the background.
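To check a local clone of a repo against these recommendations before pushing, here is a quick sketch using standard shell tools (counting only .ipynb files as notebooks is an assumption; notebooks may also be stored as source files such as .py):

```shell
# Count all files (excluding Git metadata) against the 10,000-file guideline
find . -type f -not -path './.git/*' | wc -l

# Count notebook files against the 5,000-notebook guideline
# (the .ipynb extension is an assumption; adjust for your notebook formats)
find . -type f -not -path './.git/*' -name '*.ipynb' | wc -l
```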
Where is Databricks repo content stored?
The contents of a repo are temporarily cloned onto disk in the control plane. Databricks notebook files are stored in the control plane database just like notebooks in the main workspace. Non-notebook files are stored on disk for up to 30 days.
Does Repos support on-premises or self-hosted Git servers?
Databricks Repos supports Bitbucket Server integration, if the server is internet accessible.
To integrate with a Bitbucket Server, GitHub Enterprise Server, or a GitLab self-managed subscription instance that is not internet-accessible, get in touch with your Databricks representative.
Does Repos support .gitignore files?
Yes. If you add a file to your repo and do not want Git to track it, create a .gitignore file (or use one cloned from your remote repository) and add the file name, including the extension.
.gitignore works only for files that are not already tracked by Git. If you add a file that is already tracked by Git to a .gitignore file, the file is still tracked by Git.
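For example, to stop tracking a file that is already committed, you can work in a local clone of the repository (the file name below is a placeholder):

```shell
# Add the file name (with extension) to .gitignore
echo "secrets.conf" >> .gitignore

# .gitignore alone does not untrack an already-tracked file;
# remove it from the index (the working copy is kept on disk)
git rm --cached secrets.conf

# Commit both changes so the file is no longer tracked
git add .gitignore
git commit -m "Stop tracking secrets.conf"
```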
Can I create top-level folders that are not user folders?
Yes, admins can create top-level folders to a single depth. Repos does not support additional folder levels.
Does Repos support Git submodules?
No. You can clone a repo that contains Git submodules, but the submodule is not cloned.
How can I disable Repos in my workspace?
Follow these steps to disable Repos in your workspace.
Go to the admin settings page.
Click the Workspace Settings tab.
In the Advanced section, click the Repos toggle.
Refresh your browser.
Why do notebook dashboards disappear when I pull or checkout a different branch?
This is currently a limitation because Databricks notebook source files do not store notebook dashboard information.
Can I pull in IPYNB notebook files?
This feature is in Public Preview.
Yes. Support for Jupyter notebooks (.ipynb files) is available in Repos. You can clone repositories containing .ipynb notebooks, work with them in the Databricks UI, and then commit and push them as .ipynb notebooks. Metadata such as the notebook dashboard is preserved. Admins can control whether notebook outputs can be committed.
You can also:
Create new .ipynb notebooks.
Convert notebooks to .ipynb file format.
View diffs as Code diff (code changes in cells) or Raw diff (code changes in JSON, including notebook outputs as metadata).
Does Repos support branch merging?
No. Databricks recommends that you create a pull request and merge through your Git provider.
Can I delete a branch from a Databricks repo?
No. To delete a branch, you must work in your Git provider.
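For example, from a local clone of the repository outside Databricks (the branch name my-feature is a placeholder):

```shell
# Delete the branch locally
git branch -d my-feature

# Delete the branch on the remote (e.g., GitHub)
git push origin --delete my-feature
```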
If a library is installed on a cluster, and a library with the same name is included in a folder within a repo, which library is imported?
The library in the repo is imported.
Can I pull the latest version of a repository from Git before running a job without relying on an external orchestration tool?
No. Typically, you can integrate this as a pre-commit hook on the Git server so that every push to a branch (such as main or prod) updates the production repo.
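One way such a server-side hook or job task can update the workspace copy is by calling the Repos API. A minimal sketch with curl, where DATABRICKS_HOST, DATABRICKS_TOKEN, and REPO_ID are placeholders for your workspace URL, a personal access token, and the repo's numeric ID:

```shell
# Update the Databricks repo to the latest commit on a branch
# (all three variables below are placeholder assumptions)
curl -sf -X PATCH "${DATABRICKS_HOST}/api/2.0/repos/${REPO_ID}" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"branch": "main"}'
```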
Can I export a repo?
You can export notebooks, folders, or an entire repo. You cannot export non-notebook files, and if you export an entire repo, non-notebook files are not included. To export, use the Workspace CLI or the Workspace API.
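A sketch using the Databricks CLI (the repo path and user name are placeholders; the subcommand name varies by CLI version, for example export_dir in the legacy CLI):

```shell
# Recursively export the notebooks in a repo to a local directory;
# non-notebook files are not included in the export
databricks workspace export_dir /Repos/someone@example.com/my-repo ./my-repo-export
```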
Security, authentication, and tokens
Are the contents of Databricks repos encrypted?
The contents of Databricks repos are encrypted by Databricks using a platform-managed key. Encryption using Customer-managed keys is not supported.
How and where are the GitHub tokens stored in Databricks? Who would have access from Databricks?
The authentication tokens are stored in the Databricks control plane, and a Databricks employee can only gain access through a temporary credential that is audited.
Databricks logs the creation and deletion of these tokens, but not their usage. Databricks has logging that tracks Git operations that can be used to audit the usage of the tokens by the Databricks application.
GitHub Enterprise audits token usage. Other Git services might also offer Git server auditing.
CI/CD and MLOps
Incoming changes clear the notebook state
Git operations that alter the notebook source code result in the loss of the notebook state, including cell outputs, comments, revision history, and widgets. For example, Git pull can change the source code of a notebook. In this case, Databricks Repos must overwrite the existing notebook to import the changes. Git commit and push or creating a new branch do not affect the notebook source code, so the notebook state is preserved in these operations.
Prevent data loss in MLflow experiments
MLflow experiment data in a notebook might be lost in this scenario: You rename the notebook and then, before calling any MLflow commands, change to a branch that doesn’t contain the notebook.
To prevent this situation, Databricks recommends you avoid renaming notebooks in repos.
Can I create an MLflow experiment in a repo?
No. You can only create an MLflow experiment in the workspace. Experiments created in a repo before the 3.72 platform release are no longer supported, though they may continue to work without guarantees. Databricks recommends exporting existing experiments in repos to workspace experiments using the MLflow export tool.
What happens if a job starts running on a notebook while a Git operation is in progress?
At any point while a Git operation is in progress, some notebooks in the repo might have been updated while others have not. This can cause unpredictable behavior.
For example, suppose notebook A calls notebook Z using a %run command. If a job running during a Git operation starts the most recent version of notebook A, but notebook Z has not yet been updated, the %run command in notebook A might start the older version of notebook Z. During the Git operation, the notebook states are not predictable, and the job might fail or run notebook A and notebook Z from different commits.
Non-notebook files: Files in Repos
Files in Repos enables you to store and work with non-notebook files in Databricks Repos.
In Databricks Runtime 10.1 and below, Files in Repos is not compatible with Spark Streaming. To use Spark Streaming on a cluster running Databricks Runtime 10.1 or below, you must disable Files in Repos on the cluster by setting the corresponding Spark configuration.
Only text-encoded files are rendered in the UI. To view files in Databricks, the files must not be larger than 10 MB.
You cannot create or edit a file from your notebook.
You can only export notebooks. You cannot export non-notebook files from a repo.
File operations in Scala not supported
For Files in Repos, file operations in Scala are not supported. You might see errors like: error: not found: value Try
How can I run non-Databricks notebook files in a repo? For example, a .py file?
You can use any of the following:
Bundle and deploy as a library on the cluster.
Pip install the Git repository directly. This requires a credential stored in the secrets manager.
Use %run with inline code in a notebook.
Use a custom container image. See Customize containers with Databricks Container Services.
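As a sketch of the pip-install option, the snippet below builds the install command for a private Git repository. All names are placeholder assumptions: in a Databricks notebook, the token would come from the secrets manager, for example dbutils.secrets.get(scope="git-creds", key="github-token").

```python
import sys

def build_pip_install_cmd(token, repo="github.com/my-org/my-package.git"):
    """Build a pip command that installs a package from a private Git repo.

    token: Git credential fetched from the secrets manager (placeholder).
    repo:  host/path of the repository (placeholder).
    """
    return [sys.executable, "-m", "pip", "install",
            f"git+https://{token}@{repo}"]

# In a notebook you would then run the command, for example:
#   import subprocess
#   subprocess.check_call(build_pip_install_cmd(token))
```

Embedding the token in the URL this way keeps it out of the notebook source, since it is resolved at run time from the secret scope.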