Databricks Repos and Git integration have limits specified in the following sections. For general information, see Databricks limits.
Databricks doesn’t enforce a limit on the size of a repo. However:
Working branches are limited to 200 MB.
Individual files are limited to 200 MB.
Files larger than 10 MB can’t be viewed in the Databricks UI.
Databricks recommends that in a repo:
The total number of all files not exceed 10,000.
The total number of notebooks not exceed 5,000.
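One way to check a repository against these numbers before cloning it into Databricks is to scan a local clone. The following is a minimal sketch, not a Databricks tool: `audit_repo` is a hypothetical helper, it assumes 1 MB means 1024 × 1024 bytes, and it counts only `.ipynb` files as notebooks, which may undercount notebooks stored in other formats.

```python
import os

# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_FILE_BYTES = 200 * 1024 * 1024   # per-file limit
UI_VIEW_BYTES = 10 * 1024 * 1024     # larger files can't be viewed in the UI
MAX_FILES = 10_000                   # recommended total file count
MAX_NOTEBOOKS = 5_000                # recommended notebook count
# Assumption: only .ipynb files are counted as notebooks in this sketch.
NOTEBOOK_EXTENSIONS = (".ipynb",)


def audit_repo(root):
    """Walk a local clone and summarize it against the limits above."""
    report = {"files": 0, "notebooks": 0, "too_large": [], "not_viewable": []}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, which may be broken
            size = os.path.getsize(path)
            report["files"] += 1
            if name.lower().endswith(NOTEBOOK_EXTENSIONS):
                report["notebooks"] += 1
            if size > MAX_FILE_BYTES:
                report["too_large"].append(path)
            elif size > UI_VIEW_BYTES:
                report["not_viewable"].append(path)
    report["over_file_count"] = report["files"] > MAX_FILES
    report["over_notebook_count"] = report["notebooks"] > MAX_NOTEBOOKS
    return report
```

Calling `audit_repo("/path/to/local/clone")` returns a dictionary whose `too_large` and `not_viewable` lists name the files worth moving out of the repo or excluding.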
For any Git operation, memory usage is limited to 2 GB, and disk writes are limited to 4 GB. Since the limit is per-operation, you get a failure if you attempt to clone a Git repo that is 5 GB in current size. However, if you clone a Git repo that is 3 GB in size in one operation and then add 2 GB to it later, the next pull operation will succeed.
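The per-operation rule can be illustrated with a few lines of arithmetic. This sketch assumes, for simplicity, that the disk writes of an operation roughly equal the amount of data it transfers; `operation_fits` is a hypothetical helper, not part of any Databricks API.

```python
DISK_WRITE_LIMIT_GB = 4  # per-operation disk write limit


def operation_fits(gb_written):
    """Return True if a single Git operation stays within the disk write limit."""
    return gb_written <= DISK_WRITE_LIMIT_GB


# Cloning a 5 GB repo writes about 5 GB in one operation.
clone_5gb = operation_fits(5)

# Cloning 3 GB, then later pulling 2 GB of new content, is two separate operations.
clone_3gb = operation_fits(3)
pull_2gb = operation_fits(2)

print(clone_5gb, clone_3gb, pull_2gb)  # False True True
```

Each operation is checked on its own, which is why a repo can grow past 4 GB over time even though a single 5 GB clone fails.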
You might receive an error message if your repo exceeds these limits. You might also receive a timeout error when you clone the repo, but the operation might complete in the background.
To work with a repo larger than the size limits, try sparse checkout.
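If the Git folder is created programmatically, sparse checkout can be requested at creation time. The sketch below only builds the request body; it assumes the create-repo endpoint of the Repos REST API (`POST /api/2.0/repos`) and its `sparse_checkout.patterns` field. The repository URL, workspace path, and patterns are hypothetical placeholders.

```python
import json


def build_sparse_clone_payload(git_url, provider, workspace_path, patterns):
    """Build the JSON body for a create-repo request that uses sparse checkout."""
    return {
        "url": git_url,
        "provider": provider,
        "path": workspace_path,
        # Only directories matching these cone patterns are checked out.
        "sparse_checkout": {"patterns": list(patterns)},
    }


payload = build_sparse_clone_payload(
    "https://github.com/example-org/example-repo.git",  # hypothetical repo
    "gitHub",
    "/Repos/someone@example.com/example-repo",           # hypothetical path
    ["src/pipelines", "conf"],                           # hypothetical patterns
)
print(json.dumps(payload, indent=2))
```

Limiting the checkout to the directories you need keeps the working tree, and therefore the data written per operation, smaller than a full clone.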
The contents of a repo are temporarily cloned onto disk in the control plane. Databricks notebook files are stored in the control plane database just like notebooks in the main workspace. Non-notebook files are stored on disk for up to 30 days.
Databricks Repos supports GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab Self-managed integration, if the server is internet accessible. For details on integrating Repos with an on-prem Git server, read Git Proxy Server for Repos.
To integrate with a Bitbucket Server, GitHub Enterprise Server, or a GitLab self-managed subscription instance that is not internet-accessible, get in touch with your Databricks account team.
Only certain Databricks asset types are supported by Repos. In this case, “supported” means “can be serialized, version-controlled, and pushed to the backing Git repo.”
Currently, the supported asset types are:
Files are serialized data, and can include anything from libraries to binaries to code to images.
Notebooks are specifically the notebook file formats supported by Databricks. Notebooks are considered a separate Databricks asset type from Files since they are not serialized. Repos determines a Notebook by the file extension. (For example,
A folder is a Databricks-specific structure that represents serialized information about a logical grouping of files in Git. As expected, the user experiences this as a “folder” when viewing a Databricks Repo or accessing it with the Databricks CLI.
Potential Databricks asset types that are not supported in Databricks Repos today but may be supported in the future include:
Dashboards (including Lakeview dashboards)
You can move existing unsupported assets into a Databricks Git folder, but cannot commit changes to these assets back to the repo. You cannot create new unsupported assets in a Databricks Git folder.
Yes. If you add a file to your repo and do not want it to be tracked by Git, create a `.gitignore` file or use one cloned from your remote repository and add the filename, including the extension.
`.gitignore` works only for files that are not already tracked by Git. If you add a file that is already tracked by Git to a `.gitignore` file, the file is still tracked by Git.
Yes, admins can create top-level folders to a single depth. Repos does not support additional folder levels.
This is currently a limitation because Databricks notebook source files don’t store notebook dashboard information.
If you want to preserve dashboards in the Git repository, change the notebook format to `.ipynb` (the Jupyter notebook format). By default, `.ipynb` supports dashboard and visualization definitions. If you want to preserve graph data (data points), you need to commit the notebook with outputs.
To learn about committing `.ipynb` notebook outputs, see Allow committing `.ipynb` notebook output.
Yes. Support for Jupyter notebooks (`.ipynb` files) is available in Repos. You can clone repositories with `.ipynb` notebooks, work in the Databricks UI, and then commit and push as `.ipynb` notebooks. Metadata such as the notebook dashboard is preserved. Admins can control whether outputs can be committed.
You can also:
Convert notebooks to the `.ipynb` file format. In the Databricks editor, go to the File menu and select Change notebook format.
View diffs as Code diff (code changes in cells) or Raw diff (code changes in JSON, including notebook outputs as metadata).
For more information on the kinds of notebooks supported in Databricks, see Export and import Databricks notebooks.
Yes. You can also create a pull request and merge through your Git provider.
No. To delete a branch, you must work in your Git provider.
If a library is installed on a cluster, and a library with the same name is included in a folder within a repo, which library is imported?
The library in the repo is imported. For more information about library precedence in Python, see Python library precedence.
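To confirm which copy Python would actually resolve, you can inspect where a module is found. This is a minimal sketch using only the standard library; `module_origin` is a hypothetical helper name, and `my_library` in the usage note stands in for your own module.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file path Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Earlier entries in sys.path take precedence over later ones.
for index, entry in enumerate(sys.path):
    print(index, entry)
```

Running `module_origin("my_library")` in a notebook inside the repo shows whether the resolved path points into the repo folder or into the cluster's installed packages.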
Can I pull the latest version of a repository from Git before running a job without relying on an external orchestration tool?
No. Typically you can integrate this as a pre-commit on the Git server so that every push to a branch (main/prod) updates the Production repo.
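A hook or CI step of this kind could call the Repos REST API to bring the Production repo up to date. The sketch below only builds the request and does not send it; it assumes the update endpoint `PATCH /api/2.0/repos/{repo_id}` with a `branch` field. The host, token, and repo ID are hypothetical placeholders.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch):
    """Build (but do not send) a request that updates a repo to the head of `branch`."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_repo_update_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "<personal-access-token>",               # placeholder, never hard-code real tokens
    123,                                     # hypothetical repo ID
    "main",
)
# To send it: urllib.request.urlopen(request)
```

Keeping the request construction separate from the send makes it easy to log or test the call in the hook before enabling it.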
When you try to clone a repo, you might get a “denied access” error message when:
Databricks is configured to use Azure DevOps with Microsoft Entra ID authentication.
You have enabled a conditional access policy in Azure DevOps and a Microsoft Entra ID conditional access policy.
To resolve this, add an exclusion to the conditional access policy (CAP) for the IP address or users of Databricks.
For more information, see Conditional access policies.
The contents of Databricks Repos are encrypted by Databricks using a default key. Encryption using customer-managed keys is not supported except when encrypting your Git credentials.
The authentication tokens are stored in the Databricks control plane, and a Databricks employee can only gain access through a temporary credential that is audited.
Databricks logs the creation and deletion of these tokens, but not their usage. Databricks has logging that tracks Git operations that can be used to audit the usage of the tokens by the Databricks application.
GitHub Enterprise audits token usage. Other Git services might also have Git server auditing.
Git operations that alter the notebook source code result in the loss of the notebook state, including cell outputs, comments, version history, and widgets. For example, `git pull` can change the source code of a notebook. In this case, Databricks Repos must overwrite the existing notebook to import the changes.
`git commit` and `push` or creating a new branch do not affect the notebook source code, so the notebook state is preserved in these operations.
No. You cannot create workspace MLflow experiments in a Git folder. You can log MLflow runs to notebook MLflow experiments but those runs will not be checked into source control. If multiple users use separate Git folders to collaborate on the same ML code, log MLflow runs to an MLflow experiment created in a regular Workspace folder.
At any point while a Git operation is in progress, some notebooks in the repo might have been updated while others have not. This can cause unpredictable behavior.
For example, suppose notebook A calls notebook Z using a `%run` command. If a job running during a Git operation starts the most recent version of notebook A, but notebook Z has not yet been updated, the `%run` command in notebook A might start the older version of notebook Z.
During the Git operation, the notebook states are not predictable and the job might fail or run notebook A and notebook Z from different commits.
To avoid this situation, use Git-based jobs (where the source is a Git provider and not a workspace path) instead. See _
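A Git-based job might be defined with settings like the following. This is a sketch under assumptions: it uses the `git_source` block and the `"source": "GIT"` notebook task setting of the Jobs API 2.1, it omits compute settings, and the job name, repository URL, and notebook path are hypothetical.

```python
def build_git_job_settings(name, git_url, provider, branch, notebook_path):
    """Build job settings whose notebook is read from a Git provider, not a workspace path."""
    return {
        "name": name,
        "git_source": {
            "git_url": git_url,
            "git_provider": provider,
            "git_branch": branch,
        },
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {
                    # Path is relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
                # Compute settings are omitted from this sketch.
            }
        ],
    }


settings = build_git_job_settings(
    "nightly-pipeline",                                  # hypothetical job name
    "https://github.com/example-org/example-repo.git",  # hypothetical repo
    "gitHub",
    "main",
    "notebooks/notebook_a",                              # hypothetical notebook path
)
```

Because the job reads its source from the Git provider, notebook A and any notebook it runs come from one checkout rather than from a workspace folder that may be mid-update.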
For details on Databricks workspace files, see What are workspace files?.