Limits and FAQ for Git integration with Databricks Git folders

Databricks Git folders and Git integration have limits specified in the following sections. For general information, see Databricks limits.

Jump to:

File and repo limits
FAQ: Git folder configuration

To learn about Databricks asset types supported in Git folders, see What asset types are supported by Git folders?.

File and repo limits

Databricks doesn't enforce a limit on the size of a repo. However:

Working branches are limited to 1 gigabyte (GB).
Files larger than 10 MB can't be viewed in the Databricks UI.
Individual workspace files are subject to a separate size limit. For more details, read Limitations.
The local version of a branch can remain present in the associated Git folder for up to 30 days after the remote branch is deleted. To completely remove a local branch in a Git folder, delete the repository.

Databricks recommends that in a repo:

The total number of workspace assets and files does not exceed 20,000.

For any Git operation, memory usage is limited to 2 GB and disk writes are limited to 4 GB. Because the limit is per operation, you get a failure if you attempt to clone a Git repo that is 5 GB in size. However, if you clone a Git repo that is 3 GB in size in one operation and then add 2 GB to it later, the next pull operation will succeed.

If your repo exceeds these limits, you might receive an error message. You might also receive a timeout error when cloning the repo, even though the operation might still complete in the background.

To work with a repo larger than the size limits, try sparse checkout.

If you must write temporary files that you do not want to keep after the cluster is shut down, writing the temporary files to $TEMPDIR avoids exceeding branch size limits and yields better performance than writing to a working directory (CWD) in the workspace filesystem. For more information, see Where should I write temporary files on Databricks?.

Recovering files deleted from Git folders in your workspace

Workspace actions on Git folders vary in file recoverability. Some actions allow recovery through the Trash folder, while others do not. Files previously committed and pushed to a remote branch can be restored using the Git commit history for the remote Git repository. This table outlines each action's behavior and recoverability:

Action	Is the file recoverable?
Delete file with workspace browser	Yes, from the Trash folder
Discard a new file with the Git folder dialog	Yes, from the Trash folder
Discard a modified file with the Git folder dialog	No, the file is gone
`reset` (hard) for uncommitted file modifications	No, file modifications are gone
`reset` (hard) for uncommitted, newly created files	No, file modifications are gone
Switch branches with the Git folder dialog	Yes, from the remote Git repo
Other Git operations, such as commit or push, from the Git folder dialog	Yes, from the remote Git repo
`PATCH` operations updating `/repos/id` from Repos API	Yes, from the remote Git repo

Monorepo support

Databricks recommends not creating Git folders backed by monorepos. A monorepo is a large, single-organization Git repository with thousands of files across many projects.

Frequently Asked Questions: Git folder configuration

Where is Databricks repo content stored?

The contents of a repo are temporarily cloned onto disk in the control plane. Databricks notebook files are stored in the control plane database just like notebooks in the main workspace. Non-notebook files are stored on disk for up to 30 days.

Do Git folders support on-premises or self-hosted Git servers?

Databricks Git folders supports GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab Self-managed integration, if the server is internet accessible. For details on integrating Git folders with an on-premises Git server, read Git Proxy Server for Git folders.

To integrate with a Bitbucket Server, GitHub Enterprise Server, or a GitLab self-managed subscription instance that is not internet-accessible, contact your Databricks account team.

What Databricks asset types are supported by Git folders?

For details on supported artifact types, see What asset types are supported by Git folders?.

Do Git folders support `.gitignore` files?

Yes. If you add a file to your repo and do not want it to be tracked by Git, create a .gitignore file or use one cloned from your remote repository and add the filename, including the extension.

.gitignore works only for files not already tracked by Git. If you add a file already tracked by Git to a .gitignore file, the file is still tracked by Git.

Do Git folders support Git submodules?

No. You can clone a repo that contains Git submodules, but the submodule is not cloned.

Source management

Why do notebook dashboards disappear when I pull or checkout a different branch?

This is a limitation because Databricks source format notebooks don't store notebook dashboard information.

To preserve dashboards in the Git repository, change the notebook format to .ipynb (the Jupyter notebook format). By default, .ipynb supports dashboard and visualization definitions. To preserve visualization data, you must commit the notebook with outputs.

To learn about committing .ipynb notebook outputs, see Manage IPYNB notebook output commits.

Do Git folders support branch merging?

Yes. You can also create a pull request and merge through your Git provider.

Can I delete a branch from a Databricks repo?

No. To delete a branch, you must work in your Git provider.

What's the precedence order when Python dependencies are included in Git folders?

Python libraries stored in a Git folder take precedence over libraries stored elsewhere. For example, If a library is installed on your Databricks compute, and a library with the same name is included in a Git folder, the library in the Git folder is imported. For more information about library precedence in Python, see Python library precedence.

Security, authentication, and tokens

Are the contents of Databricks Git folders encrypted?

The contents of Databricks Git folders are encrypted by Databricks using a default key. Encryption using customer-managed keys is not supported except when encrypting your Git credentials.

How and where are the GitHub tokens stored in Databricks? Who would have access from Databricks?

The authentication tokens are stored in the Databricks control plane, and a Databricks employee can only gain access through a temporary credential that is audited.
Databricks logs the creation and deletion of these tokens but not their usage. Databricks has logging that tracks Git operations that can be used to audit the usage of the tokens by the Databricks application.
GitHub enterprise audits token usage. Other Git services might also have Git server auditing.

Does Git folders support GPG signing of commits?

No.

Does Git folders support Git operations using SSH?

No, only the HTTPS protocol is supported.

CI/CD and MLOps

Incoming changes clear the notebook state

Git operations that alter the notebook source code result in the loss of the notebook state, including cell outputs, comments, version history, and widgets. For example, git pull can change the source code of a notebook. In this case, Databricks Git folders must overwrite the existing notebook to import the changes. git commit and push or creating a new branch does not affect the notebook source code, so the notebook state is preserved in these operations.

important

MLflow experiments don't work in Git folders with DBR 14.x or lower versions.

Can I create an MLflow experiment in a Git folder?

There are two types of MLflow experiments: workspace and notebook. For details on the two types of MLflow experiments, see Organize training runs with MLflow experiments.

Workspace experiments: You cannot create workspace MLflow experiments in Git folders. Instead, log MLflow runs to an MLflow experiment created in a regular workspace folder. If multiple users use separate Git folders to collaborate on the same code, log MLflow runs to an MLflow experiment created in a shared workspace folder.
Notebook experiments: You can create notebook experiments in a Databricks Git folder. If you check your notebook into source control as an .ipynb file, you can log MLflow runs to an automatically created and associated MLflow experiment. However, the experiment and the associated runs are not checked into source control. To learn more, see creating notebook experiments.

Prevent data loss in MLflow experiments

Notebook MLflow experiments created using Lakeflow Jobs with source code in a remote repository are stored in a temporary storage location. These experiments persist initially after workflow execution but are at risk of deletion later during scheduled removal of files in temporary storage. Databricks recommends using workspace MLflow experiments with Jobs and remote Git sources.

warning

Any time you switch to a branch that does not contain the notebook, you risk losing the associated MLflow experiment data. This loss becomes permanent if the prior branch is not accessed within 30 days.

To recover missing experiment data before the 30-day expiry, change the notebook name to the original name, open the notebook, and click on the right side pane, which triggers a call to the mlflow.get_experiment_by_name() function. When the function returns, you can see the recovered experiment and runs. After 30 days, any orphaned MLflow experiments will be purged to meet GDPR compliance policies.

To prevent this situation, Databricks recommends not renaming notebooks in a repo. If you do rename a notebook, click the “experiment” icon on the right side pane immediately after renaming the notebook.

What happens if a notebook job is running in a workspace while a Git operation is in progress?

At any point that a Git operation is in progress, some notebooks in the repo might have been updated while others have not. This can cause unpredictable behavior.

For example, suppose notebook A calls notebook Z using a %run command. If a job running during a Git operation starts the most recent version of notebook A, but notebook Z has not yet been updated, the %run command in notebook A might start the older version of notebook Z. During the Git operation, the notebook states are not predictable, and the job might fail or run notebook A and notebook Z from different commits.

To avoid this situation, configure your job tasks to use your Git provider as the source and not a workspace path. To learn more, see Use Git with jobs.

Resources

For details on Databricks workspace files, see What are workspace files?.

File and repo limits​

Recovering files deleted from Git folders in your workspace​

Monorepo support​

Frequently Asked Questions: Git folder configuration​

Where is Databricks repo content stored?​

Do Git folders support on-premises or self-hosted Git servers?​

What Databricks asset types are supported by Git folders?​

Do Git folders support .gitignore files?​

Do Git folders support Git submodules?​

Source management​

Why do notebook dashboards disappear when I pull or checkout a different branch?​

Do Git folders support branch merging?​

Can I delete a branch from a Databricks repo?​

What's the precedence order when Python dependencies are included in Git folders?​

Security, authentication, and tokens​

Are the contents of Databricks Git folders encrypted?​

How and where are the GitHub tokens stored in Databricks? Who would have access from Databricks?​

Does Git folders support GPG signing of commits?​

Does Git folders support Git operations using SSH?​

CI/CD and MLOps​

Incoming changes clear the notebook state​

Can I create an MLflow experiment in a Git folder?​

Prevent data loss in MLflow experiments​

What happens if a notebook job is running in a workspace while a Git operation is in progress?​

Resources​