Use Databricks Git folders in CI/CD
Learn techniques for using Databricks Git folders in CI/CD workflows. By configuring Databricks Git folders in the workspace, you can keep project files in Git repositories under source control and integrate them into your data engineering pipelines.
Most of the work in developing automation for Git folders is in the initial configuration for your folders and in understanding the Databricks Repos REST API you use to automate Git operations from Databricks jobs. Before you start building your automation and setting up folders, review the remote Git repositories you will incorporate into your automation flows and select the right ones for the different stages of your automation, including development, integration, staging, and production.
The following figure provides an overview that describes the basic flows for automation using Databricks Git folders. In it, there are production folders, user folders, and automation built using Databricks jobs that run notebook code when triggered.
- Admin workflow: For production flows, a Databricks workspace admin must set up top-level folders in your workspace. The admin clones a Git repository and branch when creating them, and should give these folders meaningful names such as "Production", "Test", or "Staging" which correspond to the remote Git repositories' purpose in your development flows. For more details, see Production Git folders in this topic.
- User workflow: A user can create a Git folder under /Workspace/Users/<email>/ based on a remote Git repository. The user must create a local, user-specific branch for the work they will commit and push to the remote repository. For information on collaborating in user-specific Git folders, see Collaborate using Git folders in this topic.
- Merge workflow: Users can create pull requests (PRs) from any folder, which can trigger Databricks jobs automation that calls the Databricks Repos API to test and process their changes. These changes can then be pushed to the "main" branches of your production Git folders, also by using the Databricks Repos API.
For a more comprehensive overview of CI/CD with Databricks, see CI/CD on Databricks.
Choose a Git folder configuration
There are two types of Databricks Git folders, differentiated by their usage pattern and locations in the workspace:
- User-level folders. When a user clones a remote repository in the Databricks UI to create a Git folder, the Git folder is created in their personal folder under /Workspace/Users/<email>/ by default. User-level folders are used for individual development. You can think of Databricks Git folders in user folders as “local checkouts” that are specific to each user and where users make and push changes to their code.
- Production folders. A production folder is created outside of user workspaces by a Databricks admin and hosts a production branch from the backing Git repository. It is primarily used for deployment automation and must only be updated when a PR is merged into the backing branch.
If you are not yet ready to adopt Databricks Asset Bundles but want your code to be source controlled, you can set up a production Git folder and use automation to keep it up to date with the remote Git repo and branch. This can be done in one of two ways (a minimal sketch follows this list):
- Use external CI/CD tools, such as GitHub Actions, to pull the latest commits into the production Git folder after a merge request is merged into the remote Git repo and branch.
- Create scheduled jobs to refresh the other Git folders in your workspace with the current state of the production Git folder.
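For example, here is a minimal sketch, assuming the Databricks SDK for Python (databricks-sdk) is available and using placeholder values for the Git folder ID and branch, of a scheduled task that refreshes a Git folder to the latest commit of its configured branch through the Repos API:

```python
# Minimal sketch: refresh a Git folder from its remote repository on a schedule.
# Assumes the Databricks SDK for Python (databricks-sdk) is installed and the job
# runs with credentials that are allowed to edit the Git folder.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up workspace authentication when run as a Databricks job

PRODUCTION_REPO_ID = 123456789  # placeholder: ID of the Git folder (repo) to refresh
PRODUCTION_BRANCH = "main"      # placeholder: branch designated for production

# Updating the Git folder to a branch checks out the latest commit on that branch.
w.repos.update(repo_id=PRODUCTION_REPO_ID, branch=PRODUCTION_BRANCH)
print(f"Git folder {PRODUCTION_REPO_ID} updated to the latest {PRODUCTION_BRANCH} commit")
```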
Databricks Asset Bundles, which allow you to define resources such as jobs and pipelines in source files, can be created, deployed, and managed in a Git folder in the workspace. See Collaborate on bundles in the workspace.
Collaborate using Git folders
You can easily collaborate with others using Git folders, pulling updates and pushing changes directly from the Databricks UI. For example, use a feature or development branch to aggregate changes made across multiple contributor branches.
The following flow describes how to collaborate using a feature branch named feature-b based on the main branch.
- Clone your existing Git repository to your Databricks workspace.
- Use the Git folders UI to create a feature branch from the main branch. This example uses a single feature branch, feature-b, for simplicity. You can create and use multiple feature branches to do your work.
- Make your modifications to Databricks notebooks and other files in the repo.
- Commit and push your changes to the remote Git repository.
- Contributors can now clone the Git repository into their own user folder.
- Working on a new branch, a contributor makes changes to the notebooks and other files in the Git folder.
- The contributor commits and pushes their changes to the remote Git repository.
- To merge changes from other branches or rebase the feature-b branch in the Git folders UI in Databricks, use one of the following flows:
  - Merge branches. If there's no conflict, the merge is pushed to the remote Git repository using git push.
  - Rebase on another branch.
- When you are ready to merge your work to the remote Git repository and main branch, use the Git folders UI to merge the changes from feature-b. If you prefer, you can instead merge changes directly into the Git repository backing your Git folder.
Databricks recommends that each developer works on their own feature branch. For information about how to resolve merge conflicts, see Resolve merge conflicts.
Production Git folders
A production folder is a Git folder designated for production and deployment. You can create it in a user folder or as a team folder under /Workspace, for example, /Workspace/Production. Databricks recommends creating a production-specific Git folder.
Production folders are created from remote repositories with branches that serve to aggregate changes or represent different stages in your build and deployment automation, and they have permissions that restrict users from committing to the content directly. For production Git folders, limit user access to read-only and allow only admins and Databricks service principals to edit their contents. The remote Git repository and branch that back a production folder should be designated specifically for production-ready code and assets.
To create a production Git folder:
- Choose a Git repository and branch that you use specifically for production and deployment of your code and assets. Configure service principal authentication for the repository and limit Git user privileges so the sources cannot be easily altered by users outside of your organization.
- Create a Databricks Git folder for the Git repository and branch under Workspace or a subfolder dedicated to a project or team. Give it an easily identifiable name such as "Production" or "Deployment".
- Select Share after selecting the folder, or right-click the folder in the Workspace tree and select Share (Permissions). Configure the Git folder with the following permissions (a scripted sketch of these grants follows the steps):
  - Set Can Run for any project users that must run notebooks or other code from that folder.
  - Set Can Run for any Databricks service principal accounts that will run automation against it.
  - If appropriate for your project, set Can View for all users in the workspace to encourage discovery and sharing.
- Select Add.
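The same permission grants can be scripted. The following is a minimal sketch that uses the Databricks Permissions API; the workspace URL, token, Git folder ID, and service principal application ID are placeholders, and the permission level names assume the REST API values CAN_RUN and CAN_READ, which correspond to Can Run and Can View in the UI:

```python
# Minimal sketch: grant Git folder permissions with the Databricks Permissions API.
# Placeholders: workspace URL, token, Git folder (repo) ID, and the service
# principal's application ID.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "<databricks-personal-access-token>"              # placeholder admin token
REPO_ID = "123456789"                                      # placeholder Git folder ID

acl = {
    "access_control_list": [
        # Service principal that runs automation against the folder.
        {"service_principal_name": "<sp-application-id>", "permission_level": "CAN_RUN"},
        # Optionally, let all workspace users view the folder ("users" is the built-in group).
        {"group_name": "users", "permission_level": "CAN_READ"},
    ]
}

# PATCH adds or updates these entries without replacing the folder's existing ACL.
resp = requests.patch(
    f"{HOST}/api/2.0/permissions/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=acl,
)
resp.raise_for_status()
print(resp.json())
```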
Production job workflow
Databricks Git folders provides two options for running your production jobs:
- Option 1: Provide a remote Git reference in the job definition. For example, run a specific notebook in the main branch of a Git repository.
- Option 2: Set up a production Git folder and call the Repos API to update it programmatically. Run jobs against the Databricks Git folder that clones this remote repository. The Repos API call should be the first task in the job.
Option 1: Run jobs using notebooks in a remote repository
Simplify the job definition process and keep a single source of truth by running a Databricks job using notebooks located in a remote Git repository. This Git reference can be a Git commit, tag, or branch and is provided by you in the job definition.
This helps prevent unintentional changes to your production job, such as when a user makes local edits in a production repository or switches branches. It also automates the CD step as you do not need to create a separate production Git folder in Databricks, manage permissions for it, and keep it updated.
See Use Git with jobs.
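As an illustration, the following is a minimal sketch of creating such a job through the Jobs API; the workspace URL, token, repository URL, notebook path, and cluster ID are placeholders:

```python
# Minimal sketch: define a job whose notebook source is a remote Git repository.
# Placeholders: workspace URL, token, repository URL, notebook path, and cluster ID.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-personal-access-token>"

job_spec = {
    "name": "nightly-etl-from-git",
    # The job checks out this Git reference at run time; no Git folder is required.
    "git_source": {
        "git_url": "https://github.com/<org>/<repo>",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_etl_notebook",
            "notebook_task": {
                "notebook_path": "notebooks/etl",  # path relative to the repository root
                "source": "GIT",
            },
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```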
Option 2: Set up a production Git folder and Git automation
In this option, you set up a production Git folder and automation to update the Git folder on merge.
Step 1: Set up top-level folders
First, have a Databricks administrator create top-level folders in your workspace to contain individual Git folders for your development, staging, and production branches. For example, if your company uses the main branch for production, the production Git folder must be configured to use the main branch.
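If the administrator prefers to script this setup, the following is a minimal sketch that uses the Repos API to clone the repository into a production folder and check out the production branch; the workspace URL, token, repository URL, and target path are placeholders, and it assumes the workspace allows creating Git folders at that path:

```python
# Minimal sketch: create a production Git folder with the Repos API, then check out
# the production branch. Placeholders: workspace URL, token, repository URL, and path.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Clone the remote repository into a top-level production folder.
create = requests.post(
    f"{HOST}/api/2.0/repos",
    headers=HEADERS,
    json={
        "url": "https://github.com/<org>/<repo>",
        "provider": "gitHub",
        "path": "/Workspace/Production/<repo>",  # placeholder target path
    },
)
create.raise_for_status()
repo_id = create.json()["id"]

# Check out the branch designated for production.
update = requests.patch(
    f"{HOST}/api/2.0/repos/{repo_id}",
    headers=HEADERS,
    json={"branch": "main"},
)
update.raise_for_status()
print("Production Git folder ready:", repo_id)
```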
Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace. For these top-level folders, Databricks recommends granting Can Edit and Can Manage permissions only to service principals, to avoid accidental edits to your production code by workspace users.
Step 2: Set up automated updates to Databricks Git folders using the API
To keep a Git folder in Databricks up to date with the latest version of your source files and assets, you can set up Git automation to call the Repos API. Using your Git provider's CI/CD tools (or compatible 3rd-party ones), set up automation that calls the Repos API endpoint on the appropriate Git folder to update it to the latest version of your sources after every successful merge of a PR into the main branch. For example, you can use GitHub Actions if GitHub is your provider.
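For example, the following is a minimal sketch of the call such automation could make after a PR merges to main; it looks up the production Git folder by its workspace path and updates it to the latest commit, with the workspace URL, token, and path as placeholders:

```python
# Minimal sketch: after a PR merges to main, update the production Git folder to the
# latest commit. Intended to run from a CI step (for example, a GitHub Actions job).
# Placeholders: workspace URL, token, and the production Git folder's workspace path.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-ci-token>"   # token for a service principal with Can Edit on the folder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
PRODUCTION_PATH = "/Workspace/Production/<repo>"  # placeholder path

# Look up the Git folder (repo) ID by its workspace path.
repos = requests.get(
    f"{HOST}/api/2.0/repos",
    headers=HEADERS,
    params={"path_prefix": PRODUCTION_PATH},
)
repos.raise_for_status()
repo_id = repos.json()["repos"][0]["id"]

# Update the folder to the latest commit on the production branch.
update = requests.patch(
    f"{HOST}/api/2.0/repos/{repo_id}",
    headers=HEADERS,
    json={"branch": "main"},
)
update.raise_for_status()
print("Production Git folder updated to the latest main commit:", repo_id)
```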
For more information about the Databricks Repos API, see the Databricks REST API documentation for Repos.