Learn techniques for using Databricks Repos in CI/CD workflows. Integrating Git repos with Databricks Repos provides source control for project files.
The following figure shows an overview of the techniques and workflow.
For an overview of CI/CD with Databricks, see What is CI/CD on Databricks?
Databricks Repos have user-level folders and non-user top level folders. User-level folders are automatically created when users first clone a remote repository. You can think of Databricks Repos in user folders as “local checkouts” that are individual for each user and where users make changes to their code.
In your user folder in Databricks Repos, clone your remote repository. A best practice is to create a new feature branch or select a previously created branch for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, you can do so in the Repos UI.
This workflow requires that you have already set up your Git integration.
Databricks recommends that each developer works on their own feature branch. For information about how to resolve merge conflicts, see Resolve merge conflicts.
The following workflow uses a branch called feature-b that is based on the main branch.
Use the Repos UI to create a feature branch from the main branch. This example uses a single feature branch feature-b for simplicity. You can create and use multiple feature branches to do your work.
Make your modifications to Databricks notebooks and other files in the repo.
Coworkers can now clone the Git repository into their own user folder.
Working on a new branch, a coworker makes changes to the notebooks and other files in the repo.
The coworker commits and pushes their changes to the Git provider.
To merge changes from other branches or rebase the feature-b branch in Databricks, in the Repos UI use one of the following workflows:
When you are ready to merge your work to the remote repo and main branch, use the Repos UI to merge the changes from feature-b. If you prefer, you can instead merge changes in your Git provider.
Databricks Repos provides two options for running your production jobs:
Option 1: Provide a remote Git ref in the job definition, for example, a specific notebook in the main branch of a GitHub repository.
Option 2: Set up a production repo and use Repos APIs to update it programmatically. Then run jobs against this Databricks repo.
Simplify the job definition process and keep a single source of truth by running a Databricks job using notebooks located in a remote Git repository. This Git reference can be a Git commit, tag, or branch and is provided by you in the job definition.
This helps prevent unintentional changes to your production job, for example, when a user makes local edits in a production repo or switches branches. It also automates the CD step as you do not need to create a separate production repo in Databricks, manage permissions for it, and keep it updated.
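As an illustration of Option 1, the following is a minimal sketch of job settings that point a notebook task at a remote Git ref. It assumes the Jobs API 2.1 `git_source` block and a notebook task with `source` set to `GIT`. The job name, repository URL, notebook path, and cluster ID are hypothetical placeholders, and the exact fields should be checked against the Jobs API reference for your workspace.

```python
import json


def build_git_job_settings(git_url, notebook_path, git_branch="main"):
    """Build job settings that run a notebook straight from a remote Git ref.

    Assumes the Jobs API 2.1 payload shape. All values are illustrative.
    """
    return {
        "name": "example-job-from-git",  # hypothetical job name
        "git_source": {
            "git_url": git_url,
            "git_provider": "gitHub",
            # A branch is used here; a tag or commit could be given instead
            # (assumed field names: git_tag, git_commit).
            "git_branch": git_branch,
        },
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {
                    # Path relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
                "existing_cluster_id": "<cluster-id>",  # placeholder
            }
        ],
    }


if __name__ == "__main__":
    settings = build_git_job_settings(
        "https://github.com/example-org/example-repo",  # hypothetical repo
        "notebooks/etl",  # hypothetical notebook path
    )
    print(json.dumps(settings, indent=2))
```

The resulting dictionary would be sent as the body of a job create request. Because the ref lives in the job definition, no production repo has to be kept in sync.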
In this option, you set up a production repo and Git automation to update Databricks Repos on merge.
The admin creates non-user top-level folders. The most common use case for these top-level folders is to create development, staging, and production folders that contain Databricks Repos for the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the production folder would contain a repo that is checked out to the main branch.
Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace. For such top-level folders, Databricks recommends granting Can Edit and Can Manage permissions only to service principals, to avoid accidental edits to your production code by workspace users.
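The production repo setup described above could be scripted roughly as follows. This is a sketch under assumptions: it builds a Repos API create request (`POST /api/2.0/repos`) and a Permissions API request that grants a service principal access to the repo. The folder `/Repos/Production`, the repository URL, the repo ID, and the application ID are hypothetical, and the top-level folder is assumed to exist already. The functions only build the requests and do not send them.

```python
def build_create_repo_request(git_url, provider, repo_path):
    """Describe a request that clones a remote repository into a workspace path."""
    return {
        "method": "POST",
        "path": "/api/2.0/repos",
        "body": {"url": git_url, "provider": provider, "path": repo_path},
    }


def build_grant_request(repo_id, application_id, permission_level="CAN_MANAGE"):
    """Describe a request that grants a service principal a permission on a repo.

    PATCH is used on the assumption that it adds to existing permissions
    rather than replacing them.
    """
    return {
        "method": "PATCH",
        "path": f"/api/2.0/permissions/repos/{repo_id}",
        "body": {
            "access_control_list": [
                {
                    "service_principal_name": application_id,
                    "permission_level": permission_level,
                }
            ]
        },
    }


if __name__ == "__main__":
    create = build_create_repo_request(
        "https://github.com/example-org/example-repo",  # hypothetical repo
        "gitHub",
        "/Repos/Production/example-repo",  # hypothetical production folder
    )
    # 123 and the application ID below are placeholders.
    grant = build_grant_request(123, "<service-principal-application-id>")
    print(create)
    print(grant)
```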
In this step, use the Repos API to set up automation to update Databricks Repos upon a merge event.
To keep a repo in Databricks at the latest version, you can set up Git automation to call the Repos API. In your Git provider, set up automation that—after every successful merge of a PR into the main branch—calls the Repos API endpoint on the appropriate repo in the Production folder to pull the changes and update that repo to the latest version.
For example, on GitHub this can be achieved with GitHub Actions.
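The call that such automation makes can be sketched in Python as below. It assumes the Repos API update endpoint (`PATCH /api/2.0/repos/{repo_id}`) with a `branch` field in the body, which checks out the latest commit of that branch. The workspace URL, token, and repo ID are placeholders that a CI system would typically supply as secrets.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch="main"):
    """Build the HTTP request that updates a Databricks repo to the head of a branch."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def update_repo(host, token, repo_id, branch="main"):
    """Send the update request and return the parsed JSON response."""
    request = build_repo_update_request(host, token, repo_id, branch)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholders only: building the request does not contact the workspace.
    req = build_repo_update_request(
        "https://<workspace-url>", "<databricks-token>", "<repo-id>"
    )
    print(req.get_method(), req.full_url)
```

A CI job triggered by a merge into the main branch would call `update_repo` with the ID of the repo in the Production folder.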
To run the above-mentioned workflows with service principals:
Create a service principal with Databricks.
Add the Git credentials: your Git provider PAT for the service principal.
To set up service principals and then add Git provider credentials:
Create a Databricks service principal in your workspace with the Service Principals API.
Create a Databricks access token for a Databricks service principal with the Token management API.
Add your Git provider credentials to your workspace with your Databricks access token and the Git Credentials API.
To call these three APIs, you can use tools such as curl, Postman, or Terraform. You cannot use the Databricks user interface.
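The same three requests can be issued from any HTTP client. The sketch below shows their approximate shape in Python. The endpoint paths and field names are assumptions based on the SCIM Service Principals API, the Token management API (on-behalf-of tokens), and the Git Credentials API, and they may differ by cloud or API version. The display name, token lifetime, and credential values are hypothetical.

```python
import json
import urllib.request


def service_principal_request(display_name):
    """Step 1: create a service principal (sent with an admin token)."""
    return ("POST", "/api/2.0/preview/scim/v2/ServicePrincipals", {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "displayName": display_name,
    })


def obo_token_request(application_id, lifetime_seconds=3600):
    """Step 2: create an access token on behalf of the service principal."""
    return ("POST", "/api/2.0/token-management/on-behalf-of/tokens", {
        "application_id": application_id,
        "lifetime_seconds": lifetime_seconds,
        "comment": "token for repo automation",  # illustrative comment
    })


def git_credentials_request(git_provider, git_username, personal_access_token):
    """Step 3: store the Git provider PAT (sent with the token from step 2)."""
    return ("POST", "/api/2.0/git-credentials", {
        "git_provider": git_provider,
        "git_username": git_username,
        "personal_access_token": personal_access_token,
    })


def send(host, token, spec):
    """Send one (method, path, body) spec to the workspace and parse the JSON reply."""
    method, path, body = spec
    request = urllib.request.Request(
        f"{host.rstrip('/')}{path}",
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the request shapes only; nothing is sent to a workspace here.
    for spec in (
        service_principal_request("example-cicd-principal"),
        obo_token_request("<application-id>"),
        git_credentials_request("gitHub", "<git-username>", "<git-pat>"),
    ):
        print(spec[0], spec[1], json.dumps(spec[2]))
```

The order matters: the token created in step 2 authenticates step 3, so the Git credentials are stored for the service principal and not for the admin who created it.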