CI/CD techniques with Git and Databricks Repos

Learn techniques for using Databricks Repos in CI/CD workflows. Integrating Git repos with Databricks Repos provides source control for project files.

The following figure shows an overview of the techniques and workflow.

Figure: Overview of CI/CD techniques for Repos.

For an overview of CI/CD with Databricks, see "What is CI/CD on Databricks?"

Development flow

Databricks Repos have user-level folders and non-user top-level folders. User-level folders are created automatically when users first clone a remote repository. You can think of Databricks Repos in user folders as "local checkouts" that are individual to each user and where users make changes to their code.

In your user folder in Databricks Repos, clone your remote repository. A best practice is to create a new feature branch or select a previously created branch for your work, instead of committing and pushing changes directly to the main branch. You can then make, commit, and push changes in that branch. When you are ready to merge your code, you can do so in the Repos UI.

Requirements

This workflow requires that you have already set up your Git integration.

Note

Databricks recommends that each developer work on their own feature branch. For information about how to resolve merge conflicts, see Resolve merge conflicts.

Collaborate in Repos

The following workflow uses a branch called feature-b that is based on the main branch.

  1. Clone your existing Git repository to your Databricks workspace.

  2. Use the Repos UI to create a feature branch from the main branch. This example uses a single feature branch feature-b for simplicity. You can create and use multiple feature branches to do your work.

  3. Make your modifications to Databricks notebooks and other files in the repo.

  4. Commit and push your changes to your Git provider.

  5. Coworkers can now clone the Git repository into their own user folder.

    1. Working on a new branch, a coworker makes changes to the notebooks and other files in the repo.

    2. The coworker commits and pushes their changes to the Git provider.

  6. To merge changes from other branches or rebase the feature-b branch in Databricks, in the Repos UI use one of the following workflows:

  7. When you are ready to merge your work to the remote repo and main branch, use the Repos UI to merge the changes from feature-b. If you prefer, you can instead merge changes in your Git provider.

Production job workflow

Databricks Repos provides two options for running your production jobs:

  • Option 1: Provide a remote Git ref in the job definition, for example, a specific notebook in the main branch of a GitHub repository.

  • Option 2: Set up a production repo and use Repos APIs to update it programmatically. Then run jobs against this Databricks repo.

Option 1: Run jobs using notebooks in a remote repo

Simplify the job definition process and keep a single source of truth by running a Databricks job using notebooks located in a remote Git repository. You provide the Git reference in the job definition; it can be a commit, tag, or branch.

This helps prevent unintentional changes to your production job, for example, when a user makes local edits in a production repo or switches branches. It also automates the CD step, because you do not need to create a separate production repo in Databricks, manage permissions for it, and keep it updated.

See Use version controlled source code in a Databricks job.
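As a rough illustration of Option 1, the sketch below builds a job definition that points at a remote Git ref. It assumes the Jobs API 2.1 `git_source` block and a notebook task with `source` set to `GIT`; the helper name `build_git_job_settings`, the job name, the notebook path, and the cluster ID are hypothetical placeholders, not values from this article.

```python
import json


def build_git_job_settings(job_name, git_url, git_provider, notebook_path,
                           cluster_id, branch=None, tag=None, commit=None):
    """Build a job definition whose notebook is read from a remote Git ref.

    Hypothetical helper: it only assembles the payload and sends nothing.
    """
    refs = {"git_branch": branch, "git_tag": tag, "git_commit": commit}
    chosen = {key: value for key, value in refs.items() if value is not None}
    if len(chosen) != 1:
        # The job should be pinned to exactly one kind of Git reference.
        raise ValueError("Provide exactly one of branch, tag, or commit")

    return {
        "name": job_name,
        "git_source": {
            "git_url": git_url,
            "git_provider": git_provider,
            **chosen,
        },
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {
                    # Path relative to the repository root (assumed layout).
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
                # Placeholder cluster ID; a new_cluster spec could be used instead.
                "existing_cluster_id": cluster_id,
            }
        ],
    }


# Example: pin a job to the main branch of the demo repository.
settings = build_git_job_settings(
    job_name="nightly-etl",
    git_url="https://github.com/user/demo.git",
    git_provider="gitHub",
    notebook_path="notebooks/etl",
    cluster_id="0000-000000-example",
    branch="main",
)
print(json.dumps(settings, indent=2))
```

Pinning the job to a tag or commit instead of a branch trades automatic updates for a fixed, reproducible version of the code.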

Option 2: Set up a production repo and Git automation

In this option, you set up a production repo and Git automation to update Databricks Repos on merge.

Step 1: Set up top-level folders

The admin creates non-user top-level folders. The most common use case for these top-level folders is to create development, staging, and production folders that contain Databricks Repos for the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the production folder would contain a repo that is checked out to the main branch.

Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace. For such top-level folders, Databricks recommends granting Can Edit and Can Manage permissions only to service principals, to avoid accidental edits to your production code by workspace users.

Figure: Top-level repo folders.
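The permission recommendation above could be applied programmatically. The following is a minimal sketch that assumes the workspace Permissions API accepts `PATCH /api/2.0/permissions/repos/{repo_id}` with an `access_control_list`; the helper name, host, token, repo ID, and application ID are all hypothetical placeholders. Verify the endpoint and permission levels against the API reference for your workspace before relying on it.

```python
import json
import urllib.request


def build_repo_permissions_request(host, token, repo_id, application_id,
                                   permission_level="CAN_MANAGE"):
    """Build (but do not send) a request granting a service principal access to a repo.

    Hypothetical helper; PATCH is assumed to add to the existing permissions
    rather than replace them.
    """
    url = f"{host.rstrip('/')}/api/2.0/permissions/repos/{repo_id}"
    body = {
        "access_control_list": [
            {
                # Service principals are identified by their application ID.
                "service_principal_name": application_id,
                "permission_level": permission_level,
            }
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect or log before an admin runs it against a production workspace.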

Step 2: Set up automated updates to Databricks Repos via the Repos API

In this step, use the Repos API to set up automation to update Databricks Repos upon a merge event.

To keep a repo in Databricks at the latest version, you can set up Git automation to call the Repos API. In your Git provider, set up automation that runs after every successful merge of a PR into the main branch and calls the Repos API endpoint on the appropriate repo in the Production folder to pull the changes and update that repo to the latest version.

For example, on GitHub this can be achieved with GitHub Actions.
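The call that such automation makes might look like the following minimal sketch, written in Python with only the standard library. It assumes the Repos API update endpoint `PATCH /api/2.0/repos/{repo_id}` with a `branch` field in the body. The function names and the environment variable names `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, and `DATABRICKS_REPO_ID` are illustrative assumptions, not part of this article.

```python
import json
import os
import urllib.request


def build_repo_update_request(host, token, repo_id, branch="main"):
    """Build a request that checks out the latest commit of a branch in a repo."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def update_production_repo(branch="main"):
    """Read connection settings from the environment and update the repo.

    Intended to be called from a CI step after a merge; the variable names
    are assumptions made for this sketch.
    """
    request = build_repo_update_request(
        host=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
        repo_id=os.environ["DATABRICKS_REPO_ID"],
        branch=branch,
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx, which fails the CI step.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In a CI pipeline, the token would typically belong to a service principal and be stored as a secret, so that the update does not depend on an individual user's credentials.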

Use a service principal with Databricks Repos

To run the above-mentioned workflows with service principals:

  1. Create a service principal in Databricks.

  2. Add Git credentials for the service principal: your Git provider personal access token (PAT).

To set up service principals and then add Git provider credentials:

  1. Create a Databricks service principal in your workspace with the Service Principals API.

  2. Create a Databricks access token for the service principal with the Token management API.

  3. Add your Git provider credentials to your workspace with your Databricks access token and the Git Credentials API.

To call these three APIs, you can use tools such as curl, Postman, or Terraform. You cannot use the Databricks user interface.
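As an alternative to those tools, the three calls can be sketched in Python. The paths and field names below reflect one reading of the Service Principals (SCIM), Token management, and Git Credentials APIs and should be checked against the current API reference; all function names, the host, and the sample values are hypothetical. The sketch only builds the requests, so nothing is sent until you pass a request to `urllib.request.urlopen`.

```python
import json
import urllib.request

SCIM_SP_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"


def create_service_principal_call(display_name):
    """Step 1: create the service principal (run with an admin token)."""
    return ("/api/2.0/preview/scim/v2/ServicePrincipals",
            {"schemas": [SCIM_SP_SCHEMA], "displayName": display_name})


def create_obo_token_call(application_id, lifetime_seconds, comment):
    """Step 2: create an access token on behalf of the service principal."""
    return ("/api/2.0/token-management/on-behalf-of/tokens",
            {"application_id": application_id,
             "lifetime_seconds": lifetime_seconds,
             "comment": comment})


def add_git_credential_call(git_provider, git_username, personal_access_token):
    """Step 3: register the Git provider PAT.

    Assumption: this call is authenticated with the token from step 2 so the
    credential is stored for the service principal, not for the admin.
    """
    return ("/api/2.0/git-credentials",
            {"git_provider": git_provider,
             "git_username": git_username,
             "personal_access_token": personal_access_token})


def to_request(host, token, call):
    """Turn a (path, body) pair into a POST request with bearer authentication."""
    path, body = call
    return urllib.request.Request(
        f"{host.rstrip('/')}{path}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Each step depends on the response of the previous one: the application ID returned in step 1 feeds step 2, and the token value returned in step 2 authenticates step 3.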

To learn more about service principals on Databricks, see Manage service principals. For information about service principals and CI/CD, see Service principals for CI/CD.

Terraform integration

You can also manage Databricks Repos in a fully automated setup using Terraform and databricks_repo:

# Check out the remote Git repository as a Databricks repo.
resource "databricks_repo" "this" {
  url = "https://github.com/user/demo.git"
}