CI/CD workflows with Git integration and Databricks Repos

Learn best practices for using Databricks Repos in a CI/CD workflow. Integrating Git repos with Databricks Repos provides source control for project files.

The following figure shows an overview of the steps.

Best practices overview for Repos.

Development flow

Databricks Repos have user-level folders and non-user top-level folders. User-level folders are automatically created when users first clone a remote repository. You can think of Databricks Repos in user folders as “local checkouts” that are individual for each user and where users make changes to their code.

In your user folder in Databricks Repos, clone your remote repository. A best practice is to create a new feature branch or select a previously created branch for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, create a pull request and follow the review and merge processes in your Git provider.

Here is an example workflow.

Requirements

This workflow requires that you have already set up your Git integration.

Note

Databricks recommends that each developer work on their own feature branch. Sharing feature branches among developers can cause merge conflicts, which must be resolved using your Git provider. For information about how to resolve merge conflicts, see Resolve merge conflicts.

Development workflow with Repos

  1. Clone your existing Git repository to your Databricks workspace.

  2. Use the Repos UI to create a feature branch from the main branch. This example uses a single feature branch feature-b for simplicity. You can create and use multiple feature branches to do your work.

  3. Make your modifications to Databricks notebooks and other files in the Repo.

  4. Commit and push your changes to your Git provider.

  5. Coworkers can now clone the Git repository into their own user folder.

    1. Working on a new branch, a coworker makes changes to the notebooks and other files in the Repo.

    2. The coworker commits and pushes their changes to the Git provider.

  6. To merge changes from other branches or rebase the feature branch, you must use the Git command line or an IDE on your local system. Then, in the Repos UI, use the Git dialog to pull changes into the feature-b branch in the Databricks Repo.

  7. When you are ready to merge your work to the main branch, use your Git provider to create a PR to merge the changes from feature-b.

  8. In the Repos UI, pull changes to the main branch.
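Step 6 happens outside Databricks, on your local system. The following is a minimal sketch of the Git commands involved in rebasing feature-b onto the main branch. It assumes a remote named origin, and the helper name rebase_commands is hypothetical. The function only builds the command lines so that you can review them before running anything.

```python
import shlex
from typing import List


def rebase_commands(feature_branch: str, base_branch: str = "main",
                    remote: str = "origin") -> List[List[str]]:
    """Return the git commands that rebase a feature branch onto the remote base branch.

    Hypothetical helper: the remote name and branch names are assumptions.
    """
    return [
        # Download the latest commits for all branches from the remote.
        ["git", "fetch", remote],
        # Switch to the feature branch in the local checkout.
        ["git", "checkout", feature_branch],
        # Replay the feature branch commits on top of the remote base branch.
        ["git", "rebase", f"{remote}/{base_branch}"],
        # A rebase rewrites history, so the push must be forced.
        # --force-with-lease refuses to overwrite commits you have not fetched.
        ["git", "push", "--force-with-lease", remote, feature_branch],
    ]


# Print the commands as copy-pasteable shell lines.
for command in rebase_commands("feature-b"):
    print(shlex.join(command))
```

After the push completes, use the Git dialog in the Repos UI to pull the rebased feature-b branch, as described in step 6.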

Production job workflow

Databricks Repos provides two options for running your production jobs:

  • Option 1: Provide a remote Git ref in the job definition, for example, a specific notebook in the main branch of a GitHub repository.

  • Option 2: Set up a production repo and use Repos APIs to update it programmatically. Then run jobs against this Databricks repo.

Option 1: Run jobs using notebooks in a remote repo

Simplify the job definition process and keep a single source of truth by running a Databricks job using notebooks or dbt tasks located in a remote Git repository. The Git reference can be a commit, tag, or branch, and you provide it in the job definition.

This helps prevent unintentional changes to your production job, for example, when a user makes local edits in a production repo or switches branches. It also automates the CD step, because you do not need to create a separate production repo in Databricks, manage permissions for it, and keep it updated.

See Run jobs using notebooks in a remote repo.
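As an illustration of Option 1, the following sketch builds job settings that reference a notebook in a remote repository. It assumes the Jobs API 2.1 field names git_source, git_url, git_provider, git_branch, git_tag, and git_commit, and a notebook task with source set to GIT. The helper name, the task key, and the notebook path are hypothetical, and compute settings are left out, so verify the payload against the Jobs API reference before use.

```python
from typing import Optional


def remote_git_job_settings(name: str, git_url: str, notebook_path: str, *,
                            provider: str = "gitHub",
                            branch: Optional[str] = None,
                            tag: Optional[str] = None,
                            commit: Optional[str] = None) -> dict:
    """Build job settings that run a notebook from a remote Git ref.

    Field names follow the Jobs API 2.1 as understood here (an assumption).
    """
    refs = {"git_branch": branch, "git_tag": tag, "git_commit": commit}
    chosen = {key: value for key, value in refs.items() if value is not None}
    # The job definition pins a single Git reference: a branch, a tag, or a commit.
    if len(chosen) != 1:
        raise ValueError("specify exactly one of branch, tag, or commit")
    return {
        "name": name,
        "git_source": {"git_url": git_url, "git_provider": provider, **chosen},
        "tasks": [
            {
                "task_key": "main_task",  # hypothetical task key
                "notebook_task": {
                    # Path relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
                # Compute settings are intentionally omitted from this sketch.
            }
        ],
    }
```

For example, remote_git_job_settings("nightly-etl", "https://github.com/user/demo.git", "notebooks/etl", branch="main") returns settings that track the main branch. Passing tag or commit instead pins the job to a fixed version of the code.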

Option 2: Set up a production repo and Git automation

In this option, you set up a production repo and Git automation to update Databricks Repos on merge.

Step 1: Set up top-level folders

The admin creates non-user top-level folders. The most common use case for these top-level folders is to create development, staging, and production folders that contain Databricks Repos for the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the production folder would contain a Repo that is checked out to the main branch.

Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace. For such top-level folders, we recommend that you grant Can Edit and Can Manage permissions only to service principals, to avoid accidental edits to your production code by workspace users.

Top-level repo folders.

Step 2: Set up automated updates to Databricks Repos via the Repos API

In this step, you set up automation that uses the Repos API to update Databricks Repos upon a merge event.

To ensure that Databricks Repos are always at the latest version, you can set up Git automation to call the Repos API 2.0. In your Git provider, set up automation that, after every successful merge of a PR into the main branch, calls the Repos API endpoint on the appropriate repo in the Production folder to pull the changes and update that repo to the latest version.

For example, on GitHub this can be achieved with GitHub Actions.
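The call that such automation makes can be sketched as follows. This sketch assumes that the Repos API 2.0 update operation is a PATCH request to /api/2.0/repos/{repo_id} with a JSON body that names the branch to check out and pull. The helper name build_repo_update_request is hypothetical, and the workspace URL, token, and repo ID are values that you supply. The function only builds the request, so nothing is sent until you pass it to urllib.request.urlopen.

```python
import json
import urllib.request


def build_repo_update_request(host: str, token: str, repo_id: int,
                              branch: str = "main") -> urllib.request.Request:
    """Build the request that updates a Databricks Repo to the head of a branch.

    Assumes the Repos API 2.0 endpoint PATCH /api/2.0/repos/{repo_id}.
    """
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=body,
        method="PATCH",
        headers={
            # Token of the service principal that can manage the production repo.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

A CI step that runs after a merge to the main branch could read the workspace URL and token from its secret store, build the request for the production repo, and send it with urllib.request.urlopen(request). A non-2xx response raises urllib.error.HTTPError, which fails the step.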

Run jobs using a notebook in a Databricks Repo

You can now point a job directly to a notebook in a Databricks Repo. When a job kicks off a run, it uses the current version of the code in the repo.

If the automation is set up as described in Option 2: Set up a production repo and Git automation, every successful merge calls the Repos API to update the repo. As a result, jobs that are configured to run code from a repo always use the latest version available when the job runs.
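For comparison with Option 1, a task that runs a notebook from a Databricks Repo points at a workspace path rather than a Git ref. The following sketch assumes that repos are addressed under the /Repos root of the workspace and that the notebook task accepts source set to WORKSPACE. The helper name and the example folder, repo, and notebook names are hypothetical.

```python
def repo_notebook_task(task_key: str, repo_path: str, notebook: str) -> dict:
    """Build a task definition that runs a notebook stored in a Databricks Repo.

    Assumes repos live under the /Repos root of the workspace.
    """
    if not repo_path.startswith("/Repos/"):
        raise ValueError("repo_path is expected to start with /Repos/")
    return {
        "task_key": task_key,
        "notebook_task": {
            # Absolute workspace path: <repo path>/<notebook path inside the repo>.
            "notebook_path": f"{repo_path.rstrip('/')}/{notebook.lstrip('/')}",
            # The code comes from the workspace checkout, not from a remote Git ref.
            "source": "WORKSPACE",
        },
        # Compute settings are intentionally omitted from this sketch.
    }
```

For example, repo_notebook_task("etl", "/Repos/Production/demo", "notebooks/etl") targets the notebook in the production repo, so each run picks up whatever version the most recent Repos API update checked out.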

Use a service principal with Databricks Repos

To run the workflows described above with service principals:

  1. Create a service principal with Databricks.

  2. Add the Git credentials: use your Git provider personal access token (PAT) for the service principal.

To set up service principals and then add Git provider credentials:

  1. Create a service principal in your workspace with the SCIM API 2.0 (ServicePrincipals) for workspaces.

  2. Create an access token for the service principal with the Token Management API 2.0.

  3. Add your Git provider credentials to your workspace with the service principal's access token and the Git Credentials API 2.0.

To call these three APIs, you can use tools such as curl, Postman, or Terraform. You cannot use the Databricks user interface.
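The three calls can be sketched with the Python standard library as follows. The endpoint paths and field names reflect the SCIM, Token Management, and Git Credentials APIs as understood here and are assumptions to verify against the API references for your cloud. All helper names are hypothetical. The functions only build the requests, so nothing is sent until you pass a request to urllib.request.urlopen.

```python
import json
import urllib.request


def _post(host: str, path: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )


def create_service_principal_request(host: str, admin_token: str,
                                     display_name: str) -> urllib.request.Request:
    # Step 1: SCIM API 2.0 (ServicePrincipals). Assumed path and schema URN.
    return _post(host, "/api/2.0/preview/scim/v2/ServicePrincipals", admin_token, {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "displayName": display_name,
    })


def create_sp_token_request(host: str, admin_token: str, application_id: str,
                            lifetime_seconds: int = 3600) -> urllib.request.Request:
    # Step 2: Token Management API 2.0, creating a token on behalf of the
    # service principal. application_id comes from the step 1 response.
    return _post(host, "/api/2.0/token-management/on-behalf-of/tokens", admin_token, {
        "application_id": application_id,
        "lifetime_seconds": lifetime_seconds,
        "comment": "token for Repos automation",
    })


def add_git_credentials_request(host: str, sp_token: str, git_provider: str,
                                git_username: str, git_pat: str) -> urllib.request.Request:
    # Step 3: Git Credentials API 2.0, called with the service principal's own
    # token so that the credentials are stored for the service principal.
    return _post(host, "/api/2.0/git-credentials", sp_token, {
        "git_provider": git_provider,
        "git_username": git_username,
        "personal_access_token": git_pat,
    })
```

The order matters: the first two requests are authenticated as a workspace admin, while the third uses the token returned by the second, so the Git credentials belong to the service principal rather than to the admin.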

To learn more about service principals on Databricks, see Service principals for Databricks automation. For information about service principals and CI/CD, see Service principals for CI/CD.

Terraform integration

You can also manage Databricks Repos in a fully automated setup using the Databricks Terraform provider and the databricks_repo resource:

resource "databricks_repo" "this" {
  url = "https://github.com/user/demo.git"
}