CI/CD workflows with Databricks Repos and Git integration

Learn best practices for using Databricks Repos in a CI/CD workflow. Integrating Git repos with Databricks Repos provides source control for project files.

The following figure shows an overview of the steps.

Best practices overview

Admin workflow

Databricks Repos has user-level folders and non-user top-level folders. User-level folders are automatically created when users first clone a remote repository. You can think of Databricks Repos in user folders as “local checkouts” that are individual to each user and where users make changes to their code.

Set up top-level folders

Admins can create non-user top-level folders. The most common use case for these top-level folders is to create Dev, Staging, and Production folders that contain Databricks Repos for the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the Production folder would contain Repos configured to be on the main branch.

Typically permissions on these top-level folders are read-only for all non-admin users within the workspace.

Top-level repo folders
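As an illustration, the following minimal sketch uses the Repos API to create a repo in the Production folder. The workspace URL, token, remote repository URL, and path are placeholders, not values from this article.

# Minimal sketch (placeholder values): create a repo in the Production folder
# by calling the Repos API endpoint POST /api/2.0/repos.
import requests

DATABRICKS_HOST = "https://<your-workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"                  # placeholder; keep tokens in a secret store

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/repos",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "url": "https://github.com/example-org/example-repo.git",  # placeholder remote
        "provider": "gitHub",
        "path": "/Repos/Production/example-repo",  # non-user top-level folder
    },
)
response.raise_for_status()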

Set up Git automation to update Databricks Repos on merge

To ensure that Databricks Repos are always at the latest version, you can set up Git automation to call the Repos API 2.0. In your Git provider, set up automation that, after every successful merge of a PR into the main branch, calls the Repos API endpoint on the appropriate repo in the Production folder to bring that repo to the latest version.

For example, on GitHub this can be achieved with GitHub Actions.
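As a minimal sketch of what that automation could run (for example, as a step in a GitHub Actions workflow), the following Python snippet calls the Repos API to check out the latest main branch. The host, token, and repo ID are placeholders and would typically be supplied as CI secrets.

# Minimal sketch (placeholder values): update a Production repo to the latest
# commit on main by calling the Repos API endpoint PATCH /api/2.0/repos/{repo_id}.
import requests

DATABRICKS_HOST = "https://<your-workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"                  # placeholder; store as a CI secret
REPO_ID = "<repo-id>"                              # placeholder; list repos with GET /api/2.0/repos

response = requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "main"},  # pull the latest version of the main branch
)
response.raise_for_status()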

Developer workflow

In your user folder in Databricks Repos, clone your remote repository. A best practice is to create a new feature branch, or select a previously created branch, for your work instead of committing and pushing changes directly to the main branch. You can make, commit, and push changes in that branch. When you are ready to merge your code, create a pull request and follow the review and merge processes in Git.

Here is an example workflow.

Requirements

This workflow requires that you have already set up your Git integration.

Note

Databricks recommends that each developer work on their own feature branch. Sharing feature branches among developers can cause merge conflicts, which must be resolved using your Git provider. For information about how to resolve merge conflicts, see Resolve merge conflicts.

Workflow

  1. Clone your existing Git repository to your Databricks workspace.

  2. Use the Repos UI to create a feature branch from the main branch. This example uses a single feature branch, feature-b, for simplicity. You can create and use multiple feature branches to do your work.

  3. Make your modifications to Databricks notebooks and files in the Repo.

  4. Commit and push your changes to your Git provider.

  5. Coworkers can now clone the Git repository into their own user folder.

    1. Working on a new branch, a coworker makes changes to the notebooks and files in the Repo.

    2. The coworker commits and pushes their changes to the Git provider.

  6. To merge changes from other branches or rebase the feature branch, you must use the Git command line or an IDE on your local system. Then, in the Repos UI, use the Git dialog to pull changes into the feature-b branch in the Databricks Repo.

  7. When you are ready to merge your work to the main branch, use your Git provider to create a PR to merge the changes from feature-b.

  8. In the Repos UI, pull changes to the main branch.

Production job workflow

You can point a job directly to a notebook in a Databricks Repo. When a job kicks off a run, it uses the current version of the code in the repo.

If the automation is set up as described in Admin workflow, every successful merge calls the Repos API to update the repo. As a result, jobs that are configured to run code from a repo always use the latest version available when the job run was created.
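For illustration, the sketch below creates a job through the Jobs API 2.1 whose notebook task points at a notebook inside a repo in the Production folder. The host, token, notebook path, and cluster settings are placeholders.

# Minimal sketch (placeholder values): create a job whose notebook task runs
# code from a repo in the Production folder (POST /api/2.1/jobs/create).
import requests

DATABRICKS_HOST = "https://<your-workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"                  # placeholder

job_spec = {
    "name": "nightly-etl",  # placeholder job name
    "tasks": [
        {
            "task_key": "run_etl_notebook",
            "notebook_task": {
                "notebook_path": "/Repos/Production/example-repo/notebooks/etl"  # placeholder
            },
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",  # placeholder
                "node_type_id": "i3.xlarge",          # placeholder
                "num_workers": 2,
            },
        }
    ],
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
response.raise_for_status()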

Migration tips

Preview

This feature is in Public Preview.

If you are using %run commands to make Python or R functions defined in a notebook available to another notebook, or are installing custom .whl files on a cluster, consider including those custom modules in a Databricks repo. In this way, you can keep your notebooks and other code modules in sync, ensuring that your notebook always uses the correct version.

Migrate from %run commands

%run commands let you include one notebook within another and are often used to make supporting Python or R code available to a notebook. In this example, a notebook named power.py includes the code below.

# This code is in a notebook named "power.py".
def n_to_mth(n, m):
  print(n, "to the", m, "th power is", n**m)

You can then make functions defined in power.py available to a different notebook with a %run command:

# This notebook uses a %run command to access the code in "power.py".
%run ./power
n_to_mth(3, 4)

Using Files in Repos, you can directly import the module that contains the Python code and run the function.

from power import n_to_mth
n_to_mth(3, 4)

Migrate from installing custom Python .whl files

You can install custom .whl files onto a cluster and then import them into a notebook attached to that cluster. For code that is frequently updated, this process is cumbersome and error-prone. Files in Repos lets you keep these Python files in the same repo with the notebooks that use the code, ensuring that your notebook always uses the correct version.
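As a minimal sketch, assume a hypothetical helper module utils/math_helpers.py is committed in the same repo as the notebook. The notebook can then import it directly, adding the repo folder to sys.path first if the runtime does not do this automatically; the path and module name below are placeholders.

# Minimal sketch (placeholder path and module name): import a helper module that
# lives in the repo instead of installing it as a .whl on the cluster.
import sys

sys.path.append("/Workspace/Repos/<folder>/<repo-name>")  # placeholder repo path

from utils.math_helpers import n_to_mth  # hypothetical module file: utils/math_helpers.py

n_to_mth(3, 4)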

For more information about packaging Python projects, see this tutorial.