Use Databricks Git folders in CI/CD

Learn techniques for using Databricks Git folders in CI/CD workflows. By configuring Databricks Git folders in the workspace, you can use source control for project files in Git repositories and you can integrate them into your data engineering pipelines.

Most of the work in developing automation for Git folders is in the initial configuration for your folders and in understanding the Databricks Repos REST API you use to automate Git operations from Databricks jobs. Before you start building your automation and setting up folders, review the remote Git repositories you will incorporate into your automation flows and select the right ones for the different stages of your automation, including development, integration, staging, and production.

The following figure provides an overview that describes the basic flows for automation using Databricks Git folders. In it, there are production folders, user folders, and automation built using Databricks jobs that run notebook code when triggered.

Overview of CI/CD techniques for Git folders.

  • Admin workflow: For production flows, a Databricks workspace admin must set up top-level folders in your workspace. The admin clones a Git repository and branch when creating them, and should give these folders meaningful names such as "Production", "Test", or "Staging" which correspond to the remote Git repositories' purpose in your development flows. For more details, see Production Git folders in this topic.
  • User workflow: A user can create a Git folder under /Workspace/Users/<email>/ based on a remote Git repository. The user must then create a local, user-specific branch for the work they will commit and push to the remote repository. For information on collaborating in user-specific Git folders, see Collaborate using Git folders in this topic.
  • Merge workflow: Users can create pull requests (PRs) from any folder, which can trigger Databricks jobs automation to call the Databricks Repos API to test and process their changes. These changes can be pushed to the "main" branches of your production Git folders, also by using the Databricks Repos API.

For a more comprehensive overview of CI/CD with Databricks, see CI/CD on Databricks.

Choose a Git folder configuration

There are two types of Databricks Git folders, differentiated by their usage pattern and locations in the workspace:

  • User-level folders. When a user clones a remote repository in the Databricks UI to create a Git folder, the Git folder is created in their personal folder under /Workspace/Users/<email>/ by default. User-level folders are used for individual development. You can think of Databricks Git folders in user folders as “local checkouts” that are specific to each user and where users make and push changes to their code.

  • Production folders. A production folder is created outside of user workspaces by a Databricks admin and hosts a production branch from the backing Git repository. It is primarily used for deployment automation and must only be updated when a PR is merged into the backing branch.

If you are not yet ready to adopt Databricks Asset Bundles but want your code to be source controlled, you can set up a production Git folder. You use automation to keep it up-to-date with the remote Git repo and branch. This can be done in one of two ways:

    1. Use external CI/CD tools such as GitHub Actions to pull the latest commits to the production Git folder when a pull request is merged into the remote Git repo and branch.
    2. Create a scheduled job that updates the production Git folder to the current state of the remote Git repo and branch.
Tip

Databricks Asset Bundles, which allow you to define resources such as jobs and pipelines in source files, can be created, deployed, and managed in a Git folder in the workspace. See Collaborate on bundles in the workspace.

Collaborate using Git folders

You can easily collaborate with others using Git folders, pulling updates and pushing changes directly from the Databricks UI. For example, use a feature or development branch to aggregate changes made across multiple contributor branches.

The following flow describes how to collaborate using a feature branch named feature-b based on the main branch.

  1. Clone your existing Git repository to your Databricks workspace.
  2. Use the Git folders UI to create a feature branch from the main branch. This example uses a single feature branch feature-b for simplicity. You can create and use multiple feature branches to do your work.
  3. Make your modifications to Databricks notebooks and other files in the repo.
  4. Commit and push your changes to the remote Git repository.
  5. Contributors can now clone the Git repository into their own user folder.
    1. Working on a new branch, the contributor makes changes to the notebooks and other files in the Git folder.
    2. The contributor commits and pushes their changes to the remote Git repository.
  6. To bring in changes from other branches, use the Git folders UI in Databricks to either merge those branches into feature-b or rebase feature-b on another branch.
  7. When you are ready to merge your work to the remote Git repository and main branch, use the Git folders UI to merge the changes from feature-b. If you prefer, you can instead merge changes directly to the Git repository backing your Git folder.
Note

Databricks recommends that each developer works on their own feature branch. For information about how to resolve merge conflicts, see Resolve merge conflicts.

Production Git folders

A production folder is a Git folder designated for production and deployment. You can create it in a user folder or as a team folder under /Workspace, for example, /Workspace/Production. Databricks recommends creating a production-specific Git folder.

Production folders are created from remote repositories whose branches aggregate changes or represent different stages in your build and deployment automation. Their permissions restrict users from committing to the content directly. For production Git folders, limit user access to read-only and allow only admins and Databricks service principals to edit their contents. The remote Git repository and branch that back a production folder should be designated specifically for production-ready code and assets.

Git production folders mapped to the main branch on a remote repository.

To create a Git production folder:

  1. Choose a Git repository and branch that you use specifically for production and deployment of your code and assets. Configure service principal authentication for the repository and limit any Git user privileges so the sources cannot be easily altered by users outside of your organization.

  2. Create a Databricks Git folder for the Git repository and branch under Workspace or a subfolder dedicated to a project or team. Give it an easily identifiable name such as "Production" or "Deployment".

  3. Select the folder and then select Share, or right-click the folder in the Workspace tree and select Share (Permissions). Configure the Git folder with the following permissions:

    • Set Can Run for any project users that must run notebooks or other code from that folder.
    • Set Can Run for any Databricks service principal accounts that will run automation for it.
    • If appropriate for your project, set Can View for all users in the workspace to encourage discovery and sharing.

    The Sharing Git folder modal dialog window.

  4. Select Add.
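
The same permissions can also be applied by automation instead of through the Share dialog. The following is a minimal sketch, assuming the Databricks Permissions API endpoint PATCH /api/2.0/permissions/repos/{repo_id} and assuming that "Can View" in the UI corresponds to the CAN_READ permission level. The helper name build_git_folder_acl, the example user, the service principal application ID, and the "users" group are hypothetical placeholders. The code only builds the request body; it does not call the workspace.

```python
import json


def build_git_folder_acl(run_users, run_service_principals, view_group=None):
    """Build a request body for PATCH /api/2.0/permissions/repos/{repo_id}.

    Assumption: "Can Run" maps to CAN_RUN and "Can View" maps to CAN_READ.
    """
    acl = []
    # Project users who need to run notebooks or other code from the folder.
    for user in run_users:
        acl.append({"user_name": user, "permission_level": "CAN_RUN"})
    # Service principals (identified by application ID) that run automation.
    for sp in run_service_principals:
        acl.append({"service_principal_name": sp, "permission_level": "CAN_RUN"})
    # Optional read-only access for a group, for discovery and sharing.
    if view_group:
        acl.append({"group_name": view_group, "permission_level": "CAN_READ"})
    return {"access_control_list": acl}


# Hypothetical principals, for illustration only.
body = build_git_folder_acl(
    run_users=["dev@example.com"],
    run_service_principals=["00000000-0000-0000-0000-000000000000"],
    view_group="users",
)
print(json.dumps(body, indent=2))
```

Check the permission level names against the Permissions API reference for your workspace before sending a body like this.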

Production job workflow

Databricks Git folders provide two options for running your production jobs:

  • Option 1: Provide a remote Git reference in the job definition. For example, run a specific notebook in the main branch of a Git repository.
  • Option 2: Set up a production Git repository and call Repos APIs to update it programmatically. Run jobs against the Databricks Git folder that clones this remote repository. The Repos API call should be the first task in the job.
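
To illustrate the task ordering in Option 2, the following sketch builds a job definition in which the first task updates the production Git folder and the production task depends on it. It assumes the Jobs API 2.1 job settings shape (tasks, task_key, depends_on, notebook_task). The function name, the job name, and both notebook paths are hypothetical, and compute settings are omitted for brevity.

```python
def build_production_job(updater_notebook, production_notebook):
    """Build a two-task job body in the style of POST /api/2.1/jobs/create.

    The first task is expected to call the Repos API to update the
    production Git folder; the second runs only after it succeeds.
    """
    return {
        "name": "production-run",  # hypothetical job name
        "tasks": [
            {
                "task_key": "update_git_folder",
                "notebook_task": {
                    "notebook_path": updater_notebook,
                    "source": "WORKSPACE",
                },
            },
            {
                "task_key": "run_production",
                # Run only after the Git folder has been updated.
                "depends_on": [{"task_key": "update_git_folder"}],
                "notebook_task": {
                    "notebook_path": production_notebook,
                    "source": "WORKSPACE",
                },
            },
        ],
    }


# Hypothetical notebook paths, for illustration only.
job = build_production_job(
    updater_notebook="/Workspace/Automation/update_production_folder",
    production_notebook="/Workspace/Production/my-project/etl",
)
```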

Option 1: Run jobs using notebooks in a remote repository

Simplify the job definition process and keep a single source of truth by running a Databricks job using notebooks located in a remote Git repository. This Git reference can be a Git commit, tag, or branch and is provided by you in the job definition.

This helps prevent unintentional changes to your production job, such as when a user makes local edits in a production repository or switches branches. It also automates the CD step as you do not need to create a separate production Git folder in Databricks, manage permissions for it, and keep it updated.

See Use Git with jobs.
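
As a rough illustration of Option 1, the sketch below builds a job definition that points at a remote Git reference. It assumes the Jobs API 2.1 git_source block (git_url, git_provider, and one of git_branch, git_tag, or git_commit) and a notebook task with source set to GIT. The function name, repository URL, and notebook path are hypothetical, and compute settings are left out.

```python
def build_git_job(name, git_url, notebook_path, ref="main",
                  ref_type="git_branch", provider="gitHub"):
    """Build a job body that runs a notebook from a remote Git repository."""
    if ref_type not in ("git_branch", "git_tag", "git_commit"):
        raise ValueError(f"unsupported Git reference type: {ref_type}")
    return {
        "name": name,
        "git_source": {
            "git_url": git_url,
            "git_provider": provider,
            ref_type: ref,  # branch, tag, or commit to check out for each run
        },
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {
                    # Path relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
            }
        ],
    }


# Hypothetical repository and notebook, for illustration only.
job = build_git_job(
    name="nightly-etl",
    git_url="https://github.com/example-org/example-repo",
    notebook_path="notebooks/etl",
)
```

Pinning ref_type to git_tag or git_commit instead of a branch makes each run reproducible, at the cost of updating the job definition for every release.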

Option 2: Set up a production Git folder and Git automation

In this option, you set up a production Git folder and automation to update the Git folder on merge.

Step 1: Set up top-level folders

First, have a Databricks administrator create top-level folders in your workspace to contain individual Git folders for your development, staging, and production branches. For example, if your company uses the main branch for production, the production Git folder must be configured to use the main branch.

Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace. For these top-level folders, Databricks recommends that you grant Can Edit and Can Manage permissions only to service principals, to avoid accidental edits to your production code by workspace users.

Top-level Git folders.
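
If an admin scripts this setup rather than using the UI, the per-stage folders might be planned as follows. This is a sketch under assumptions: it assumes the Repos API accepts url, provider, and path in POST /api/2.0/repos, and that a branch is checked out afterwards with PATCH /api/2.0/repos/{repo_id} using the ID returned by the create call. The stage names, branch names, project name, and repository URL are hypothetical, and the parent folders are assumed to exist already.

```python
def plan_stage_folders(git_url, stages, project="my-project", provider="gitHub"):
    """Return one create/checkout request body pair per stage folder."""
    plan = []
    for folder_name, branch in stages.items():
        plan.append({
            # Body for POST /api/2.0/repos
            "create": {
                "url": git_url,
                "provider": provider,
                "path": f"/Workspace/{folder_name}/{project}",
            },
            # Body for PATCH /api/2.0/repos/{repo_id}
            "checkout": {"branch": branch},
        })
    return plan


# Hypothetical mapping of top-level folders to branches.
plan = plan_stage_folders(
    "https://github.com/example-org/example-repo",
    {"Development": "dev", "Staging": "staging", "Production": "main"},
)
```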

Step 2: Set up automated updates to Databricks Git folders using the API

To keep a Git folder in Databricks up to date with the latest version of your source files and assets, you can set up Git automation to call the Repos API. Using your Git provider's CI/CD tools (or compatible 3rd-party ones), set up automation that calls the Repos API endpoint on the appropriate Git folder to update it to the latest version of your sources after every successful merge of a PR into the main branch. For example, you can use GitHub Actions if GitHub is your provider.

For more information about the Databricks Repos API, see the Databricks REST API documentation for Repos.
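
The update call that a CI/CD step makes after a merge can be sketched as below. It assumes the Repos API endpoint PATCH /api/2.0/repos/{repo_id}, where sending the branch name updates the Git folder to the latest commit on that branch, and bearer-token authentication. The workspace URL, token, and repo ID are placeholders. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_update_request(host, token, repo_id, branch="main"):
    """Build a PATCH request that updates a Git folder to a branch head."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    data = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; in CI these would come from secrets or variables.
req = build_update_request(
    "https://example.cloud.databricks.com", "dapi-placeholder", 123
)
# To send the request from a CI step: urllib.request.urlopen(req)
```

In a GitHub Actions workflow, a step like this would typically run on pushes to the main branch, with the token belonging to the service principal that has edit access to the production Git folder.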
