CI/CD on Databricks

Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development and is becoming increasingly necessary in data engineering and data science. By automating the building, testing, and deployment of code, development teams can deliver releases more reliably than with manual processes.

Common tools are available for developing CI/CD pipelines, but implementations and approaches may differ from organization to organization due to unique aspects of each organization's software development lifecycle. This page provides information about the following approaches to CI/CD on Databricks, including the pros and cons of each approach:

  • Databricks Asset Bundles

  • Production Git folder

  • Git with jobs

For an overview of CI/CD for machine learning projects on Databricks, see How does Databricks support CI/CD for machine learning?.

Databricks Asset Bundles

Databricks Asset Bundles are the recommended approach to CI/CD on Databricks. Use Databricks Asset Bundles to describe Databricks resources such as jobs and pipelines as source files, and bundle them together with other assets to provide an end-to-end definition of a deployable project. These bundles of files can be source controlled, and you can use external CI/CD automation such as GitHub Actions to trigger deployments.
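For example, a minimal databricks.yml might look like the following sketch. The bundle name, job, notebook path, and workspace host are illustrative placeholders:

```yaml
# databricks.yml -- a minimal bundle definition (names, paths, and host are placeholders)
bundle:
  name: my_project

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/ingest.ipynb
          # compute settings omitted for brevity

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
    workspace:
      host: https://my-workspace.cloud.databricks.com
```

You can then check and deploy the bundle with the Databricks CLI, for example with databricks bundle validate followed by databricks bundle deploy -t dev.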

Pros

  • Includes many features, such as custom bundle templates, for enforcing consistency and best practices across your organization.

  • Comprehensive support for deploying the code files and configuration for many Databricks resources.

Cons

  • Some knowledge of bundle configuration syntax is required to author a bundle.

  • There is no guarantee that the bundle deployment folder matches a remote Git commit; the production bundle folder could be accidentally edited in the workspace.

  • Requires external CI/CD pipelines such as GitHub Actions to trigger a deployment on merge (see the workflow sketch after this list).
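As a sketch of that last point, a GitHub Actions workflow along the following lines could deploy a bundle whenever changes are merged to main. The file name, target name, and secret names are assumptions; authentication details depend on how your service principal is configured:

```yaml
# .github/workflows/deploy-bundle.yml -- hypothetical workflow that deploys on merge to main
name: deploy-bundle
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main  # installs the Databricks CLI
      - run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
```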

Production Git folder

If you are not yet ready to adopt Databricks Asset Bundles but want your code to be source controlled, you can set up a production Git folder. Then use external CI/CD tools such as GitHub Actions to update the Git folder on merge, or, if you do not have access to external CI/CD pipelines, create a scheduled job that pulls the latest commit into a Git folder in the workspace.
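For example, a merge-triggered pipeline could run a single Databricks CLI step like the following sketch to fast-forward the production Git folder. The Git folder ID is a placeholder:

```yaml
# Step in a hypothetical CI workflow: update the production Git folder to the latest commit.
# 1234567890 stands in for the Git folder's workspace ID.
- run: databricks repos update 1234567890 --branch main
```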

Pros

  • Supports lightweight, simple deployment for teams that have not adopted Databricks Asset Bundles.

  • Supports CI/CD for workspaces that use external orchestrators such as Airflow.

Cons

  • The production Git folder could be accidentally edited in the workspace.

  • Only the code files, such as notebooks and dashboard drafts, are in source control. Configurations for jobs that run assets in the Git folder and configurations for publishing dashboards are not in source control.

  • Requires external CI/CD pipelines such as GitHub Actions to trigger deployments on merge.

Git with jobs

If you only need CI/CD for jobs, Git with jobs enables you to configure some job types to use a remote Git repository as the source. When a job run begins, Databricks takes a snapshot of the remote repository at a specific commit and ensures that the entire job runs against that version of the code.
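For illustration, the relevant portion of such a job's settings, expressed here in bundle-style YAML, might look like the following sketch. The repository URL, branch, and notebook path are placeholders:

```yaml
# Hypothetical job settings for a notebook task that runs code from a remote Git repository
git_source:
  git_url: https://github.com/my-org/my-repo
  git_provider: gitHub
  git_branch: main
tasks:
  - task_key: main
    notebook_task:
      notebook_path: notebooks/etl  # path relative to the repository root
      source: GIT
```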

Pros

  • Lightweight and can be authored entirely in the UI.

  • Does not require external CI/CD pipelines such as GitHub Actions to execute the latest code.

  • Ensures that production jobs execute remote code with no local edits, preventing unintentional changes to your production jobs.

Cons

  • Only supports a limited set of job task types.

  • Only the code files, such as notebooks and other files, are in source control. Job configurations such as task sequences, compute, and schedules are not source-controlled, making this approach less suitable for multi-environment, cross-workspace deployments.

  • Requires a Git connection at runtime. A job run will fail if the Git connection is disrupted.

Other CI/CD recommendations

Regardless of the CI/CD approach that you choose, use service principals for CI/CD. See Service principals for CI/CD.

Databricks also recommends that you use the Databricks Terraform provider to manage your Databricks workspaces and the associated cloud infrastructure.

For more information on managing the lifecycle of Databricks assets and data, see the following documentation about CI/CD and data pipeline tools.

  • Databricks Asset Bundles: Programmatically define, deploy, and run Databricks jobs, DLT pipelines, and MLOps Stacks by using CI/CD best practices and workflows.

  • Databricks Terraform provider: Provision and manage Databricks workspaces and infrastructure using Terraform.

  • CI/CD workflows with Git and Databricks Git folders: Use GitHub and Databricks Git folders for source control and CI/CD workflows.

  • GitHub Actions: Include a GitHub Action developed for Databricks in your CI/CD workflow.

  • CI/CD with Jenkins on Databricks: Develop a CI/CD pipeline for Databricks that uses Jenkins.

  • Orchestrate Databricks jobs with Apache Airflow: Manage and schedule a data pipeline that uses Apache Airflow.

  • Service principals for CI/CD: Use service principals, instead of users, with CI/CD systems.