CI/CD on Databricks

Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development, and is becoming increasingly necessary in data engineering and data science. By automating the building, testing, and deployment of code, development teams are able to deliver releases more reliably than with manual processes.

Databricks provides tools for developing CI/CD pipelines that cater to approaches that may differ slightly from organization to organization due to unique aspects of each organization's software development lifecycle. This page provides information about available tools for CI/CD pipelines on Databricks. For details about CI/CD recommendations and best practices, see Best practices and recommended CI/CD workflows on Databricks.

For an overview of CI/CD for machine learning projects on Databricks, see How does Databricks support CI/CD for machine learning?.

High-level flow

A common flow for a Databricks CI/CD pipeline is:

  1. Version: Store your Databricks code and notebooks in a version control system like Git. This allows you to track changes over time and collaborate with other team members.
  2. Code: Develop code and unit tests in a Databricks notebook in the workspace or locally using an IDE.
  3. Build: Use Databricks Asset Bundles settings to automatically build certain artifacts during deployments.
  4. Deploy: Deploy changes to the Databricks workspace using Databricks Asset Bundles with tools like Azure DevOps, GitHub Actions, or Jenkins. See the minimal bundle configuration sketch after this list.
  5. Test: Develop and run automated tests to validate your code changes.
    • Use tools like pytest to test your integrations.
  6. Run: Use the Databricks CLI with Databricks Asset Bundles to automate runs in your Databricks workspaces.
  7. Monitor: Monitor the performance of your code and production workloads in Databricks using tools such as jobs monitoring. This helps you identify and resolve any issues that arise in your production environment.
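
To make steps 3, 4, and 6 concrete, the following is a minimal sketch of a bundle configuration file (`databricks.yml`). The bundle name, job key, notebook path, workspace URLs, and target names are placeholders for illustration only; adjust them to your project.

```yaml
# databricks.yml — minimal illustrative bundle configuration (all names are placeholders)
bundle:
  name: my_project

resources:
  jobs:
    nightly_etl:                          # hypothetical job key
      name: nightly-etl
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/etl_notebook.ipynb   # hypothetical notebook in the bundle

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://<your-dev-workspace>.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://<your-prod-workspace>.cloud.databricks.com
```

With a configuration like this in place, a pipeline (or a developer) would typically run `databricks bundle validate`, `databricks bundle deploy -t <target>`, and `databricks bundle run <job_key>` to validate, deploy, and run the project.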

Available tools

The following tools support CI/CD core principles: version all files and unify asset management, define infrastructure as code, isolate environments, automate testing, and monitor and automate rollbacks.

| Area | Use these tools when you want to… |
|------|-----------------------------------|
| Databricks Asset Bundles | Programmatically define, deploy, and run Lakeflow Jobs, Lakeflow Declarative Pipelines, and MLOps Stacks by using CI/CD best practices and workflows. |
| Databricks Terraform provider | Provision and manage Databricks workspaces and infrastructure using Terraform. |
| Continuous integration and delivery on Databricks using Azure DevOps | Develop a CI/CD pipeline for Databricks that uses Azure DevOps. |
| GitHub Actions | Include a GitHub Action developed for Databricks in your CI/CD workflow. |
| CI/CD with Jenkins on Databricks | Develop a CI/CD pipeline for Databricks that uses Jenkins. |
| Orchestrate Lakeflow Jobs with Apache Airflow | Manage and schedule a data pipeline that uses Apache Airflow. |
| Service principals for CI/CD | Use service principals, instead of users, with CI/CD. |
| Authenticate access to Databricks using OAuth token federation | Use workload identity federation for CI/CD authentication, which eliminates the need for Databricks secrets, making it the most secure way to authenticate to Databricks. |

Databricks Asset Bundles

Databricks Asset Bundles are the recommended approach to CI/CD on Databricks. Use Databricks Asset Bundles to describe Databricks resources such as jobs and pipelines as source files, and bundle them together with other assets to provide an end-to-end definition of a deployable project. These bundles of files can be source controlled, and you can use external CI/CD automation such as GitHub Actions to trigger deployments.
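
For example, the following is a sketch of a GitHub Actions workflow that deploys a bundle on every push to `main`. It assumes the `databricks/setup-cli` action is used to install the Databricks CLI and that the workspace URL and a token are stored as repository secrets named `DATABRICKS_HOST` and `DATABRICKS_TOKEN`; the secret names, target name, and authentication method are assumptions to adapt to your setup.

```yaml
# .github/workflows/deploy-bundle.yml — illustrative sketch; secret names and target are assumptions
name: Deploy bundle

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Install the Databricks CLI
      - uses: databricks/setup-cli@main

      # Validate, then deploy the bundle to the prod target defined in databricks.yml
      - run: databricks bundle validate -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

      - run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```

A pull-request workflow might run only `databricks bundle validate` and your tests, reserving `databricks bundle deploy` for merges to the release branch.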

Bundles include many features, such as custom templates for enforcing consistency and best practices across your organization, and comprehensive support for deploying the code files and configuration for many Databricks resources. Some knowledge of bundle configuration syntax is required to author a bundle.

For recommendations on how to use bundles in CI/CD, see Best practices and recommended CI/CD workflows on Databricks.

Other tools for source control

As an alternative to applying full CI/CD with Databricks Asset Bundles, Databricks offers options to source-control and deploy only code files and notebooks.

  • Git folder: Git folders can be used to reflect the state of a remote Git repository. You can create a Git folder for production to manage source-controlled files and notebooks, then either pull the Git folder to the latest state manually (for example, when you do not have access to external CI/CD pipelines) or use external CI/CD tools such as GitHub Actions to pull the Git folder on merge (see the workflow sketch after this list). This approach works for external orchestrators such as Airflow, but note that only the code files, such as notebooks and dashboard drafts, are in source control. Configurations for jobs or pipelines that run assets in the Git folder and configurations for publishing dashboards are not in source control.

  • Git with jobs: If you only need source control for the code files of a job, you can configure some job types to use a remote Git repository as the source. When a job run begins, Databricks takes a snapshot commit of the remote repository and ensures that the entire job runs against the same version of the code. This approach supports only a limited set of job task types. In addition, only the code files, such as notebooks and other files, are in source control. Job configurations such as task sequences, compute, and schedules are not source-controlled, making this approach less suitable for multi-environment, cross-workspace deployments.
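
As a sketch of the Git folder approach, the following GitHub Actions workflow pulls a production Git folder to the latest commit whenever changes are merged to `main`. The workspace path, secret names, and the exact `databricks repos update` arguments are assumptions; check the CLI reference for your version before relying on them.

```yaml
# .github/workflows/update-git-folder.yml — illustrative sketch; path, secrets, and flags are assumptions
name: Update production Git folder

on:
  push:
    branches: [main]

jobs:
  pull-git-folder:
    runs-on: ubuntu-latest
    steps:
      # Install the Databricks CLI
      - uses: databricks/setup-cli@main

      # Pull the production Git folder to the head of main
      # (the /Repos path and --branch flag are assumed; adjust to your workspace)
      - run: databricks repos update /Repos/production/my-project --branch main
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```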