CI/CD on Databricks
Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development and is becoming increasingly necessary in data engineering and data science. By automating the building, testing, and deployment of code, development teams can deliver releases more reliably than with manual processes.
Common tools are available for developing CI/CD pipelines, but implementations and approaches may differ from organization to organization because each organization's software development lifecycle has its own unique aspects. This page describes three approaches to CI/CD on Databricks, along with the pros and cons of each: Databricks Asset Bundles, a production Git folder, and Git with jobs.
For an overview of CI/CD for machine learning projects on Databricks, see How does Databricks support CI/CD for machine learning?.
Databricks Asset Bundles (Recommended)
Databricks Asset Bundles are the recommended approach to CI/CD on Databricks. Use Databricks Asset Bundles to describe Databricks resources such as jobs and pipelines as source files, and bundle them together with other assets to provide an end-to-end definition of a deployable project. These bundles of files can be source controlled, and you can use external CI/CD automation such as GitHub Actions to trigger deployments.
Pros | Cons |
---|---|
|
|
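The deployment step that an external CI/CD system triggers can be sketched as a short Python script that calls the Databricks CLI. This is a minimal sketch under stated assumptions, not a definitive pipeline: it assumes the Databricks CLI is installed and authenticated on the CI runner, and that the bundle defines a target named `staging`. The target name and the helper functions `bundle_commands` and `deploy` are hypothetical.

```python
import subprocess


def bundle_commands(target: str) -> list[list[str]]:
    """Return the CLI invocations that validate and then deploy a bundle."""
    return [
        ["databricks", "bundle", "validate", "--target", target],
        ["databricks", "bundle", "deploy", "--target", target],
    ]


def deploy(target: str, dry_run: bool = True) -> list[str]:
    """Render each command; run it as well when dry_run is False."""
    rendered = []
    for cmd in bundle_commands(target):
        rendered.append(" ".join(cmd))
        if not dry_run:
            # check=True raises on the first failing command,
            # so a failed validation prevents the deployment.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    # "staging" is an assumed target name; dry_run=True only prints the commands.
    for line in deploy("staging", dry_run=True):
        print(line)
```

In a real pipeline, the script would run from the bundle's root directory with `dry_run=False`, typically after a merge to the main branch.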
Production Git folder
If you are not yet ready to adopt Databricks Asset Bundles, but want your code to be source controlled, you can set up a production Git folder. Then use external CI/CD tools such as GitHub Actions to pull the latest changes into the Git folder on merge. If you do not have access to external CI/CD pipelines, create a scheduled job that pulls the latest changes into the Git folder in the workspace.
Pros | Cons |
---|---|
|
|
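The pull that a CI/CD tool or scheduled job performs on a production Git folder can be sketched as a call to the Repos REST API, which updates a Git folder to the latest commit of a branch. This is a minimal sketch: the workspace URL, token placeholder, and Git folder ID `123` are hypothetical, and the helper `build_pull_request` only builds the request without sending it.

```python
import json
import urllib.request


def build_pull_request(host: str, token: str, repo_id: int, branch: str) -> urllib.request.Request:
    """Build (but do not send) a request that updates a Git folder to a branch's latest commit."""
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    req = build_pull_request("https://example.cloud.databricks.com", "<token>", 123, "main")
    print(req.get_method(), req.full_url)
    # To send the request for real: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch safe to run anywhere and makes the payload easy to inspect in a CI log.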
Git with jobs
If you only need CI/CD for jobs, Git with jobs enables you to configure some job types to use a remote Git repository as the source. When a job run begins, Databricks takes a snapshot commit of the remote repository and ensures that the entire job runs against the same version of the code.
Pros | Cons |
---|---|
|
|
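A job that uses a remote Git repository as its source can be sketched as a Jobs API settings payload with a `git_source` block. This is a minimal sketch under stated assumptions: the repository URL, branch, notebook path, and the helper `git_job_settings` are hypothetical, and compute settings are omitted for brevity.

```python
import json


def git_job_settings(name: str, git_url: str, branch: str, notebook_path: str) -> dict:
    """Return job settings whose notebook task is read from a remote Git repository."""
    return {
        "name": name,
        "git_source": {
            "git_url": git_url,
            "git_provider": "gitHub",
            "git_branch": branch,
        },
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {
                    # Path is relative to the repository root.
                    "notebook_path": notebook_path,
                    # "GIT" tells the task to load the notebook from git_source.
                    "source": "GIT",
                },
            }
        ],
    }


if __name__ == "__main__":
    settings = git_job_settings(
        "nightly-etl",  # hypothetical job name
        "https://github.com/example-org/example-repo",  # hypothetical repository
        "main",
        "notebooks/etl",  # hypothetical notebook path
    )
    print(json.dumps(settings, indent=2))
```

Because the branch is resolved to a single commit when the run starts, every task in the run reads the same version of the code.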
Other CI/CD recommendations
Regardless of the CI/CD approach that you choose, use service principals for CI/CD. See Service principals for CI/CD.
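As a minimal sketch of how a pipeline might check for service principal credentials before it runs, the following script verifies that the environment variables used for OAuth machine-to-machine authentication are set. It assumes the CI system injects the credentials as `DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, and `DATABRICKS_CLIENT_SECRET`; the helper `missing_auth_vars` is hypothetical.

```python
import os

# Environment variables assumed to carry the service principal's credentials.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET")


def missing_auth_vars(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_auth_vars()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("Service principal credentials found.")
```

Failing early on missing credentials gives a clearer error than an authentication failure partway through a deployment.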
Databricks also recommends that you use the Databricks Terraform provider to manage your Databricks workspaces and the associated cloud infrastructure.
Related links
For more information on managing the lifecycle of Databricks assets and data, see the following documentation about CI/CD and data pipeline tools.
Area | Use these tools when you want to… |
---|---|
Databricks Asset Bundles | Programmatically define, deploy, and run Databricks jobs, DLT pipelines, and MLOps Stacks by using CI/CD best practices and workflows. |
Databricks Terraform provider | Provision and manage Databricks workspaces and infrastructure using Terraform. |
Git and Databricks Git folders | Use GitHub and Databricks Git folders for source control and CI/CD workflows. |
GitHub Actions | Include a GitHub Action developed for Databricks in your CI/CD workflow. |
Jenkins | Develop a CI/CD pipeline for Databricks that uses Jenkins. |
Apache Airflow | Manage and schedule a data pipeline that uses Apache Airflow. |
Service principals for CI/CD | Use service principals, instead of users, with CI/CD systems. |