CI/CD on Databricks
Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development, and is becoming increasingly necessary in data engineering and data science. By automating the building, testing, and deployment of code, development teams are able to deliver releases more reliably than with manual processes.
Databricks provides tools for developing CI/CD pipelines that accommodate approaches that vary from organization to organization, because each organization's software development lifecycle has its own unique aspects. This page describes the tools available for CI/CD pipelines on Databricks. For details about CI/CD recommendations and best practices, see Best practices and recommended CI/CD workflows on Databricks.
For an overview of CI/CD for machine learning projects on Databricks, see How does Databricks support CI/CD for machine learning?.
High-level flow
A common flow for a Databricks CI/CD pipeline is:
- Version: Store your Databricks code and notebooks in a version control system like Git. This allows you to track changes over time and collaborate with other team members.
  - Individual users use a Git folder to author and test changes before committing them to a Git repository. See CI/CD with Databricks Git folders (Repos).
  - Optionally configure bundle Git settings.
- Code: Develop code and unit tests in a Databricks notebook in the workspace or locally using an IDE.
  - Use the Lakeflow Pipelines Editor to develop pipelines in the workspace.
  - Use the Databricks Visual Studio Code extension to develop and deploy local changes to Databricks workspaces.
- Build: Use Databricks Asset Bundles settings to automatically build certain artifacts during deployments.
  - Configure the artifacts mapping in your bundle configuration; a configuration sketch follows this list.
  - Pylint extended with the Databricks Labs pylint plugin helps to enforce coding standards and detect bugs in your Databricks notebooks and application code.
- Deploy: Deploy changes to the Databricks workspace using Databricks Asset Bundles with tools like Azure DevOps, GitHub Actions, or Jenkins.
  - Configure deployments using bundle deployment modes.
  - For details about using Azure DevOps and Databricks, see Continuous integration and delivery on Databricks using Azure DevOps.
  - For Databricks GitHub Actions examples, see GitHub Actions.
- Test: Develop and run automated tests to validate your code changes.
  - Use tools like pytest to test your integrations.
- Run: Use the Databricks CLI with Databricks Asset Bundles to automate runs in your Databricks workspaces.
  - Run bundle resources using databricks bundle run; an example automation workflow follows this list.
- Monitor: Monitor the performance of your code and production workloads in Databricks using tools such as jobs monitoring. This helps you identify and resolve any issues that arise in your production environment.
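To illustrate the bundle Git settings, artifacts mapping, and deployment modes mentioned in the steps above, the following databricks.yml is a minimal sketch. The bundle name, repository URL, artifact path, target names, and workspace hosts are placeholder assumptions; adapt them to your project.

```yaml
# databricks.yml -- a minimal sketch of a bundle configuration.
# All names, URLs, and hosts below are placeholders, not required values.
bundle:
  name: my-project                   # hypothetical bundle name
  git:
    origin_url: https://github.com/my-org/my-project   # optional bundle Git settings
    branch: main

artifacts:
  default:
    type: whl                        # build a Python wheel during deployment
    path: .                          # project root containing the wheel's build files

targets:
  dev:
    mode: development                # development deployment mode
    default: true
    workspace:
      host: https://<your-dev-workspace-host>
  prod:
    mode: production                 # production deployment mode
    workspace:
      host: https://<your-prod-workspace-host>
```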
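As a sketch of automating the deploy, test, and run steps, the following GitHub Actions workflow runs unit tests, then validates, deploys, and runs the bundle with the Databricks CLI. The tests/ path, the prod target, the my_job resource key, and the secret names are assumptions for illustration.

```yaml
# .github/workflows/bundle-deploy.yml -- illustrative sketch only.
name: Deploy bundle
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v4

      # Install the Databricks CLI, which provides the `databricks bundle` commands.
      - uses: databricks/setup-cli@main

      # Run unit tests before deploying (assumes tests live under tests/).
      - run: pip install pytest && pytest tests/

      # Validate and deploy the bundle to the prod target defined in databricks.yml,
      # then run a job resource. `my_job` is a hypothetical resource key.
      - run: databricks bundle validate -t prod
      - run: databricks bundle deploy -t prod
      - run: databricks bundle run my_job -t prod
```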
Available tools
The following tools support CI/CD core principles: version all files and unify asset management, define infrastructure as code, isolate environments, automate testing, and monitor and automate rollbacks.
Area | Use these tools when you want to… |
---|---|
Databricks Asset Bundles | Programmatically define, deploy, and run Lakeflow Jobs, Lakeflow Declarative Pipelines, and MLOps Stacks by using CI/CD best practices and workflows. |
Databricks Terraform provider | Provision and manage Databricks workspaces and infrastructure using Terraform. |
Continuous integration and delivery on Databricks using Azure DevOps | Develop a CI/CD pipeline for Databricks that uses Azure DevOps. |
GitHub Actions | Include a GitHub Action developed for Databricks in your CI/CD workflow. |
Jenkins | Develop a CI/CD pipeline for Databricks that uses Jenkins. |
Apache Airflow | Manage and schedule a data pipeline that uses Apache Airflow. |
Service principals for CI/CD | Use service principals, instead of users, with CI/CD. |
Authenticate access to Databricks using OAuth token federation | Use workload identity federation for CI/CD authentication, which eliminates the need for Databricks secrets, making it the most secure way to authenticate to Databricks. |
Databricks Asset Bundles
Databricks Asset Bundles are the recommended approach to CI/CD on Databricks. Use Databricks Asset Bundles to describe Databricks resources such as jobs and pipelines as source files, and bundle them together with other assets to provide an end-to-end definition of a deployable project. These bundles of files can be source controlled, and you can use external CI/CD automation such as GitHub Actions to trigger deployments.
Bundles include many features, such as custom templates for enforcing consistency and best practices across your organization, and comprehensive support for deploying the code files and configuration for many Databricks resources. Some knowledge of bundle configuration syntax is required to author a bundle.
For recommendations on how to use bundles in CI/CD, see Best practices and recommended CI/CD workflows on Databricks.
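For illustration, a bundle's resources mapping describes jobs and pipelines alongside the code they run. The following fragment is a minimal sketch; the resource key, job name, notebook path, and cluster settings are placeholder assumptions.

```yaml
# Fragment of a bundle configuration (databricks.yml or an included file) -- a sketch only.
resources:
  jobs:
    my_job:                                   # hypothetical resource key
      name: my-project-job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/notebook.ipynb   # path relative to the bundle root
          new_cluster:
            spark_version: 15.4.x-scala2.12       # example runtime version
            node_type_id: i3.xlarge               # example node type; varies by cloud
            num_workers: 1
```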
Other tools for source control
As an alternative to applying full CI/CD with Databricks Asset Bundles, Databricks offers options to source-control and deploy only code files and notebooks.
- Git folder: Git folders can be used to reflect the state of a remote Git repository. You can create a Git folder for production to manage source-controlled files and notebooks, then either pull the Git folder to the latest state manually (for example, when you do not have access to external CI/CD pipelines) or use external CI/CD tools such as GitHub Actions to pull the Git folder on merge; a sketch of automating the pull follows this list. This approach works for external orchestrators such as Airflow, but note that only the code files, such as notebooks and dashboard drafts, are in source control. Configurations for jobs or pipelines that run assets in the Git folder, and configurations for publishing dashboards, are not in source control.
- Git with jobs: If you only need source control for a job's code files, you can configure some job types to use a remote Git repository as the source; a sketch of the relevant job settings follows this list. When a job run begins, Databricks takes a snapshot commit of the remote repository and ensures that the entire job runs against the same version of the code. This approach supports only a limited set of job task types. In addition, only the code files, such as notebooks and other files, are in source control. Job configurations such as task sequences, compute, and schedules are not source controlled, making this approach less suitable for multi-environment, cross-workspace deployments.
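As a sketch of automating the Git folder pull on merge (the first item above), a CI step can call the Databricks CLI to update a workspace Git folder to the latest commit of a branch. The folder path, branch, and secret names below are assumptions.

```yaml
# Fragment of a GitHub Actions workflow -- a sketch; the Git folder path is an assumption.
on:
  push:
    branches: [main]

jobs:
  update-prod-folder:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: databricks/setup-cli@main
      # Pull the production Git folder to the head of main.
      - run: databricks repos update /Repos/prod/my-project --branch main
```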
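And for the Git with jobs approach (the second item above), the job's settings reference the remote repository directly. The following is a sketch of the relevant Jobs API fields, shown as YAML for readability; the repository URL and notebook path are assumptions.

```yaml
# Sketch of job settings that use a remote Git repository as the source.
# Shown as YAML; the Jobs API accepts the equivalent JSON.
git_source:
  git_url: https://github.com/my-org/my-project   # hypothetical repository
  git_provider: gitHub
  git_branch: main
tasks:
  - task_key: main
    notebook_task:
      notebook_path: notebooks/etl                # path within the repository
      source: GIT
```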