CI/CD using Databricks Asset Bundles

Databricks recommends using Databricks Asset Bundles for CI/CD, which simplify the development and deployment of complex data, analytics, and ML projects for the Databricks platform. Bundles allow you to easily manage many custom configurations and automate builds, tests, and deployments of your projects to Databricks development, staging, and production workspaces.

For more information about recommended CI/CD best practices and workflows with bundles, see Best practices and recommended CI/CD workflows on Databricks.

For information about other approaches to CI/CD in Databricks, see CI/CD on Databricks.

How do I use Databricks Asset Bundles as part of my CI/CD pipeline on Databricks?

You can use Databricks Asset Bundles to define and programmatically manage your Databricks CI/CD implementation, which usually includes:

  • Notebooks: Databricks notebooks are often a key part of data engineering and data science workflows. You can use version control for notebooks, and also validate and test them as part of a CI/CD pipeline. You can run automated tests against notebooks to check whether they are functioning as expected.
  • Libraries: Manage the library dependencies required to run your deployed code. Use version control on libraries and include them in automated testing and validation.
  • Workflows: Databricks Jobs allow you to schedule and run automated tasks using notebooks or Spark jobs.
  • Data pipelines: You can also include data pipelines in CI/CD automation, using DLT, the framework in Databricks for declaring data pipelines.
  • Infrastructure: Infrastructure configuration includes definitions and provisioning information for clusters, workspaces, and storage for target environments. Infrastructure changes can be validated and tested as part of a CI/CD pipeline, which helps catch inconsistencies and errors before they reach a target environment.
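
To make the pieces above concrete, the following is a minimal sketch of how a bundle might group a notebook-based job with per-environment targets. It builds a Python dictionary that mirrors the layout of a databricks.yml file and prints it as JSON for inspection. The bundle name, the resource key example_job, the task key run_notebook, and the notebook path are hypothetical placeholders, and compute settings are omitted for brevity, so treat this as an illustration rather than a complete configuration.

```python
import json
from typing import Any, Dict


def build_bundle_config(bundle_name: str, notebook_path: str) -> Dict[str, Any]:
    """Return a dict mirroring the layout of a databricks.yml file.

    All names used here (example_job, run_notebook) are hypothetical
    placeholders chosen for illustration only.
    """
    return {
        "bundle": {"name": bundle_name},
        "resources": {
            "jobs": {
                # Resource key used to refer to this job within the bundle.
                "example_job": {
                    "name": f"{bundle_name}-job",
                    "tasks": [
                        {
                            "task_key": "run_notebook",
                            "notebook_task": {"notebook_path": notebook_path},
                        }
                    ],
                }
            }
        },
        # One target per environment; dev is the default target.
        "targets": {
            "dev": {"mode": "development", "default": True},
            "prod": {"mode": "production"},
        },
    }


if __name__ == "__main__":
    config = build_bundle_config("my_project", "./src/etl_notebook.py")
    print(json.dumps(config, indent=2))
```

In a real project the same structure would normally be written by hand as YAML in databricks.yml at the bundle root, alongside the notebooks and library code it refers to.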

A common flow for a Databricks CI/CD pipeline with bundles is:

  1. Store: Store your Databricks code and notebooks in a version control system like Git. This allows you to track changes over time and collaborate with other team members. See CI/CD techniques with Git and Databricks Git folders (Repos) and bundle Git settings.
  2. Code: Develop code and unit tests in a Databricks notebook in the workspace or locally using an external IDE. Databricks provides a Visual Studio Code extension that makes it easy to develop and deploy changes to Databricks workspaces.
  3. Build: Use Databricks Asset Bundles settings to automatically build certain artifacts during deployments. See artifacts. In addition, Pylint extended with the Databricks Labs pylint plugin helps to enforce coding standards and detect bugs in your Databricks notebooks and application code.
  4. Deploy: Deploy changes to the Databricks workspace using Databricks Asset Bundles in conjunction with tools like Azure DevOps, Jenkins, or GitHub Actions. See Databricks Asset Bundle deployment modes. For GitHub Actions examples, see GitHub Actions.
  5. Test: Develop and run automated tests to validate your code changes using tools like pytest. To test your integrations with workspace APIs, the Databricks Labs pytest plugin allows you to create workspace objects and clean them up after tests finish.
  6. Run: Use the Databricks CLI in conjunction with Databricks Asset Bundles to automate runs in your Databricks workspaces. See Run a job or pipeline.
  7. Monitor: Monitor the performance of your code and workflows in Databricks using tools like Azure Monitor or Datadog. This helps you identify and resolve any issues that arise in your production environment.
  8. Iterate: Make small, frequent iterations to improve and update your data engineering or data science project. Small changes are easier to roll back than large ones.
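
The deploy and run steps in the flow above can be sketched as a small Python wrapper around the Databricks CLI, of the kind a CI job might call. This is a minimal sketch under stated assumptions: the target name dev and the resource key my_job are hypothetical, the Databricks CLI is assumed to be installed and authenticated on the build agent, and the default dry_run mode only returns the commands without executing them.

```python
import subprocess
from typing import List


def bundle_commands(target: str, resource_key: str) -> List[List[str]]:
    """Return the CLI invocations for a validate -> deploy -> run sequence."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, resource_key],
    ]


def run_pipeline(target: str, resource_key: str, dry_run: bool = True) -> List[str]:
    """Process each command in order and return the commands as strings.

    With dry_run=True nothing is executed. Otherwise each command runs via
    subprocess, and a non-zero exit code raises CalledProcessError, which
    stops the sequence before any later step runs.
    """
    processed = []
    for cmd in bundle_commands(target, resource_key):
        processed.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return processed


if __name__ == "__main__":
    # "dev" and "my_job" are hypothetical names for illustration.
    for line in run_pipeline("dev", "my_job"):
        print(line)
```

Validating before deploying means a configuration error fails the CI job early, before anything in the workspace changes. CI systems such as GitHub Actions or Azure DevOps can also invoke the same three CLI commands directly as pipeline steps.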