What are Databricks Asset Bundles?
Databricks Asset Bundles are a tool to facilitate the adoption of software engineering best practices, including source control, code review, testing, and continuous integration and delivery (CI/CD), for your data and AI projects. Bundles make it possible to describe Databricks resources such as jobs, pipelines, and notebooks as source files. These source files provide an end-to-end definition of a project, including how it should be structured, tested, and deployed, which makes it easier to collaborate on projects during active development.
Bundles provide a way to include metadata alongside your project’s source files. When you deploy a project using bundles, this metadata is used to provision infrastructure and other resources. Your project’s collection of source files and metadata is then deployed as a single bundle to your target environment. A bundle includes the following parts:
Required cloud infrastructure and workspace configurations
Source files, such as notebooks and Python files, that include the business logic
Definitions and settings for Databricks resources, such as Databricks jobs, Delta Live Tables pipelines, Model Serving endpoints, MLflow Experiments, and MLflow registered models
Unit tests and integration tests
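For example, a bundle project on disk might be laid out as follows (an illustrative sketch; the default bundle templates generate a similar structure, and all file names here are made up):

```
my_project/
├── databricks.yml     # bundle and deployment configuration
├── resources/
│   └── my_job.yml     # Databricks resource definitions (jobs, pipelines, ...)
├── src/
│   └── notebook.ipynb # source files with the business logic
└── tests/
    └── test_main.py   # unit tests
```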
The following diagram provides a high-level view of a development and CI/CD pipeline with bundles:
When should I use Databricks Asset Bundles?
Databricks Asset Bundles are an infrastructure-as-code (IaC) approach to managing your Databricks projects. Use them to manage complex projects where multiple contributors and automation are essential and CI/CD is a requirement. Because bundles are defined and managed through YAML templates and files that you create and maintain alongside source code, they map well to scenarios where IaC is an appropriate approach.
Some ideal scenarios for bundles include:
Develop data, analytics, and ML projects in a team-based environment. Bundles help you organize and manage a project’s source files efficiently, which streamlines collaboration during active development.
Iterate on ML problems faster. Manage ML pipeline resources (such as training and batch inference jobs) by using ML projects that follow production best practices from the beginning.
Set organizational standards for new projects by authoring custom bundle templates that include default permissions, service principals, and CI/CD configurations.
Maintain regulatory compliance. In industries where compliance is a significant concern, bundles help maintain a versioned history of code and infrastructure work, which supports governance and helps ensure that necessary compliance standards are met.
How do Databricks Asset Bundles work?
Bundle metadata is defined using YAML files that specify the artifacts, resources, and configuration of a Databricks project. You can create these YAML files manually or generate them using a bundle template. The Databricks CLI can then validate, deploy, and run bundles based on these files. You can run bundle projects from IDEs, terminals, or within Databricks directly. This article uses the Databricks CLI.
Bundles can be created manually or based on a template. The Databricks CLI provides default templates for simple use cases, but for more specific or complex jobs, you can create custom bundle templates to implement your team’s best practices and keep common configurations consistent.
For more details on the configuration YAML used to express Databricks Asset Bundles, see Databricks Asset Bundle configurations.
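To make this concrete, the following is a minimal sketch of a bundle configuration file. The bundle name, job, notebook path, target name, and workspace URL are all illustrative, and compute settings are omitted for brevity:

```yaml
# databricks.yml (illustrative sketch, not a complete production configuration)
bundle:
  name: my_project

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/notebook.ipynb

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dbc-a1b2345c-d6e7.cloud.databricks.com
```

Deploying a bundle like this one would provision the defined job in the workspace that the dev target points to.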
Configure your environment to use bundles
Use the Databricks CLI to easily deploy bundles from the command line. You can check if the Databricks CLI is installed and the current version you’re using by running the following command:
databricks --version
Note
Databricks CLI version 0.218.0 or higher is required. To install the Databricks CLI, see Install or update the Databricks CLI.
After installing the Databricks CLI, verify that your remote Databricks workspaces are configured correctly. Bundles require the workspace files feature to be enabled, as this feature supports working with files other than Databricks Notebooks, such as `.py` and `.yml` files. If you’re using Databricks Runtime version 11.2 (or later), this feature should be enabled by default.
Authentication
Databricks provides several authentication methods. Databricks recommends that you use one of the following methods to authenticate:
For attended authentication scenarios, such as manual workflows where you use your web browser to log in to your target Databricks workspace (when prompted by the Databricks CLI), use OAuth user-to-machine (U2M) authentication. This method is ideal for experimenting with the getting started tutorials for Databricks Asset Bundles or for the rapid development of bundles.
For unattended authentication scenarios, such as fully automated workflows in which there is no opportunity for you to use your web browser to log in to your target Databricks workspace at that time, use OAuth machine-to-machine (M2M) authentication. This method requires the use of Databricks service principals and is ideal for using Databricks Asset Bundles with CI/CD systems such as GitHub.
For OAuth U2M authentication, do the following:
Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace. In the following command, replace `<workspace-url>` with your Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

databricks auth login --host <workspace-url>

The Databricks CLI prompts you to save the information that you entered as a Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command `databricks auth profiles`. To view a specific profile’s existing settings, run the command `databricks auth env --profile <profile-name>`.

In your web browser, complete the on-screen instructions to log in to your Databricks workspace.

To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:

databricks auth token --host <workspace-url>

databricks auth token -p <profile-name>

databricks auth token --host <workspace-url> -p <profile-name>

If you have multiple profiles with the same `--host` value, you might need to specify the `--host` and `-p` options together to help the Databricks CLI find the correct matching OAuth token information.
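Putting the steps above together, an interactive session might look like the following (the workspace URL is the example used earlier in this article, and DEFAULT is a placeholder profile name):

```bash
# Log in and create or update a configuration profile (completes in the browser).
databricks auth login --host https://dbc-a1b2345c-d6e7.cloud.databricks.com

# Inspect profiles and tokens afterward.
databricks auth profiles                 # list existing profiles
databricks auth env --profile DEFAULT    # view one profile's settings
databricks auth token -p DEFAULT         # view the OAuth token and its expiration
```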
You can use this configuration profile’s name in one or more of the following ways whenever you validate, deploy, run, or destroy bundles:

With the command-line option `-p <profile-name>`, appended to the commands `databricks bundle validate`, `databricks bundle deploy`, `databricks bundle run`, or `databricks bundle destroy`. See Databricks Asset Bundles development.

As the value of the `profile` mapping in the bundle configuration file’s top-level `workspace` mapping (although Databricks recommends that you use the `host` mapping set to the Databricks workspace’s URL instead of the `profile` mapping, as it makes your bundle configuration files more portable). See coverage of the `profile` mapping in workspace.

If the configuration profile’s name is `DEFAULT`, it is used by default when the command-line option `-p <profile-name>` or the `profile` (or `host`) mapping is not specified.
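For reference, a bundle configuration file’s top-level `workspace` mapping might look like either of the following sketches (the URL and profile name are illustrative):

```yaml
# Recommended: pin the workspace by URL for portability.
workspace:
  host: https://dbc-a1b2345c-d6e7.cloud.databricks.com

# Alternative: reference a local configuration profile by name.
# workspace:
#   profile: DEFAULT
```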
For OAuth M2M authentication, do the following:
Complete the OAuth M2M authentication setup instructions. See Use a service principal to authenticate with Databricks (OAuth M2M).
Install the Databricks CLI on the target compute resource in one of the following ways:
To manually install the Databricks CLI on the compute resource in real time, see Install or update the Databricks CLI.
To use GitHub Actions to automatically install the Databricks CLI on a GitHub virtual machine, see setup-cli in GitHub.
To use other CI/CD systems to automatically install the Databricks CLI on a virtual machine, see your CI/CD system provider’s documentation and Install or update the Databricks CLI.
Set the following environment variables on the compute resource:

`DATABRICKS_HOST`, set to the Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

`DATABRICKS_CLIENT_ID`, set to the Databricks service principal’s Application ID value.

`DATABRICKS_CLIENT_SECRET`, set to the Databricks service principal’s OAuth Secret value.

To set these environment variables, see the documentation for your target compute resource’s operating system or CI/CD system.
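For example, a GitHub Actions workflow that installs the Databricks CLI with the setup-cli action mentioned above and supplies these variables from repository secrets might look like the following sketch (the workflow name, trigger, and secret names are illustrative):

```yaml
# .github/workflows/deploy-bundle.yml (illustrative sketch)
name: deploy-bundle
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy the bundle
        run: databricks bundle deploy
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
```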
Develop your first Databricks Asset Bundle
The fastest way to start bundle development is by using a template. Create your first bundle project using the Databricks CLI `bundle init` command without any options. This presents a choice of Databricks-provided default bundle templates and asks a series of questions to initialize project variables.
databricks bundle init
Organizations can also create custom bundle templates to define their own standards. These standards might include default permissions, service principals, and custom CI/CD configuration. See Databricks Asset Bundle templates.
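For example, `bundle init` also accepts a path or Git URL to a custom template (the repository URL below is hypothetical):

```bash
# Initialize a new bundle project from a custom template (hypothetical URL).
databricks bundle init https://github.com/my-org/bundle-templates
```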
After you initialize your project, use the `bundle validate` command to validate your bundle before deploying it to your workspaces.
databricks bundle validate
You typically create a bundle on a local development machine with an IDE and the Databricks CLI version 0.218.0 or above. These tools enable you to create, validate, deploy, and run a bundle. See Databricks Asset Bundles development.
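Taken together, a typical local development loop with the Databricks CLI looks something like the following sketch (the target name dev and the resource key my_job are illustrative and depend on your bundle configuration):

```bash
databricks bundle init                # scaffold a project from a template
databricks bundle validate            # check the bundle configuration for errors
databricks bundle deploy -t dev       # deploy the bundle to the "dev" target
databricks bundle run -t dev my_job   # run a deployed resource by its key
databricks bundle destroy -t dev      # remove the deployed resources when finished
```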
You can edit a bundle in a Databricks workspace after you add the bundle to Git by using the Databricks Git folder integration. However, you cannot test or deploy a bundle from a workspace. Instead, you can use your local IDE for testing and CI/CD for deployment.
Next steps
Create a bundle that deploys a notebook to a Databricks workspace and then runs that deployed notebook as a Databricks job. See Develop a job on Databricks by using Databricks Asset Bundles.
Create a bundle that deploys a notebook to a Databricks workspace and then runs that deployed notebook as a Delta Live Tables pipeline. See Develop Delta Live Tables pipelines with Databricks Asset Bundles.
Create a bundle that deploys and runs an MLOps Stack. See Databricks Asset Bundles for MLOps Stacks.
Add a bundle to a CI/CD (continuous integration/continuous deployment) workflow in GitHub. See Run a CI/CD workflow with a Databricks Asset Bundle and GitHub Actions.
Create a bundle that builds, deploys, and calls a Python wheel file. See Develop a Python wheel file using Databricks Asset Bundles.
Create a custom template that you and others can use to create a bundle. See Databricks Asset Bundle templates.
Common tasks
Use the following articles to complete common tasks for Databricks Asset Bundles.
| Article | Use this article when you want to… |
|---|---|
| Databricks Asset Bundles development | Learn about creating, validating, deploying, and running a bundle by authoring a `databricks.yml` configuration file. |
| Databricks Asset Bundle configurations | Create a bundle’s `databricks.yml` configuration file. |
| | Use substitutions and define custom variables to make your bundle configuration more modular and reusable. |
| | Set up a bundle project for Databricks authentication. |
| Develop a job on Databricks by using Databricks Asset Bundles | Create, deploy, and run a bundle for a Databricks job. |
| Develop Delta Live Tables pipelines with Databricks Asset Bundles | Create, deploy, and run a bundle for a Delta Live Tables pipeline. |
| Databricks Asset Bundles for MLOps Stacks | Create, deploy, and run a bundle for an MLOps Stack. |
| | Install libraries that a bundle needs to run on any related Databricks clusters. |
| | Use bundle deployment modes such as `development` and `production`. |
| Databricks Asset Bundle templates | Use a template to make creating specific kinds of bundles faster, easier, and with more consistent and repeatable results. |
| | Apply granular access permissions levels to users, groups, and service principals for specific bundle resources. |
| Define artifact settings dynamically in Databricks Asset Bundles | Combine or override specific settings for artifacts in a bundle. |
| Run a CI/CD workflow with a Databricks Asset Bundle and GitHub Actions | Deploy or run a bundle in response to a specific GitHub workflow event such as a pull request or merge. |
| | Combine or override specific settings for clusters in a bundle. |
| | Add a task to a job in a bundle. |
| | Combine or override specific settings for job tasks in a bundle. |
| Develop a Python wheel file using Databricks Asset Bundles | Build, deploy, and call Python wheel files in a bundle. |