What are Databricks Asset Bundles?
Databricks Asset Bundles (DABs) are a new tool for streamlining the development of complex data, analytics, and ML projects for the Databricks platform. Bundles make it easy to manage complex projects during active development by providing CI/CD capabilities in your software development workflow with a single concise and declarative YAML syntax. By using bundles to automate your project’s tests, deployments, and configuration management, you can reduce errors while promoting software best practices across your organization through templated projects.
Preview
This feature is in Public Preview.
Bundles provide a way to include metadata alongside your project’s source files. When you deploy a project using bundles, this metadata is used to provision infrastructure and other resources. Your project’s collection of source files and metadata is then deployed as a single bundle to your target environment.
A bundle includes the following parts:
Required cloud infrastructure and workspace configurations.
Source files, such as notebooks and Python files, that include the business logic.
Definitions and settings for Databricks resources, such as Databricks jobs, Delta Live Tables pipelines, Model Serving endpoints, MLflow Experiments, and MLflow registered models.
Unit tests and integration tests.
When should I use Databricks Asset Bundles?
Databricks Asset Bundles are an infrastructure-as-code (IaC) approach to managing your Databricks projects. Use them when you want to manage complex projects where multiple contributors and automation are essential, and continuous integration and deployment (CI/CD) are a requirement. Since bundles are defined and managed through YAML templates and files that you create and maintain alongside source code, they map well to scenarios where IaC is an appropriate approach.
Some ideal scenarios for bundles include:
Develop data, analytics, and ML projects in a team-based environment. Bundles can help you organize and manage various source files efficiently. This ensures smooth collaboration and streamlined processes.
Iterate on ML problems faster. Manage ML pipeline resources (such as training and batch inference jobs) by using ML projects that follow production best practices from the beginning.
Set organizational standards for new projects by authoring custom bundle templates that include default permissions, service principals, and CI/CD configurations.
Regulatory compliance: In industries where regulatory compliance is a significant concern, bundles can help maintain a versioned history of code and infrastructure work. This assists in governance and ensures that necessary compliance standards are met.
How do Databricks Asset Bundles work?
Bundle metadata is defined using YAML files that specify the artifacts, resources, and configuration of a Databricks project. You can create this YAML file manually or generate one using a bundle template. The Databricks CLI can then be used to validate, deploy, and run bundles using these bundle YAML files. You can deploy and run bundle projects from IDEs, terminals, or within Databricks directly. We’ll cover using the Databricks CLI in this article.
The Databricks CLI provides default templates for simple use cases. For more specific or complex jobs, you can create custom bundle templates that implement your team’s best practices and keep common configurations consistent.
For more details on the configuration YAML used to express Databricks Asset Bundles, see Databricks Asset Bundle configurations.
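As a rough illustration of the kind of YAML described above, the following Python sketch renders a minimal bundle configuration and writes it to a file named databricks.yml. The bundle name, target name, and both helper functions are hypothetical and chosen only for this example. The sketch assumes the top-level bundle, targets, and workspace mappings; real bundles usually also declare resources such as jobs or pipelines, so treat Databricks Asset Bundle configurations as the authoritative reference for the schema.

```python
from pathlib import Path


def render_bundle_config(bundle_name: str, target: str, host: str) -> str:
    """Return a minimal bundle configuration as YAML text (illustrative only)."""
    lines = [
        "bundle:",
        f"  name: {bundle_name}",
        "",
        "targets:",
        f"  {target}:",
        "    default: true",
        "    workspace:",
        f"      host: {host}",
    ]
    return "\n".join(lines) + "\n"


def write_bundle_config(directory: str, text: str) -> Path:
    """Write the rendered configuration to <directory>/databricks.yml."""
    path = Path(directory) / "databricks.yml"
    path.write_text(text, encoding="utf-8")
    return path
```

For example, render_bundle_config("my_bundle", "dev", "https://dbc-a1b2345c-d6e7.cloud.databricks.com") produces a configuration with a single dev target that points at the placeholder workspace URL used in this article.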
Configure your environment to use bundles
Use the Databricks CLI to easily deploy bundles from the command line. You can check if the Databricks CLI is installed and the current version you’re using by running the following command:
databricks --version
Note
Databricks CLI version 0.205.2 or higher is required. To install the Databricks CLI, see Install or update the Databricks CLI.
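If you want to check the minimum version from a script, a sketch along the following lines might help. It assumes that the output of databricks --version contains a dotted version number (for example, text such as Databricks CLI v0.205.2); the helper names are hypothetical. Versions are compared as integer tuples so that a legacy version such as 0.18.0 correctly sorts below 0.205.2.

```python
import re
import shutil
import subprocess

MINIMUM_VERSION = (0, 205, 2)  # minimum version stated in the note above


def parse_cli_version(output: str):
    """Extract (major, minor, patch) from CLI output, or None if absent."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=MINIMUM_VERSION) -> bool:
    """Compare versions numerically, component by component."""
    return version >= minimum


def installed_cli_version():
    """Return the installed CLI version, or None if the CLI is not on PATH."""
    if shutil.which("databricks") is None:
        return None
    result = subprocess.run(
        ["databricks", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumption: the version string appears somewhere in stdout or stderr.
    return parse_cli_version(result.stdout + result.stderr)
```

A caller could run installed_cli_version() and pass any non-None result to meets_minimum() before attempting to use bundle commands.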
After installing the Databricks CLI, verify that your remote Databricks workspaces are configured correctly. Bundles require the workspace files feature to be enabled, as this feature supports working with files other than Databricks Notebooks, such as .py and .yml files. If you’re using Databricks Runtime version 11.2 (or later), this feature should be enabled by default.
Authentication
Databricks provides several authentication methods. Databricks recommends that you use one of the following methods to authenticate:
For attended authentication scenarios, such as manual workflows where you use your web browser to log in to your target Databricks workspace (when prompted by the Databricks CLI), use OAuth user-to-machine (U2M) authentication. This method is ideal for experimenting with the getting started tutorials for Databricks Asset Bundles or for the rapid development of bundles.
For unattended authentication scenarios, such as fully automated workflows in which there is no opportunity for you to use your web browser to log in to your target Databricks workspace at that time, use OAuth machine-to-machine (M2M) authentication. This method requires the use of Databricks service principals and is ideal for using Databricks Asset Bundles with CI/CD systems such as GitHub.
For OAuth U2M authentication, do the following:
Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.
In the following command, replace <workspace-url> with your Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.
databricks auth login --host <workspace-url>
The Databricks CLI prompts you to save the information that you entered as a Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.
To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.
In your web browser, complete the on-screen instructions to log in to your Databricks workspace.
To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:
databricks auth token --host <workspace-url>
databricks auth token -p <profile-name>
databricks auth token --host <workspace-url> -p <profile-name>
If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.
You can use this configuration profile’s name in one or more of the following ways whenever you validate, deploy, run, or destroy bundles:
With the command-line option -p <profile-name>, appended to the commands databricks bundle validate, databricks bundle deploy, databricks bundle run, or databricks bundle destroy. See Databricks Asset Bundles development workflow.
As the value of the profile mapping in the bundle configuration file’s top-level workspace mapping (although Databricks recommends that you use the host mapping set to the Databricks workspace’s URL instead of the profile mapping, as it makes your bundle configuration files more portable). See coverage of the profile mapping in workspace.
If the configuration profile’s name is DEFAULT, it is used by default when the command-line option -p <profile-name> or the profile (or host) mapping is not specified.
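To see which workspace each profile points to without invoking the CLI, a script can read the profile file directly. The sketch below is an assumption-laden illustration: it assumes that configuration profiles are stored as INI-style sections (each with a host key) in a file named .databrickscfg in your home directory, and the function names are hypothetical. The databricks auth profiles command remains the supported way to list profiles.

```python
import configparser
from pathlib import Path


def profile_hosts(cfg_text: str) -> dict:
    """Map each profile name to its host value from INI-style profile text."""
    # Treat [DEFAULT] as an ordinary section so that its values are not
    # inherited by other profiles, and disable interpolation because
    # secrets may legitimately contain '%' characters.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None
    )
    parser.read_string(cfg_text)
    return {
        name: parser.get(name, "host", fallback="")
        for name in parser.sections()
    }


def load_profile_hosts(path=None) -> dict:
    """Read profiles from the assumed default location, ~/.databrickscfg."""
    cfg_path = Path(path) if path else Path.home() / ".databrickscfg"
    if not cfg_path.exists():
        return {}
    return profile_hosts(cfg_path.read_text(encoding="utf-8"))
```

A profile named DEFAULT appears in the result like any other profile, which makes it easy to see what the CLI falls back to when neither -p nor a profile or host mapping is specified.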
For OAuth M2M authentication, do the following:
Complete the OAuth M2M authentication setup instructions. See OAuth machine-to-machine (M2M) authentication.
Install the Databricks CLI on the target compute resource in one of the following ways:
To manually install the Databricks CLI on the compute resource in real time, see Install or update the Databricks CLI.
To use GitHub Actions to automatically install the Databricks CLI on a GitHub virtual machine, see setup-cli in GitHub.
To use other CI/CD systems to automatically install the Databricks CLI on a virtual machine, see your CI/CD system provider’s documentation and Install or update the Databricks CLI.
Set the following environment variables on the compute resource as follows:
DATABRICKS_HOST, set to the Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.
DATABRICKS_CLIENT_ID, set to the Databricks service principal’s Application ID value.
DATABRICKS_CLIENT_SECRET, set to the Databricks service principal’s OAuth Secret value.
To set these environment variables, see the documentation for your target compute resource’s operating system or CI/CD system.
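In an automated workflow, it can be useful to fail early when one of these variables is missing. The following sketch checks for the three variables named above; the function name is hypothetical, and the check only confirms that each variable is set to a non-empty value, not that the credentials are valid.

```python
import os

REQUIRED_M2M_VARIABLES = (
    "DATABRICKS_HOST",
    "DATABRICKS_CLIENT_ID",
    "DATABRICKS_CLIENT_SECRET",
)


def missing_m2m_variables(environ=None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_M2M_VARIABLES if not env.get(name)]
```

A CI step might call missing_m2m_variables() and stop with a clear error message if the returned list is not empty, before running any bundle commands.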
Develop your first Databricks Asset Bundle
The fastest way to start bundle development is by using a template. Create your first bundle project using the Databricks CLI bundle init command without any options. This presents a choice of Databricks-provided default bundle templates and asks a series of questions to initialize project variables.
databricks bundle init
Organizations can also create custom bundle templates to define their own standards. These standards might include default permissions, service principals, and custom CI/CD configuration. See Databricks Asset Bundle templates.
After you initialize your project, use the bundle validate command to validate your bundle before deploying it to your workspaces.
databricks bundle validate
You typically create a bundle on a local development machine with an IDE and the Databricks CLI version 0.205.2 or above. These tools enable you to create, validate, deploy, and run a bundle. See Databricks Asset Bundles development workflow.
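The validate, deploy, and run steps can also be scripted. The following sketch builds the command lines for databricks bundle validate, databricks bundle deploy, and databricks bundle run, and then runs them in order. The helper names, the target name, and the job key are hypothetical. The -p option is described earlier in this article; the -t option for selecting a target is an assumption that is not covered here, so confirm it against the help output of your CLI version.

```python
import subprocess


def bundle_command(verb, target=None, profile=None, extra=()):
    """Build the argument list for one `databricks bundle <verb>` call."""
    argv = ["databricks", "bundle", verb]
    if target:
        argv += ["-t", target]  # assumed flag for choosing a target
    if profile:
        argv += ["-p", profile]  # configuration profile, as described above
    argv += list(extra)
    return argv


def run_workflow(target, resource_key, profile=None):
    """Validate, deploy, and then run one resource; stop at the first failure."""
    steps = [
        bundle_command("validate", target, profile),
        bundle_command("deploy", target, profile),
        bundle_command("run", target, profile, extra=[resource_key]),
    ]
    for argv in steps:
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to log or review the exact commands before anything is deployed.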
You can edit a bundle in a Databricks workspace after you add the bundle to Git by using the Databricks Git folder integration. However, you cannot test or deploy a bundle from a workspace. Instead, you can use your local IDE for testing and CI/CD for deployment.
Next steps
Create a bundle that deploys a notebook to a Databricks workspace and then runs that deployed notebook as a Databricks job. See Develop a job on Databricks by using Databricks Asset Bundles.
Create a bundle that deploys a notebook to a Databricks workspace and then runs that deployed notebook as a Delta Live Tables pipeline. See Develop a Delta Live Tables pipeline by using Databricks Asset Bundles.
Create a bundle that deploys and runs an MLOps Stack. See Databricks Asset Bundles for MLOps Stacks.
Add a bundle to a CI/CD (continuous integration/continuous deployment) workflow in GitHub. See Run a CI/CD workflow with a Databricks Asset Bundle and GitHub Actions.
Create a bundle that builds, deploys, and calls a Python wheel file. See Develop a Python wheel file using Databricks Asset Bundles.
Create a custom template that you and others can use to create a bundle. See Databricks Asset Bundle templates.
Common tasks
Use the following articles to complete common tasks for Databricks Asset Bundles.
| Article | Use this article when you want to… |
|---|---|
| Databricks Asset Bundles development workflow | Learn about the workflow for creating, validating, deploying, and running a bundle by authoring a |
| | Create a bundle’s |
| | Set up a bundle project for Databricks authentication. |
| Develop a job on Databricks by using Databricks Asset Bundles | Create, deploy, and run a bundle for a Databricks job. |
| Develop a Delta Live Tables pipeline by using Databricks Asset Bundles | Create, deploy, and run a bundle for a Delta Live Tables pipeline. |
| Databricks Asset Bundles for MLOps Stacks | Create, deploy, and run a bundle for an MLOps Stack. |
| | Install libraries that a bundle needs to run on any related Databricks clusters. |
| | Use bundle deployment modes such as |
| Databricks Asset Bundle templates | Use a template to make creating specific kinds of bundles faster, easier, and with more consistent and repeatable results. |
| | Apply granular access permission levels to users, groups, and service principals for specific bundle resources. |
| Define artifact settings dynamically in Databricks Asset Bundles | Combine or override specific settings for artifacts in a bundle. |
| Run a CI/CD workflow with a Databricks Asset Bundle and GitHub Actions | Deploy or run a bundle in response to a specific GitHub workflow event such as a pull request or merge. |
| | Combine or override specific settings for clusters in a bundle. |
| | Add a task to a job in a bundle. |
| | Combine or override specific settings for job tasks in a bundle. |
| Develop a Python wheel file using Databricks Asset Bundles | Build, deploy, and call Python wheel files in a bundle. |