Databricks asset bundles for Python wheels
Preview
This feature is in Public Preview.
This article describes how to build, deploy, and run a Python wheel as part of a Databricks asset bundle project. See What are Databricks asset bundles?
Requirements
- Databricks CLI version 0.205 or above. To check your installed version of the Databricks CLI, run the command `databricks -v`. To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI.
- The remote workspace must have workspace files enabled. See What are workspace files?.
Decision: Create the bundle manually or by using a template
Decide whether you want to create a starter bundle by using a template or to create the bundle manually. Creating the bundle by using a template is faster and easier, but the bundle might produce content that is not needed, and the bundle’s default settings must be further customized for real applications. Creating the bundle manually gives you full control over the bundle’s settings, but you must be familiar with how bundles work, as you are doing all of the work from the beginning. Choose one of the following sets of steps:
Create the bundle by using a template
In these steps, you create the bundle by using the Databricks default bundle template for Python. These steps guide you to create a bundle that consists of files to build into a Python wheel and the definition of a Databricks job that runs this Python wheel. You then validate, deploy, and run the deployed Python wheel from the Python wheel job within your Databricks workspace.
Step 1: Set up authentication
In this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This demo uses a Databricks configuration profile and a Databricks personal access token for authentication. See Databricks personal access token authentication and Databricks configuration profiles. For additional authentication types you can use instead, see Databricks client unified authentication.
1. From your terminal or command prompt, use the Databricks CLI to run the `databricks configure` command:

   ```
   databricks configure
   ```

2. For Databricks Host, enter the value of your workspace URL, for example `https://dbc-a1b2345c-d6e7.cloud.databricks.com`, and press Enter.
3. For Personal Access Token, enter the value of your Databricks personal access token, and press Enter.
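When the prompts complete, the Databricks CLI saves the credentials as a configuration profile in the `.databrickscfg` file in your home directory. As a sketch, a default profile looks like the following (the host and token values shown are placeholders):

```
[DEFAULT]
host  = https://dbc-a1b2345c-d6e7.cloud.databricks.com
token = <your-personal-access-token>
```

Later `databricks` commands use this profile automatically unless you specify a different one.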
Step 2: Create the bundle
A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.
Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template’s generated bundle.
1. Use the Databricks CLI to run the `bundle init` command:

   ```
   databricks bundle init
   ```

2. For Template to use, leave the default value of `default-python` by pressing Enter.
3. For Unique name for this project, leave the default value of `my_project`, or type a different value, and then press Enter. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.
4. For Include a stub (sample) notebook, type `no` and press Enter. This instructs the Databricks CLI to not add a sample notebook to your bundle.
5. For Include a stub (sample) DLT pipeline, type `no` and press Enter. This instructs the Databricks CLI to not define a sample Delta Live Tables pipeline in your bundle.
6. For Include a stub (sample) Python package, leave the default value of `yes` by pressing Enter. This instructs the Databricks CLI to add sample Python wheel package files and related build instructions to your bundle.
Step 3: Explore the bundle
To view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. Files of particular interest include the following:

- `databricks.yml`: This file specifies the bundle’s programmatic name, includes a reference to the Python wheel job definition, and specifies settings about the target workspace.
- `resources/<project-name>_job.yml`: This file specifies the Python wheel job’s settings.
- `src/<project-name>`: This directory includes the files that the Python wheel job uses to build the Python wheel.
Step 4: Validate the project’s bundle settings file
In this step, you check whether the bundle settings are valid.
From the root directory, use the Databricks CLI to run the `bundle validate` command, as follows:

```
databricks bundle validate
```
If a JSON representation of the bundle settings is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle settings are still valid.
Step 5: Build the Python wheel and deploy the local project to the remote workspace
In this step, you build the Python wheel, deploy the built Python wheel to your remote Databricks workspace, and create the Databricks job within your workspace.
In the Visual Studio Code terminal, use the Databricks CLI to run the `bundle deploy` command as follows:

```
databricks bundle deploy -t dev
```
- Check whether the locally built Python wheel was deployed: In your Databricks workspace’s sidebar, click Workspace. Click into the Users > `<your-username>` > .bundle > `<project-name>` > dev > artifacts > .internal > `<random-guid>` folder. The Python wheel should be in this folder.
- Check whether the job was created: In your Databricks workspace’s sidebar, click Workflows. On the Jobs tab, click [dev `<your-username>`] `<project-name>`_job, and then click the Tasks tab. There should be one task: main_task.
If you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle settings are still valid and then redeploy the project.
Step 6: Run the deployed project
In this step, you run the Databricks job in your workspace.
1. From the root directory, use the Databricks CLI to run the `bundle run` command, as follows, replacing `<project-name>` with the name of your project from Step 2:

   ```
   databricks bundle run -t dev <project-name>_job
   ```

2. Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Databricks workspace.
3. In your Databricks workspace, after the task completes successfully and shows a green title bar, click the main_task task to see the results.
If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle settings are still valid, redeploy the project, and run the redeployed project.
You have reached the end of the steps for creating a bundle by using a template.
Create the bundle manually
In these steps, you create the bundle from the beginning. These steps guide you to create a bundle that consists of files to build into a Python wheel and the definition of a Databricks job that runs this Python wheel. You then validate, deploy, and run the deployed Python wheel from the job within your Databricks workspace.
Optionally, you might want to use an integrated development environment (IDE) that provides automatic schema suggestions and actions when working with YAML files. The following steps use Visual Studio Code with the YAML extension installed from the Visual Studio Code Marketplace.
These steps assume that you already know:

- How to create, build, and work with Python wheels. See the Python Packaging User Guide.
- How to use Python wheels as part of a Databricks job. See Use a Python wheel in a Databricks job.
Follow these instructions to create a sample bundle that builds a Python wheel, deploys the Python wheel, and then runs the deployed Python wheel.
If you have already built a Python wheel and just want to deploy and run it, skip ahead to specifying the Python wheel settings in the bundle settings file in Step 2.
Step 1: Set up authentication
In this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This demo uses a Databricks configuration profile and a Databricks personal access token for authentication. See Databricks personal access token authentication and Databricks configuration profiles. For additional authentication types you can use instead, see Databricks client unified authentication.
1. From your terminal or command prompt, use the Databricks CLI to run the `databricks configure` command:

   ```
   databricks configure
   ```

2. For Databricks Host, enter the value of your workspace URL, for example `https://dbc-a1b2345c-d6e7.cloud.databricks.com`, and press Enter.
3. For Personal Access Token, enter the value of your Databricks personal access token, and press Enter.
Step 2: Create the bundle
A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.
In your bundle’s root, create a folder named `my_package`.

In the `my_package` folder, create a file named `setup.py`. Also create a child folder named `src` with the following three files: `__init__.py`, `__main__.py`, and `my_module.py`:

```
my_package
|-- src
|   |-- __init__.py
|   |-- __main__.py
|   `-- my_module.py
`-- setup.py
```
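As a sketch, you can scaffold this layout from your bundle's root with standard shell commands (assuming a Unix-like shell):

```shell
# Create the package directory and its src subdirectory.
mkdir -p my_package/src

# Create the empty files that the following steps fill in.
touch my_package/setup.py \
      my_package/src/__init__.py \
      my_package/src/__main__.py \
      my_package/src/my_module.py
```

On Windows, create the same folders and empty files with File Explorer or PowerShell instead.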
Add the following content to the `src/__init__.py` file and then save the file:

```
__version__ = '0.0.1'
__author__ = '<my-author-name>'

import sys, os

sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
```

Replace `<my-author-name>` with your name or the name of your organization.

Add the following code to the `src/__main__.py` file and then save the file:

```
import sys, os

sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))

from src.my_module import *

def main():
    first = 200
    second = 400

    print(f"{first} + {second} = {add_two_numbers(first, second)}")
    print(f"{second} - {first} = {subtract_two_numbers(second, first)}")
    print(f"{first} * {second} = {multiply_two_numbers(first, second)}")
    print(f"{second} / {first} = {divide_two_numbers(second, first)}")

if __name__ == "__main__":
    main()
```

Add the following code to the `src/my_module.py` file and then save the file:

```
def add_two_numbers(a, b):
    return a + b

def subtract_two_numbers(a, b):
    return a - b

def multiply_two_numbers(a, b):
    return a * b

def divide_two_numbers(a, b):
    return a / b
```
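Before packaging, you can sanity-check the arithmetic helpers locally. The snippet below repeats the `my_module.py` definitions so that it runs standalone; inside the project, `__main__.py` imports them from `src.my_module` instead:

```python
# The same helper definitions as in src/my_module.py, repeated here
# so this snippet is self-contained.
def add_two_numbers(a, b):
    return a + b

def subtract_two_numbers(a, b):
    return a - b

def multiply_two_numbers(a, b):
    return a * b

def divide_two_numbers(a, b):
    return a / b

# The same values that main() in src/__main__.py uses.
first = 200
second = 400

print(f"{first} + {second} = {add_two_numbers(first, second)}")       # 200 + 400 = 600
print(f"{second} - {first} = {subtract_two_numbers(second, first)}")  # 400 - 200 = 200
print(f"{first} * {second} = {multiply_two_numbers(first, second)}")  # 200 * 400 = 80000
print(f"{second} / {first} = {divide_two_numbers(second, first)}")    # 400 / 200 = 2.0
```

This is the same output you should see later when the deployed job runs.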
Add the following code to the `setup.py` file and then save the file:

```
from setuptools import setup, find_packages

import src

setup(
    name = "my_package",
    version = src.__version__,
    author = src.__author__,
    url = "https://<my-url>",
    author_email = "<my-author-name>@<my-organization>",
    description = "<my-package-description>",
    packages = find_packages(include = ["src"]),
    entry_points={"group_1": "run=src.__main__:main"},
    install_requires = ["setuptools"]
)
```

- Replace `https://<my-url>` with your organization’s URL.
- Replace `<my-author-name>@<my-organization>` with your organization’s primary email contact address.
- Replace `<my-package-description>` with a display description for your Python wheel.
Add the following code to the project’s bundle settings file:

Note

If you have already built a Python wheel and just want to deploy and run it, then modify the following bundle settings file by omitting the `artifacts` mapping. The Databricks CLI will automatically deploy the files that are specified in the `libraries` array’s `whl` entries.

```
# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: my-wheel-bundle

artifacts:
  my-wheel:
    type: whl
    path: ./my_package

resources:
  jobs:
    wheel-job:
      name: wheel-job
      tasks:
        - task_key: wheel-task
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
            data_security_mode: USER_ISOLATION
            num_workers: 1
          python_wheel_task:
            entry_point: run
            package_name: my_package
          libraries:
            - whl: ./my_package/dist/my_package-*.whl

targets:
  dev:
    workspace:
      host: <workspace-url>
```
Replace `<workspace-url>` with your workspace instance URL, for example `https://dbc-a1b2345c-d6e7.cloud.databricks.com`.

Note

The available mappings for a child artifact mapping are as follows:

- `type` is required, and it must be `whl` for Python wheel builds.
- `path` is an optional, relative path from the location of the bundle settings file to the location of the `setup.py` file for the Python wheel. If `path` is not included, the Databricks CLI will attempt to find the `setup.py` file for the Python wheel in the bundle’s root.
- `files` is an optional mapping that is not included in this example. `files` includes a child `source` mapping, which you can use to specify non-default locations to include for complex build instructions. Locations are specified as relative paths from the location of the bundle settings file.
- `build` is an optional mapping that is not included in this example. `build` can be used to run non-default build commands. For Python wheel builds, the Databricks CLI assumes that it can find a local install of the Python `wheel` package to run builds, and it runs the command `python setup.py bdist_wheel` by default during each bundle deployment. To specify multiple build commands, separate each command with double-ampersand (`&&`) characters.
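For example, an artifact declaration that overrides the default build command might look like the following sketch. The `pip install wheel` step is an assumption for environments where the `wheel` package is not already installed; the two commands are chained with `&&` as described above:

```
artifacts:
  my-wheel:
    type: whl
    path: ./my_package
    build: pip install wheel && python setup.py bdist_wheel
```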
If you intend to store this bundle with a Git provider, add a `.gitignore` file in the project’s root, and add the following entries to this file:

```
.databricks
my_package/build
my_package/dist
my_package/my_package.egg-info
my_package/src/__pycache__
```
Step 3: Validate the project’s bundle settings file
In this step, you check whether the bundle settings are valid.
From the root directory, validate the bundle settings file:

```
databricks bundle validate
```
If a JSON representation of the bundle settings is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle settings are still valid.
Step 4: Build the Python wheel and deploy the local project to the remote workspace
Build the Python wheel locally, deploy the built Python wheel to your workspace, and create the job in your workspace:

```
databricks bundle deploy
```
Step 5: Run the deployed project
Run the deployed job, which calls the deployed Python wheel:

```
databricks bundle run wheel-job
```
1. In the output, copy the Run URL and paste it into your web browser’s address bar.
2. In the job run’s Output page, the following results appear:

   ```
   200 + 400 = 600
   400 - 200 = 200
   200 * 400 = 80000
   400 / 200 = 2.0
   ```
If you make any changes to your bundle after this step, you should repeat steps 3-5 to check whether your bundle settings are still valid, redeploy the project, and run the redeployed project.