Databricks asset bundles for Python wheels

Preview

This feature is in Public Preview.

This article describes how to build, deploy, and run a Python wheel as part of a Databricks asset bundle project. See What are Databricks asset bundles?

Requirements

  • Databricks CLI version 0.205 or above. To check your installed version of the Databricks CLI, run the command databricks -v. To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI.

  • The remote workspace must have workspace files enabled. See What are workspace files?.

Decision: Create the bundle manually or by using a template

Decide whether you want to create a starter bundle by using a template or to create the bundle manually. Creating the bundle by using a template is faster and easier, but the template might generate content that you do not need, and you must further customize the bundle’s default settings for real applications. Creating the bundle manually gives you full control over the bundle’s settings, but you must be familiar with how bundles work, because you do all of the work from the beginning. Choose one of the following sets of steps:

Create the bundle by using a template

In these steps, you create the bundle by using the Databricks default bundle template for Python. These steps guide you to create a bundle that consists of files to build into a Python wheel and the definition of a Databricks job that runs this Python wheel. You then validate the bundle, deploy it (which builds the files into a Python wheel and uploads it), and run the wheel from the Python wheel job within your Databricks workspace.

Step 1: Set up authentication

In this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This demo uses a Databricks configuration profile and a Databricks personal access token for authentication. See Databricks personal access token authentication and Databricks configuration profiles. For additional authentication types you can use instead, see Databricks client unified authentication.

  1. From your terminal or command prompt, use the Databricks CLI to run the databricks configure command:

    databricks configure
    
  2. For Databricks Host, enter the value of your workspace URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com, and press Enter.

  3. For Personal Access Token, enter the value of your Databricks personal access token, and press Enter.
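
After you complete these prompts, the Databricks CLI stores your credentials as a configuration profile named DEFAULT in the .databrickscfg file in your home folder. The saved profile looks similar to the following (the token value here is a placeholder):

    [DEFAULT]
    host  = https://dbc-a1b2345c-d6e7.cloud.databricks.com
    token = dapi...

Later commands in this article use this profile automatically. To use a different profile, pass the --profile option to the Databricks CLI.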

Step 2: Create the bundle

A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.

  1. Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template’s generated bundle.

  2. Use the Databricks CLI to run the bundle init command:

    databricks bundle init
    
  3. For Template to use, leave the default value of default-python by pressing Enter.

  4. For Unique name for this project, leave the default value of my_project, or type a different value, and then press Enter. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.

  5. For Include a stub (sample) notebook, type no and press Enter. This instructs the Databricks CLI to not add a sample notebook to your bundle.

  6. For Include a stub (sample) DLT pipeline, type no and press Enter. This instructs the Databricks CLI to not define a sample Delta Live Tables pipeline in your bundle.

  7. For Include a stub (sample) Python package, leave the default value of yes by pressing Enter. This instructs the Databricks CLI to add sample Python wheel package files and related build instructions to your bundle.
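
If you want to script bundle creation instead of answering these prompts interactively, the bundle init command also accepts a JSON file of template parameters through its --config-file option. The parameter names in the following sketch are assumptions based on the default-python template at the time of writing; check your CLI version’s template schema if initialization reports an unrecognized parameter:

    databricks bundle init default-python --config-file ./init-params.json

where init-params.json contains:

    {
      "project_name": "my_project",
      "include_notebook": "no",
      "include_dlt": "no",
      "include_python": "yes"
    }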

Step 3: Explore the bundle

To view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. Files of particular interest include the following:

  • databricks.yml: This file specifies the bundle’s programmatic name, includes a reference to the Python wheel job definition, and specifies settings about the target workspace. A trimmed sketch of this file appears after this list.

  • resources/<project-name>_job.yml: This file specifies the Python wheel job’s settings.

  • src/<project-name>: This directory includes the files that the Python wheel job uses to build the Python wheel.
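
For orientation, the generated databricks.yml looks roughly like the following trimmed sketch. The exact contents vary by CLI version, so treat this as an illustration rather than the literal generated file:

    # databricks.yml (trimmed sketch)
    bundle:
      name: my_project

    # Pull in the job definition from the resources folder.
    include:
      - resources/*.yml

    targets:
      dev:
        mode: development
        default: true
        workspace:
          host: https://dbc-a1b2345c-d6e7.cloud.databricks.com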

Step 4: Validate the project’s bundle settings file

In this step, you check whether the bundle settings are valid.

  1. From the root directory, use the Databricks CLI to run the bundle validate command, as follows:

    databricks bundle validate
    
  2. If a JSON representation of the bundle settings is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.

If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle settings are still valid.

Step 5: Build the Python wheel and deploy the local project to the remote workspace

In this step, you build the Python wheel, deploy the built Python wheel to your remote Databricks workspace, and create the Databricks job within your workspace.

  1. From the root directory, use the Databricks CLI to run the bundle deploy command, as follows:

    databricks bundle deploy -t dev
    
  2. Check whether the locally built Python wheel was deployed: In your Databricks workspace’s sidebar, click Workspace.

  3. Click into the Users > <your-username> > .bundle > <project-name> > dev > artifacts > .internal > <random-guid> folder. The Python wheel should be in this folder.

  4. Check whether the job was created: In your Databricks workspace’s sidebar, click Workflows.

  5. On the Jobs tab, click [dev <your-username>] <project-name>_job.

  6. Click the Tasks tab. There should be one task: main_task.

If you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle settings are still valid and then redeploy the project.

Step 6: Run the deployed project

In this step, you run the Databricks job in your workspace.

  1. From the root directory, use the Databricks CLI to run the bundle run command, as follows, replacing <project-name> with the name of your project from Step 2:

    databricks bundle run -t dev <project-name>_job
    
  2. Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Databricks workspace.

  3. In your Databricks workspace, after the task completes successfully and shows a green title bar, click the main_task task to see the results.

If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle settings are still valid, redeploy the project, and run the redeployed project.

You have reached the end of the steps for creating a bundle by using a template.

Create the bundle manually

In these steps, you create the bundle from the beginning. These steps guide you to create a bundle that consists of the source files for a Python wheel, a setup.py file to build the wheel, and the definition of a Databricks job that runs the built Python wheel. You then validate and deploy the bundle, and run the deployed Python wheel from the job within your Databricks workspace.

Optionally, you might want to use an integrated development environment (IDE) that provides automatic schema suggestions and actions when working with YAML files. The following steps use Visual Studio Code with the YAML extension installed from the Visual Studio Code Marketplace.
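
To enable schema suggestions for bundle settings files, you can generate the bundle configuration JSON schema with the Databricks CLI and reference it from the top of the settings file. The yaml-language-server comment in the bundle settings file later in these steps assumes that a schema file generated this way sits next to the settings file:

    databricks bundle schema > bundle_config_schema.json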

Follow these instructions to create a sample bundle that builds a Python wheel, deploys the Python wheel, and then runs the deployed Python wheel.

If you have already built a Python wheel and just want to deploy and run it, skip ahead to specifying the Python wheel settings in the bundle settings file in Step 2.

Step 1: Set up authentication

In this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This demo uses a Databricks configuration profile and a Databricks personal access token for authentication. See Databricks personal access token authentication and Databricks configuration profiles. For additional authentication types you can use instead, see Databricks client unified authentication.

  1. From your terminal or command prompt, use the Databricks CLI to run the databricks configure command:

    databricks configure
    
  2. For Databricks Host, enter the value of your workspace URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com, and press Enter.

  3. For Personal Access Token, enter the value of your Databricks personal access token, and press Enter.
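
To confirm that the profile works before you continue, you can run any authenticated Databricks CLI command, for example:

    databricks current-user me

If authentication is set up correctly, this command prints a JSON description of your workspace user.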

Step 2: Create the bundle

A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.

  1. Create an empty folder to use as your bundle’s root. Within this root folder, create a folder named my_package.

  2. In the my_package folder, create a file named setup.py. Also create a child folder named src with the following three files: __init__.py, __main__.py, and my_module.py:

    my_package
      |-- src
      |    |-- __init__.py
      |    |-- __main__.py
      |    `-- my_module.py
      `-- setup.py
    
  3. Add the following content to the src/__init__.py file and then save the file:

    __version__ = '0.0.1'
    __author__ = '<my-author-name>'
    
    
    import sys, os
    
    sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
    

    Replace <my-author-name> with your name or the name of your organization.

  4. Add the following code to the src/__main__.py file and then save the file:

    import sys, os
    
    sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
    
    from src.my_module import *
    
    def main():
    
      first = 200
      second = 400
    
      print(f"{first} + {second} = {add_two_numbers(first, second)}")
      print(f"{second} - {first} = {subtract_two_numbers(second, first)}")
      print(f"{first} * {second} = {multiply_two_numbers(first, second)}")
      print(f"{second} / {first} = {divide_two_numbers(second, first)}")
    
    if __name__ == "__main__":
      main()
    
  5. Add the following code to the src/my_module.py file and then save the file:

    def add_two_numbers(a, b):
      return a + b
    
    def subtract_two_numbers(a, b):
      return a - b
    
    def multiply_two_numbers(a, b):
      return a * b
    
    def divide_two_numbers(a, b):
      return a / b
    
  6. Add the following code to the setup.py file and then save the file:

    from setuptools import setup, find_packages
    
    import src
    
    setup(
      name = "my_package",
      version = src.__version__,
      author = src.__author__,
      url = "https://<my-url>",
      author_email = "<my-author-name>@<my-organization>",
      description = "<my-package-description>",
      packages = find_packages(include = ["src"]),
      entry_points={"group_1": "run=src.__main__:main"},
      install_requires = ["setuptools"]
    )
    
    • Replace https://<my-url> with your organization’s URL.

    • Replace <my-author-name>@<my-organization> with your organization’s primary email contact address.

    • Replace <my-package-description> with a display description for your Python wheel.

  7. In your bundle’s root, create the project’s bundle settings file, a file named databricks.yml, and add the following code to it:

    Note

    If you have already built a Python wheel and just want to deploy and run it, then modify the following bundle settings file by omitting the artifacts mapping. The Databricks CLI will automatically deploy the files that are specified in the libraries array’s whl entries.

    # yaml-language-server: $schema=bundle_config_schema.json
    bundle:
      name: my-wheel-bundle
    
    artifacts:
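      # Instructs the Databricks CLI to build the Python wheel in ./my_package
      # each time you deploy this bundle.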
      my-wheel:
        type: whl
        path: ./my_package
    
    resources:
      jobs:
        wheel-job:
          name: wheel-job
          tasks:
            - task_key: wheel-task
              new_cluster:
                spark_version: 13.3.x-scala2.12
                node_type_id: i3.xlarge
                data_security_mode: USER_ISOLATION
                num_workers: 1
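              # entry_point must name an entry point defined in setup.py ("run"),
              # and package_name must match the name value in setup.py.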
              python_wheel_task:
                entry_point: run
                package_name: my_package
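              # The whl glob matches the wheel that the build step writes to my_package/dist.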
              libraries:
                - whl: ./my_package/dist/my_package-*.whl
    
    targets:
      dev:
        workspace:
          host: <workspace-url>
    

    Replace <workspace-url> with your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

    Note

    The available mappings for a child artifact mapping are as follows:

    • type is required, and it must be whl for Python wheel builds.

    • path is an optional, relative path from the location of the bundle settings file to the location of the setup.py file for the Python wheel. If path is not included, the Databricks CLI will attempt to find the setup.py file for the Python wheel in the bundle’s root.

    • files is an optional mapping that is not included in this example. files includes a child source mapping, which you can use to specify non-default locations to include for complex build instructions. Locations are specified as relative paths from the location of the bundle settings file.

    • build is an optional mapping that is not included in this example. build can be used to run non-default build commands. For Python wheel builds, the Databricks CLI assumes that it can find a local install of the Python wheel package to run builds, and it runs the command python setup.py bdist_wheel by default during each bundle deployment. To specify multiple build commands, separate each command with double-ampersand (&&) characters. A sketch of a custom build mapping follows these steps.

  8. If you intend to store this bundle with a Git provider, add a .gitignore file in the project’s root, and add the following entries to this file:

    .databricks
    my_package/build
    my_package/dist
    my_package/my_package.egg-info
    my_package/src/__pycache__
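
As referenced in the note earlier in this step, the following sketch shows the shape of a custom build mapping for this project. It chains two commands with &&, and the first command (installing the wheel package with pip) is an illustrative assumption rather than a required step:

    artifacts:
      my-wheel:
        type: whl
        path: ./my_package
        build: pip install wheel && python setup.py bdist_wheel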
    

Step 3: Validate the project’s bundle settings file

In this step, you check whether the bundle settings are valid.

  1. From the root directory, validate the bundle settings file:

    databricks bundle validate
    
  2. If a JSON representation of the bundle settings is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.

If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle settings are still valid.

Step 4: Build the Python wheel and deploy the local project to the remote workspace

Build the Python wheel locally, deploy the built Python wheel to your workspace, and create the job in your workspace:

databricks bundle deploy
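
If your bundle defines multiple targets, pass the -t option to pick one, for example:

databricks bundle deploy -t dev

If the wheel build fails during deployment, you can reproduce it locally by running the default build command, python setup.py bdist_wheel, from inside the my_package folder.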

Step 5: Run the deployed project

  1. Run the deployed job, which runs the deployed Python wheel:

    databricks bundle run wheel-job
    
  2. In the output, copy the Run URL and paste it into your web browser’s address bar.

  3. On the job run’s Output page, the following results appear:

    200 + 400 = 600
    400 - 200 = 200
    200 * 400 = 80000
    400 / 200 = 2.0
    

If you make any changes to your bundle after this step, you should repeat steps 3-5 to check whether your bundle settings are still valid, redeploy the project, and run the redeployed project.