Tutorial: Create your first custom Databricks Asset Bundle template

In this tutorial, you’ll create a custom Databricks Asset Bundle template and learn how to use it to automate more complex processing tasks.

Preview

This feature is in Public Preview.

The Databricks Asset Bundle workflow supports both manual and templated creation of bundles. Templated bundles come in two flavors: ones that use default bundle templates, and ones that use custom bundle templates.

The default bundle template assumes a very specific configuration for simplicity, while the custom bundle template allows you to specify:

  • Folder structures

  • Compute, build steps, and tasks

  • Tests

  • Other behaviors configurable in a DevOps infrastructure-as-code (IaC) environment

For example, if you routinely run Databricks jobs that require custom packages with a time-consuming compilation step upon installation, you can speed up your development loop by creating a bundle template that supports custom container environments.

Bundle templates define a directory structure that mirrors your intended bundle’s structure. They include a databricks_template_schema.json file that defines the necessary user-provided parameters for creating a new bundle. Let’s dive in and create a new template that builds custom container environments.
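Both kinds of templates are consumed by the same CLI command; only the argument differs. As a quick sketch (default-python is one of the built-in template names, while the local path here is hypothetical):

# Initialize a bundle from a built-in default template:
databricks bundle init default-python

# Initialize a bundle from a custom template in a local directory:
databricks bundle init ./my-custom-template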

Which bundle template do I choose?

[Figure: flow chart showing which template to choose when creating a Databricks Asset Bundle]

Before you start

  • If you haven’t already, install the Databricks CLI version 0.205 or above. If you’ve already installed it, confirm the version is 0.205 or higher by running databricks -v from a terminal.
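    If you need to install the CLI, one common method is the install script from the databricks/setup-cli repository (shown as a sketch; it assumes a Unix-like shell with curl available):

    curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh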

Create a bundle template for running a container-based job

To make your first “container-job” bundle template, do the following from a terminal that can run Databricks CLI commands:

  1. Create an empty directory named dab-container-template:

    mkdir dab-container-template
    
  2. In the directory’s root, create a file named databricks_template_schema.json. This file contains the variables that must be provided by a user at bundle creation time:

    touch dab-container-template/databricks_template_schema.json
    
  3. Add the following contents to the databricks_template_schema.json file and save it. Each variable is translated into a user prompt during bundle creation with the Databricks CLI:

    {
      "properties": {
        "project_name": {
          "type": "string",
          "default": "project_name",
          "description": "Project name",
          "order": 1
        }
      }
    }
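    The schema follows JSON Schema conventions, so you can prompt for more inputs by adding properties. As a hypothetical example, an include_tests property with an enum would constrain the user’s answer to a fixed set of choices:

    "include_tests": {
      "type": "string",
      "default": "no",
      "enum": ["yes", "no"],
      "description": "Include placeholder tests?",
      "order": 2
    }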
    
  4. In the dab-container-template directory, create a subdirectory named template, with resources and src subdirectories inside it. The template folder defines the directory structure of your generated bundles. The names of subdirectories and files follow Go template syntax when they are derived from user values; for example, a directory named {{.project_name}} is rendered using the project name a user provides.

    mkdir -p "dab-container-template/template/resources"
    mkdir -p "dab-container-template/template/src"
    
  5. In the template directory, create a file named databricks.yml.tmpl and add the following contents:

    # This is a Databricks asset bundle definition for {{.project_name}}.
    # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
    bundle:
      name: {{.project_name}}
    
    include:
      - resources/*.yml
    
    targets:
      # The 'dev' target, used for development purposes.
      # Whenever a developer deploys using 'dev', they get their own copy.
      dev:
        # We use 'mode: development' to make sure everything deployed to this target gets a prefix
        # like '[dev my_user_name]'. Setting this mode also disables any schedules and
        # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
        mode: development
        default: true
        workspace:
          host: {{workspace_host}}
    
      # The 'prod' target, used for production deployment.
      prod:
        # For production deployments, we only have a single copy, so we override the
        # workspace.root_path default of
        # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
        # to a path that is not specific to the current user.
        {{- /*
        Explaining 'mode: production' isn't as pressing as explaining
        'mode: development'. Since the other mode is covered above, users
        can consult the documentation for details on 'mode: production'.

        Using 'mode: production' enables strict checks to make sure this
        target is configured correctly.
        */}}
        mode: production
        workspace:
          host: {{workspace_host}}
          root_path: /Shared/.bundle/prod/${bundle.name}
        {{- if not is_service_principal}}
        run_as:
          # This runs as {{user_name}} in production. Alternatively,
          # a service principal could be used here using service_principal_name
          # (see Databricks documentation).
          user_name: {{user_name}}
        {{end -}}
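    In this file, {{.project_name}} is replaced with the value a user provides for the schema variable, while {{workspace_host}}, {{user_name}}, and {{is_service_principal}} are helper functions that the Databricks CLI resolves at bundle creation time. As a sketch, if a user enters my_project at the prompt, the rendered databricks.yml begins:

    # This is a Databricks asset bundle definition for my_project.
    # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
    bundle:
      name: my_project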
    
  6. Create another YAML file named {{.project_name}}_job.yml.tmpl and place it in the template/resources directory. This file lets you split your project’s job definitions from the rest of your bundle’s definition. Add the following YAML code to it to describe your project’s job and runtime:

    # The main job for {{.project_name}}
    resources:
      jobs:
        {{.project_name}}_job:
          name: {{.project_name}}_job
          tasks:
            - task_key: python_task
              job_cluster_key: job_cluster
              spark_python_task:
                python_file: ../src/{{.project_name}}/task.py
          job_clusters:
            - job_cluster_key: job_cluster
              new_cluster:
                docker_image:
                  url: databricksruntime/python:10.4-LTS
                node_type_id: i3.xlarge
                spark_version: 13.3.x-scala2.12
    

    This is where you include your custom container image. In this step, you’ve specified one of the default Databricks base images, but you can customize this base image by installing packages specific to your project.
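    As a minimal sketch, a Dockerfile for such a customized image might look like the following (my-slow-to-build-package is a placeholder for your own dependency; you would build and push this image to a registry your workspace can access, then reference it in the url field above):

    # Start from a Databricks Container Services base image
    FROM databricksruntime/python:10.4-LTS

    # Pre-install packages with slow compilation steps at image build time,
    # so clusters don't pay that cost on every launch
    RUN /databricks/python3/bin/pip install --no-cache-dir my-slow-to-build-package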

  7. Under your template/src directory, make a placeholder Python task file to run within your containerized environment. The {{.project_name}} subdirectory doesn’t exist yet, so create it first:

    mkdir -p "dab-container-template/template/src/{{.project_name}}"
    touch "dab-container-template/template/src/{{.project_name}}/task.py"
    

    Now, add the following placeholder code to that file:

    from pyspark.sql import SparkSession

    # Get or create a local SparkSession to confirm the environment works
    spark = SparkSession.builder.master('local[*]').appName('example').getOrCreate()

    print(f'Spark version: {spark.version}')
    
  8. Review the structure of your bundle template. It should be as follows:

    .
    ├── databricks_template_schema.json
    └── template
        ├── databricks.yml.tmpl
        ├── resources
        │   └── {{.project_name}}_job.yml.tmpl
        └── src
            └── {{.project_name}}
                └── task.py
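    If you have the tree utility installed, you can print this layout from the directory that contains your template:

    tree dab-container-template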
    

And your first custom bundle template is complete! To generate a bundle based on your new custom template, use the same databricks bundle init command you used before, but add your template’s location, like this:

databricks bundle init dab-container-template

With this new bundle template, you can create bundles for running containerized workflows. This is especially useful for jobs whose package installation takes a significant amount of time relative to the job’s actual run time.
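Once the CLI has prompted you for a project name and generated the bundle, you can deploy and run it with the standard bundle commands. As a sketch, assuming you entered my_project at the prompt:

cd my_project
databricks bundle deploy -t dev
databricks bundle run -t dev my_project_job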

Next steps