Add tasks to jobs in Databricks Asset Bundles

This page provides information about how to define job tasks in Databricks Asset Bundles. For information about job tasks, see Configure and edit tasks in Lakeflow Jobs.

important

The job git_source field and task source field set to GIT are not recommended for bundles, because local relative paths may not point to the same content in the Git repository. Bundles expect that a deployed job has the same files as the local copy from where it was deployed.

Instead, clone the repository locally and set up your bundle project within this repository, so that the source for tasks is the workspace.

Configure tasks

Define tasks for a job in a bundle in the tasks key for the job definition. Examples of task configuration for the available task types are in the Task settings section. For information about defining a job in a bundle, see job.

tip

To quickly generate resource configuration for an existing job using the Databricks CLI, you can use the bundle generate job command. See bundle commands.

Most job task types have task-specific parameters for setting task values, but you can also define job parameters that are passed to tasks. Dynamic value references are supported for job parameters, which enables passing values that are specific to the job run between tasks. For complete information on how to pass task values by task type, see Details by task type.

You can also override general job task settings with settings for a target workspace. See Override with target settings.

The following example configuration defines a job with two notebook tasks, and passes a task value from the first task to the second task.

YAML
resources:
  jobs:
    pass_task_values_job:
      name: pass_task_values_job
      tasks:
        # Output task
        - task_key: output_value
          notebook_task:
            notebook_path: ../src/output_notebook.ipynb

        # Input task
        - task_key: input_value
          depends_on:
            - task_key: output_value
          notebook_task:
            notebook_path: ../src/input_notebook.ipynb
            base_parameters:
              received_message: '{{tasks.output_value.values.message}}'

The output_notebook.ipynb contains the following code, which sets a task value for the message key:

Python
# Databricks notebook source
# This first task sets a simple output value.

message = "Hello from the first task"

# Set the message to be used by other tasks
dbutils.jobs.taskValues.set(key="message", value=message)

print(f"Produced message: {message}")

The input_notebook.ipynb retrieves the value of the parameter received_message, which was set in the configuration for the task:

Python
# This notebook receives the message as a parameter.

dbutils.widgets.text("received_message", "")
received_message = dbutils.widgets.get("received_message")

print(f"Received message: {received_message}")

Task settings

This section contains settings and examples for each job task type.

Clean room notebook task

The clean room notebook task runs a clean rooms notebook when the clean_rooms_notebook_task field is present. For information about clean rooms, see What is Databricks Clean Rooms?.

The following keys are available for a clean rooms notebook task. For the corresponding REST API object definition, see clean_rooms_notebook_task.

• clean_room_name (String): Required. The clean room that the notebook belongs to.
• etag (String): Checksum to validate the freshness of the notebook resource. It can be fetched by calling the clean room assets get operation.
• notebook_base_parameters (Map): Base parameters to be used for the clean room notebook job.
• notebook_name (String): Required. Name of the notebook being run.
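
Examples

For example, the following configuration sketch defines a clean room notebook task with the two required fields. The clean room and notebook names are placeholders for assets that must already exist.

YAML
resources:
  jobs:
    my-clean-room-job:
      name: my-clean-room-job
      tasks:
        - task_key: my-clean-room-notebook-task
          clean_rooms_notebook_task:
            clean_room_name: my-clean-room
            notebook_name: my-clean-room-notebook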

Condition task

The condition_task enables you to add a task with if/else conditional logic to your job. The task evaluates a condition that can be used to control the execution of other tasks. The condition task does not require a cluster to execute and does not support retries or notifications. For more information about the if/else condition task, see Add branching logic to a job with the If/else task.

The following keys are available for a condition task. For the corresponding REST API object definition, see condition_task.

• left (String): Required. The left operand of the condition. Can be a string value, a job state, or a dynamic value reference such as {{job.repair_count}} or {{tasks.task_key.values.output}}.
• op (String): Required. The operator to use for comparison. Valid values are: EQUAL_TO, NOT_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL.
• right (String): Required. The right operand of the condition. Can be a string value, a job state, or a dynamic value reference.

Examples

The following example contains a condition task and a notebook task, where the notebook task only executes if the number of job repairs is less than 5.

YAML
resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: condition_task
          condition_task:
            op: LESS_THAN
            left: '{{job.repair_count}}'
            right: '5'
        - task_key: notebook_task
          depends_on:
            - task_key: condition_task
              outcome: 'true'
          notebook_task:
            notebook_path: ../src/notebook.ipynb

Dashboard task

You use this task to refresh a dashboard and send a snapshot to subscribers. For more information about dashboards in bundles, see dashboard.

The following keys are available for a dashboard task. For the corresponding REST API object definition, see dashboard_task.

• dashboard_id (String): Required. The identifier of the dashboard to be refreshed. The dashboard must already exist.
• subscription (Map): The subscription configuration for sending the dashboard snapshot. Each subscription object can specify destination settings for where to send snapshots after the dashboard refresh completes. See subscription.
• warehouse_id (String): The warehouse ID to execute the dashboard with for the schedule. If not specified, the default warehouse of the dashboard will be used.

Examples

The following example adds a dashboard task to a job. When the job is run, the dashboard with the specified ID is refreshed.

YAML
resources:
  jobs:
    my-dashboard-job:
      name: my-dashboard-job
      tasks:
        - task_key: my-dashboard-task
          dashboard_task:
            dashboard_id: 11111111-1111-1111-1111-111111111111

dbt task

You use this task to run one or more dbt commands. For more information about dbt, see Connect to dbt Cloud.

The following keys are available for a dbt task. For the corresponding REST API object definition, see dbt_task.

• catalog (String): The name of the catalog to use. The catalog value can only be specified if a warehouse_id is specified. This field requires dbt-databricks >= 1.1.1.
• commands (Sequence): Required. A list of dbt commands to execute in sequence. Each command must be a complete dbt command (for example, dbt deps, dbt seed, dbt run, dbt test). Up to 10 commands can be provided.
• profiles_directory (String): The path to the directory containing the dbt profiles.yml file. Can only be specified if no warehouse_id is specified. If no warehouse_id is specified and this folder is unset, the root directory is used.
• project_directory (String): The path to the directory containing the dbt project. If not specified, defaults to the root of the repository or workspace directory. For projects stored in the Databricks workspace, the path must be absolute and begin with a slash. For projects in a remote repository, the path must be relative.
• schema (String): The schema to write to. This parameter is only used when a warehouse_id is also provided. If not provided, the default schema is used.
• source (String): The location type of the dbt project. Valid values are WORKSPACE and GIT. When set to WORKSPACE, the project will be retrieved from the Databricks workspace. When set to GIT, the project will be retrieved from a Git repository defined in git_source. If empty, the task uses GIT if git_source is defined and WORKSPACE otherwise.

Examples

The following example adds a dbt task to a job. This dbt task uses the specified SQL warehouse to run the specified dbt commands.

To get a SQL warehouse's ID, open the SQL warehouse's settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

tip

Databricks Asset Bundles also includes a dbt-sql project template that defines a job with a dbt task, as well as dbt profiles for deployed dbt jobs. For information about Databricks Asset Bundles templates, see Default bundle templates.

YAML
resources:
  jobs:
    my-dbt-job:
      name: my-dbt-job
      tasks:
        - task_key: my-dbt-task
          dbt_task:
            commands:
              - 'dbt deps'
              - 'dbt seed'
              - 'dbt run'
            project_directory: /Users/someone@example.com/Testing
            warehouse_id: 1a111111a1111aa1
          libraries:
            - pypi:
                package: 'dbt-databricks>=1.0.0,<2.0.0'
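
When a warehouse_id is set, you can also direct dbt output to a specific catalog and schema. The following sketch extends the previous task with illustrative catalog and schema values:

YAML
resources:
  jobs:
    my-dbt-job:
      name: my-dbt-job
      tasks:
        - task_key: my-dbt-task
          dbt_task:
            commands:
              - 'dbt deps'
              - 'dbt run'
            project_directory: /Users/someone@example.com/Testing
            catalog: main
            schema: dbt_demo
            warehouse_id: 1a111111a1111aa1
          libraries:
            - pypi:
                package: 'dbt-databricks>=1.0.0,<2.0.0'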

For each task

The for_each_task enables you to add a task with a for each loop to your job. The task executes a nested task for every input provided. For more information about the for_each_task, see Use a For each task to run another task in a loop.

The following keys are available for a for_each_task. For the corresponding REST API object definition, see for_each_task.

• concurrency (Integer): The maximum number of task iterations that can run concurrently. If not specified, all iterations may run in parallel, subject to cluster and workspace limits.
• inputs (String): Required. The input data for the loop. This can be a JSON string or a reference to an array parameter. Each element in the array will be passed to one iteration of the nested task.
• task (Map): Required. The nested task definition to execute for each input. This object contains the complete task specification, including task_key and the task type (for example, notebook_task or python_wheel_task).

Examples

The following example adds a for_each_task to a job, which loops over the values set by another task and processes them.

YAML
resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: generate_countries_list
          notebook_task:
            notebook_path: ../src/generate_countries_list.ipynb
        - task_key: process_countries
          depends_on:
            - task_key: generate_countries_list
          for_each_task:
            inputs: '{{tasks.generate_countries_list.values.countries}}'
            task:
              task_key: process_countries_iteration
              notebook_task:
                notebook_path: ../src/process_countries_notebook.ipynb
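
The inputs field can also be a literal JSON array, and concurrency caps the number of parallel iterations. The following sketch processes three hard-coded values, at most two at a time; the values are illustrative.

YAML
resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: process_countries
          for_each_task:
            inputs: '["US", "CA", "MX"]'
            concurrency: 2
            task:
              task_key: process_countries_iteration
              notebook_task:
                notebook_path: ../src/process_countries_notebook.ipynb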

JAR task

You use this task to run a JAR. You can reference local JAR libraries or those in a workspace, a Unity Catalog volume, or an external cloud storage location. See JAR file (Java or Scala).

For details on how to compile and deploy Scala JAR files on a Unity Catalog-enabled cluster in standard access mode, see Deploy Scala JARs on Unity Catalog clusters.

The following keys are available for a JAR task. For the corresponding REST API object definition, see spark_jar_task.

• jar_uri (String): Deprecated. The URI of the JAR to be executed. DBFS and cloud storage paths are supported. This field should not be used; instead, use the libraries field to specify JAR dependencies.
• main_class_name (String): Required. The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. The code must use SparkContext.getOrCreate to obtain a Spark context; otherwise, runs of the job fail.
• parameters (Sequence): The parameters passed to the main method. Use task parameter variables to set parameters containing information about job runs.

Examples

The following example adds a JAR task to a job. The path for the JAR is a Unity Catalog volume location.

YAML
resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar
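
To pass arguments to the main method, add parameters to the spark_jar_task. The following sketch passes an illustrative input path plus the job run ID as a task parameter variable:

YAML
resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
            parameters:
              - '/Volumes/main/default/my-volume/input'
              - '{{job.run_id}}'
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar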

Notebook task

You use this task to run a notebook. See Notebook task for jobs.

The following keys are available for a notebook task. For the corresponding REST API object definition, see notebook_task.

• base_parameters (Map): The base parameters to use for each run of this job.
  • If the run is initiated by a call to jobs or run-now with parameters specified, the two parameters maps are merged.
  • If the same key is specified in base_parameters and in run-now, the value from run-now is used. Use task parameter variables to set parameters containing information about job runs.
  • If the notebook takes a parameter that is not specified in the job's base_parameters or the run-now override parameters, the default value from the notebook is used. Retrieve these parameters in a notebook using dbutils.widgets.get.
• notebook_path (String): Required. The path of the notebook in the Databricks workspace or remote repository, for example /Users/user.name@databricks.com/notebook_to_run. For notebooks stored in the Databricks workspace, the path must be absolute and begin with a slash. For notebooks stored in a remote repository, the path must be relative.
• source (String): Location type of the notebook. Valid values are WORKSPACE and GIT. When set to WORKSPACE, the notebook will be retrieved from the local Databricks workspace. When set to GIT, the notebook will be retrieved from a Git repository defined in git_source. If the value is empty, the task will use GIT if git_source is defined and WORKSPACE otherwise.
• warehouse_id (String): The ID of the warehouse to run the notebook on. Classic SQL warehouses are not supported; use serverless or pro SQL warehouses instead. Note that SQL warehouses only support SQL cells. If the notebook contains non-SQL cells, the run will fail, so if you need to use Python (or another language) in a cell, use serverless compute instead.

Examples

The following example adds a notebook task to a job and sets a job parameter named my_job_run_id. The path for the notebook to deploy is relative to the configuration file in which this task is declared. The task gets the notebook from its deployed location in the Databricks workspace.

YAML
resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      parameters:
        - name: my_job_run_id
          default: '{{job.run_id}}'
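
To run a SQL-only notebook on a SQL warehouse instead of a cluster, set warehouse_id on the notebook task. The following sketch uses a placeholder warehouse ID and assumes the notebook contains only SQL cells:

YAML
resources:
  jobs:
    my-sql-notebook-job:
      name: my-sql-notebook-job
      tasks:
        - task_key: my-sql-notebook-task
          notebook_task:
            notebook_path: ./my-sql-notebook.ipynb
            warehouse_id: 1a111111a1111aa1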

Pipeline task

You use this task to run a pipeline. See Lakeflow Declarative Pipelines.

The following keys are available for a pipeline task. For the corresponding REST API object definition, see pipeline_task.

• full_refresh (Boolean): If true, a full refresh of the pipeline will be triggered, which completely recomputes all datasets in the pipeline. If false or omitted, only incremental data will be processed. For details, see Pipeline refresh semantics.
• pipeline_id (String): Required. The ID of the pipeline to run. The pipeline must already exist.

Examples

The following example adds a pipeline task to a job. This task runs the specified pipeline.

tip

You can get a pipeline's ID by opening the pipeline in the workspace and copying the Pipeline ID value on the Pipeline details tab of the pipeline's settings page.

YAML
resources:
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            pipeline_id: 11111111-1111-1111-1111-111111111111
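
To recompute all datasets in the pipeline on each run instead of processing only incremental data, set full_refresh, as in the following sketch:

YAML
resources:
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            pipeline_id: 11111111-1111-1111-1111-111111111111
            full_refresh: true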

Power BI task

Preview

The Power BI task type is in Public Preview.

Use this task to trigger a refresh of a Power BI semantic model (formerly known as a dataset).

The following keys are available for a Power BI task. For the corresponding REST API object definition, see power_bi_task.

• connection_resource_name (String): Required. The name of the Unity Catalog connection used to authenticate from Databricks to Power BI.
• power_bi_model (Map): Required. The Power BI semantic model (dataset) to update.
• refresh_after_update (Boolean): Whether to refresh the Power BI semantic model after the update completes. Defaults to false.
• tables (Sequence): A list of tables (each as a Map) to be exported to Power BI. See tables.
• warehouse_id (String): The ID of the SQL warehouse to use as the Power BI datasource.

Examples

The following example defines a Power BI task, which specifies a connection, the Power BI model to update, and the Databricks table to export.

YAML
resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: power_bi_task
          power_bi_task:
            connection_resource_name: 'connection_name'
            power_bi_model:
              workspace_name: 'workspace_name'
              model_name: 'model_name'
              storage_mode: 'DIRECT_QUERY'
              authentication_method: 'OAUTH'
              overwrite_existing: false
            refresh_after_update: false
            tables:
              - catalog: 'main'
                schema: 'tpch'
                name: 'customers'
                storage_mode: 'DIRECT_QUERY'
            warehouse_id: '1a111111a1111aa1'

Python script task

You use this task to run a Python file.

The following keys are available for a Python script task. For the corresponding REST API object definition, see spark_python_task.

• parameters (Sequence): The parameters to pass to the Python file. Use task parameter variables to set parameters containing information about job runs.
• python_file (String): Required. The URI of the Python file to be executed, for example /Users/someone@example.com/my-script.py. For Python files stored in the Databricks workspace, the path must be absolute and begin with /. For files stored in a remote repository, the path must be relative. This field does not support dynamic value references such as variables.
• source (String): The location type of the Python file. Valid values are WORKSPACE and GIT. When set to WORKSPACE, the file will be retrieved from the local Databricks workspace. When set to GIT, the file will be retrieved from a Git repository defined in git_source. If the value is empty, the task will use GIT if git_source is defined and WORKSPACE otherwise.

Examples

The following example adds a Python script task to a job. The path for the Python file to deploy is relative to the configuration file in which this task is declared. The task gets the Python file from its deployed location in the Databricks workspace.

YAML
resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job
      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py
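
To pass arguments to the script, add parameters to the spark_python_task. The following sketch passes an illustrative flag along with the job run ID as a dynamic value reference:

YAML
resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job
      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py
            parameters:
              - '--run-id'
              - '{{job.run_id}}'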

Python wheel task

You use this task to run a Python wheel. See Build a Python wheel file using Databricks Asset Bundles.

The following keys are available for a Python wheel task. For the corresponding REST API object definition, see python_wheel_task.

• entry_point (String): Required. The named entry point to execute: a function or class. If it does not exist in the metadata of the package, the function is executed from the package directly using $packageName.$entryPoint().
• named_parameters (Map): The named parameters to pass to the Python wheel task, also known as keyword arguments. A named parameter is a key-value pair with a string key and a string value. parameters and named_parameters cannot both be specified. If named_parameters is specified, the parameters are passed as keyword arguments to the entry point function.
• package_name (String): Required. The name of the Python package to execute. All dependencies must be installed in the environment. This does not check for or install any package dependencies.
• parameters (Sequence): The parameters to pass to the Python wheel task, also known as positional arguments. Each parameter is a string. If specified, named_parameters must not be specified.

Examples

The following example adds a Python wheel task to a job. The path for the Python wheel file to deploy is relative to the configuration file in which this task is declared. See Databricks Asset Bundles library dependencies.

YAML
resources:
jobs:
my-python-wheel-job:
name: my-python-wheel-job
tasks:
- task_key: my-python-wheel-task
python_wheel_task:
entry_point: run
package_name: my_package
libraries:
- whl: ./my_package/dist/my_package-*.whl
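
To pass keyword arguments to the entry point, add named_parameters (or use parameters for positional arguments, but not both). The following sketch assumes the entry point accepts a data_path keyword argument:

YAML
resources:
  jobs:
    my-python-wheel-job:
      name: my-python-wheel-job
      tasks:
        - task_key: my-python-wheel-task
          python_wheel_task:
            entry_point: run
            package_name: my_package
            named_parameters:
              data_path: /Volumes/main/default/my-volume/data
          libraries:
            - whl: ./my_package/dist/my_package-*.whl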

Run job task

You use this task to run another job.

The following keys are available for a run job task. For the corresponding REST API object definition, see run_job_task.

• job_id (Integer): Required. The ID of the job to run. The job must already exist in the workspace.
• job_parameters (Map): Job-level parameters to pass to the job being run. These parameters are accessible within the job's tasks.
• pipeline_params (Map): Parameters for the pipeline task. Used only if the job being run contains a pipeline task. Can include full_refresh to trigger a full refresh of the pipeline.

Examples

In the following example, the second job contains a run job task that runs the first job.

This example uses a substitution to retrieve the ID of the job to run. To get a job's ID from the UI, open the job in the workspace and copy the ID from the Job ID value on the Job details tab of the job's settings page.

YAML
resources:
jobs:
my-first-job:
name: my-first-job
tasks:
- task_key: my-first-job-task
new_cluster:
spark_version: '13.3.x-scala2.12'
node_type_id: 'i3.xlarge'
num_workers: 2
notebook_task:
notebook_path: ./src/test.py
my_second_job:
name: my-second-job
tasks:
- task_key: my-second-job-task
run_job_task:
job_id: ${resources.jobs.my-first-job.id}
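
To pass values to the job being run, add job_parameters to the run_job_task. The following sketch passes an illustrative env parameter that the first job's tasks can then reference:

YAML
resources:
  jobs:
    my_second_job:
      name: my-second-job
      tasks:
        - task_key: my-second-job-task
          run_job_task:
            job_id: ${resources.jobs.my-first-job.id}
            job_parameters:
              env: production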

SQL task

You use this task to run a SQL file, query, or alert.

The following keys are available for a SQL task. For the corresponding REST API object definition, see sql_task.

• alert (Map): Configuration for running a SQL alert. Contains:
  • alert_id (String): Required. The canonical identifier of the SQL alert to run.
  • pause_subscriptions (Boolean): Whether to pause alert subscriptions.
  • subscriptions (Sequence): List of subscription settings.
• dashboard (Map): Configuration for refreshing a SQL dashboard. Contains:
  • dashboard_id (String): Required. The canonical identifier of the SQL dashboard to refresh.
  • custom_subject (String): Custom subject for the email sent to dashboard subscribers.
  • pause_subscriptions (Boolean): Whether to pause dashboard subscriptions.
  • subscriptions (Sequence): List of subscription settings.
• file (Map): Configuration for running a SQL file. Contains:
  • path (String): Required. The path of the SQL file in the workspace or remote repository. For files stored in the Databricks workspace, the path must be absolute and begin with a slash. For files stored in a remote repository, the path must be relative.
  • source (String): The location type of the SQL file. Valid values are WORKSPACE and GIT.
• parameters (Map): Parameters to be used for each run of this task. SQL queries and files can use these parameters by referencing them with the syntax {{parameter_key}}. Use task parameter variables to set parameters containing information about job runs.
• query (Map): Configuration for running a SQL query. Contains:
  • query_id (String): Required. The canonical identifier of the SQL query to run.
• warehouse_id (String): Required. The ID of the SQL warehouse to use to run the SQL task. The SQL warehouse must already exist.

Examples

tip

To get a SQL warehouse's ID, open the SQL warehouse's settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

The following example adds a SQL file task to a job. This SQL file task uses the specified SQL warehouse to run the specified SQL file.

YAML
resources:
  jobs:
    my-sql-file-job:
      name: my-sql-file-job
      tasks:
        - task_key: my-sql-file-task
          sql_task:
            file:
              path: /Users/someone@example.com/hello-world.sql
              source: WORKSPACE
            warehouse_id: 1a111111a1111aa1

The following example adds a SQL alert task to a job. This SQL alert task uses the specified SQL warehouse to refresh the specified SQL alert.

YAML
resources:
jobs:
my-sql-file-job:
name: my-sql-alert-job
tasks:
- task_key: my-sql-alert-task
sql_task:
warehouse_id: 1a111111a1111aa1
alert:
alert_id: 11111111-1111-1111-1111-111111111111

The following example adds a SQL query task to a job. This SQL query task uses the specified SQL warehouse to run the specified SQL query.

YAML
resources:
jobs:
my-sql-query-job:
name: my-sql-query-job
tasks:
- task_key: my-sql-query-task
sql_task:
warehouse_id: 1a111111a1111aa1
query:
query_id: 11111111-1111-1111-1111-111111111111
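
To pass values into the SQL being run, add the parameters map to the sql_task and reference each key in the query or file with the {{parameter_key}} syntax, for example {{max_date}}. The following sketch uses an illustrative parameter name and value:

YAML
resources:
  jobs:
    my-sql-query-job:
      name: my-sql-query-job
      tasks:
        - task_key: my-sql-query-task
          sql_task:
            warehouse_id: 1a111111a1111aa1
            parameters:
              max_date: '2024-01-01'
            query:
              query_id: 11111111-1111-1111-1111-111111111111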

Other task settings

The following task settings allow you to configure behaviors for all tasks. For the corresponding REST API object definitions, see tasks.

• compute_key (String): The key of the compute resource to use for this task. If specified, new_cluster, existing_cluster_id, and job_cluster_key cannot be specified.
• depends_on (Sequence): An optional list of task dependencies. Each item contains:
  • task_key (String): Required. The key of the task this task depends on.
  • outcome (String): Can be specified only for condition_task. If specified, the dependent task will only run if the condition evaluates to the specified outcome (either true or false).
• description (String): An optional description for the task.
• disable_auto_optimization (Boolean): Whether to disable automatic optimization for this task. If true, automatic optimizations like adaptive query execution will be disabled.
• email_notifications (Map): An optional set of email addresses to notify when a run begins, completes, or fails. Contains:
  • on_start (Sequence): List of email addresses to notify when a run starts.
  • on_success (Sequence): List of email addresses to notify when a run completes successfully.
  • on_failure (Sequence): List of email addresses to notify when a run fails.
  • on_duration_warning_threshold_exceeded (Sequence): List of email addresses to notify when run duration exceeds the threshold.
  • on_streaming_backlog_exceeded (Sequence): List of email addresses to notify when any streaming backlog thresholds are exceeded for any stream.
• environment_key (String): The key of an environment defined in the job's environments configuration. Used to specify environment-specific settings. This field is required for Python script, Python wheel, and dbt tasks when using serverless compute.
• existing_cluster_id (String): The ID of an existing cluster that will be used for all runs of this task.
• health (Map): An optional specification for health monitoring of this task that includes a rules key, which is a list of health rules to evaluate.
• job_cluster_key (String): The key of a job cluster defined in the job's job_clusters configuration.
• libraries (Sequence): An optional list of libraries to be installed on the cluster that will execute the task. Each library is specified as a map with keys like jar, egg, whl, pypi, maven, cran, or requirements.
• max_retries (Integer): An optional maximum number of times to retry the task if it fails. If not specified, the task is not retried.
• min_retry_interval_millis (Integer): An optional minimal interval in milliseconds between the start of the failed run and the subsequent retry run. If not specified, the default is 0 (immediate retry).
• new_cluster (Map): A specification for a new cluster to be created for each run of this task. See cluster.
• notification_settings (Map): Optional notification settings for this task. Contains:
  • no_alert_for_skipped_runs (Boolean): If true, do not send notifications for skipped runs.
  • no_alert_for_canceled_runs (Boolean): If true, do not send notifications for canceled runs.
  • alert_on_last_attempt (Boolean): If true, send notifications only on the last retry attempt.
• retry_on_timeout (Boolean): An optional policy to specify whether to retry the task when it times out. If not specified, defaults to false.
• run_if (String): An optional value indicating the condition under which the task should run. Valid values are:
  • ALL_SUCCESS (default): Run if all dependencies succeed.
  • AT_LEAST_ONE_SUCCESS: Run if at least one dependency succeeds.
  • NONE_FAILED: Run if no dependencies have failed.
  • ALL_DONE: Run when all dependencies complete, regardless of outcome.
  • AT_LEAST_ONE_FAILED: Run if at least one dependency fails.
  • ALL_FAILED: Run if all dependencies fail.
• task_key (String): Required. A unique name for the task. This field is used to refer to this task from other tasks using the depends_on field.
• timeout_seconds (Integer): An optional timeout applied to each run of this task. A value of 0 means no timeout. If not set, the default timeout from the cluster configuration is used.
• webhook_notifications (Map): An optional set of system destinations to notify when a run begins, completes, or fails. Contains:
  • on_start (Sequence): List of notification destinations when a run starts.
  • on_success (Sequence): List of notification destinations when a run completes.
  • on_failure (Sequence): List of notification destinations when a run fails.
  • on_duration_warning_threshold_exceeded (Sequence): List of notification destinations when run duration exceeds the threshold.
  • on_streaming_backlog_exceeded (Sequence): List of notification destinations to notify when any streaming backlog thresholds are exceeded for any stream.
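
Examples

The following configuration sketch combines several of these settings on a job with two notebook tasks: a shared job cluster, a dependency with run_if, a library, retries, a timeout, and an email notification. The cluster specification, paths, package name, and email address are placeholders.

YAML
resources:
  jobs:
    my-configured-job:
      name: my-configured-job
      job_clusters:
        - job_cluster_key: my-job-cluster
          new_cluster:
            spark_version: '13.3.x-scala2.12'
            node_type_id: 'i3.xlarge'
            num_workers: 2
      tasks:
        - task_key: prepare_data
          job_cluster_key: my-job-cluster
          notebook_task:
            notebook_path: ../src/prepare_data.ipynb
        - task_key: process_data
          depends_on:
            - task_key: prepare_data
          run_if: ALL_SUCCESS
          job_cluster_key: my-job-cluster
          notebook_task:
            notebook_path: ../src/process_data.ipynb
          libraries:
            - pypi:
                package: 'pyyaml'
          max_retries: 2
          min_retry_interval_millis: 60000
          timeout_seconds: 3600
          email_notifications:
            on_failure:
              - someone@example.com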