Jobs API updates

Preview

This article discusses API updates that support the orchestration of multiple tasks using Databricks jobs, a feature in Public Preview. For information about creating, running, and managing single-task jobs using the generally available Jobs API, see Jobs API.

You can now orchestrate multiple tasks with Databricks jobs. This article details changes to the Jobs API that support jobs with multiple tasks and provides guidance to help you update your existing API clients to work with this new feature.

This article refers to jobs defined with a single task as single-task format and jobs defined with multiple tasks as multi-task format.

Requirements

An administrator must enable support for jobs with multiple tasks in the Databricks admin console.

API changes

The Jobs API now defines the TaskSettings data structure to capture the settings for each task in a job. For multi-task format jobs, JobSettings includes a tasks field, an array of TaskSettings data structures. Some fields that were previously part of JobSettings (for example, cluster specifications) are now part of the task settings for multi-task format jobs. JobSettings is also updated to include the format field, a STRING value set to either SINGLE_TASK or MULTI_TASK that indicates the format of the job.

You must update your existing API clients to handle these changes to JobSettings for multi-task format jobs. See the API client guide for more information on the required changes.

An example JSON document representing a multi-task format job:

{
  "job_id": 53,
  "settings": {
    "name": "A job with multiple tasks",
    "email_notifications": {},
    "timeout_seconds": 0,
    "max_concurrent_runs": 1,
    "tasks": [
      {
        "task_key": "clean_data",
        "description": "Clean and prepare the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/clean-data"
        },
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      },
      {
        "task_key": "analyze_data",
        "description": "Perform an analysis of the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/analyze-data"
        },
        "depends_on": [
          {
            "task_key": "clean_data"
          }
        ],
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      }
    ],
    "format": "MULTI_TASK"
  },
  "created_time": 1625841911296,
  "creator_user_name": "user@databricks.com",
  "run_as_user_name": "user@databricks.com"
}

For single-task format jobs, the JobSettings data structure is unchanged except for the addition of the format field. No TaskSettings array is included, and task settings remain defined at the top level of the JobSettings data structure. You do not need to change existing API clients to process single-task format jobs.

An example JSON document representing a single-task format job:

{
  "job_id": 27,
  "settings": {
    "name": "Example notebook",
    "existing_cluster_id": "1201-my-cluster",
    "libraries": [
      {
        "jar": "dbfs:/FileStore/jars/spark_examples.jar"
      }
    ],
    "email_notifications": {},
    "timeout_seconds": 0,
    "schedule": {
      "quartz_cron_expression": "0 0 0 * * ?",
      "timezone_id": "US/Pacific",
      "pause_status": "UNPAUSED"
    },
    "notebook_task": {
      "notebook_path": "/notebooks/example-notebook",
      "revision_timestamp": 0
    },
    "max_concurrent_runs": 1,
    "format": "SINGLE_TASK"
  },
  "created_time": 1504128821443,
  "creator_user_name": "user@databricks.com"
}

API client guide

This section provides guidelines, examples, and required changes for API calls affected by the new multi-task format feature.

Create

To create a single-task format job through the Create API call, you do not need to change existing clients.

To create a multi-task format job, use the tasks field in JobSettings to specify settings for each task. The following example creates a job with two notebook tasks:

Note

A maximum of 100 tasks can be specified per job.

{
  "name": "Multi-task-job",
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "clean_data",
      "description": "Clean and prepare the data",
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/clean-data"
      },
      "existing_cluster_id": "1201-my-cluster",
      "timeout_seconds": 3600,
      "max_retries": 3,
      "retry_on_timeout": true
    },
    {
      "task_key": "analyze_data",
      "description": "Perform an analysis of the data",
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/analyze-data"
      },
      "depends_on": [
        {
          "task_key": "clean_data"
        }
      ],
      "existing_cluster_id": "1201-my-cluster",
      "timeout_seconds": 3600,
      "max_retries": 3,
      "retry_on_timeout": true
    }
  ]
}
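If you create jobs programmatically, a minimal Python sketch of this request follows. The workspace URL, the token, and the /api/2.1/jobs/create endpoint path are assumptions; confirm the API version your workspace exposes for this preview.

import requests

# Assumptions: replace with your workspace URL and a valid personal access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_settings = {
    "name": "Multi-task-job",
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "clean_data",
            "notebook_task": {"notebook_path": "/Users/user@databricks.com/clean-data"},
            "existing_cluster_id": "1201-my-cluster",
        },
        {
            "task_key": "analyze_data",
            "notebook_task": {"notebook_path": "/Users/user@databricks.com/analyze-data"},
            "depends_on": [{"task_key": "clean_data"}],
            "existing_cluster_id": "1201-my-cluster",
        },
    ],
}

# The endpoint path is an assumption; adjust it to the Jobs API version you use.
response = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_settings,
)
response.raise_for_status()
print("Created job:", response.json()["job_id"])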

Runs submit

To submit a one-time run of a single-task format job with the Runs submit API call, you do not need to change existing clients.

To submit a one-time run of a multi-task format job, use the tasks field in JobSettings to specify settings for each task. See Create for an example JobSettings specifying multiple tasks.
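A sketch of a one-time submission follows, again treating the endpoint path and credentials as assumptions; the run_name field follows the generally available Runs submit call.

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"  # assumption

# A one-time run with a single task; extend the tasks array as in the Create example.
payload = {
    "run_name": "one-time-multi-task-run",
    "tasks": [
        {
            "task_key": "clean_data",
            "notebook_task": {"notebook_path": "/Users/user@databricks.com/clean-data"},
            "existing_cluster_id": "1201-my-cluster",
        }
    ],
}

# The endpoint path is an assumption; adjust it to the Jobs API version you use.
response = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()
print("Submitted run:", response.json()["run_id"])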

Update

To update a single-task format job with the Update API call, you do not need to change existing clients.

To update the settings of a multi-task format job, you must use the unique task_key field to identify new task settings. See Create for an example JobSettings specifying multiple tasks.
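For example, the following sketch updates the notebook path of a single task, identified by its task_key. The payload shape (job_id plus new_settings) follows the generally available Update call; treat it, the endpoint path, and the replacement notebook path as assumptions to verify against your workspace.

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"  # assumption

# Update the clean_data task; task_key identifies which task to change.
payload = {
    "job_id": 53,
    "new_settings": {
        "tasks": [
            {
                "task_key": "clean_data",
                "notebook_task": {
                    # Hypothetical replacement notebook path.
                    "notebook_path": "/Users/user@databricks.com/clean-data-v2"
                },
                "existing_cluster_id": "1201-my-cluster",
            }
        ]
    },
}

response = requests.post(
    f"{HOST}/api/2.1/jobs/update",  # endpoint path is an assumption
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()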

Reset

To overwrite the settings of a single-task format job with the Reset API call, you do not need to change existing clients.

To overwrite the settings of a multi-task format job, specify a JobSettings data structure with an array of TaskSettings data structures. See Create for an example JobSettings specifying multiple tasks.
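Because Reset overwrites all settings, the request must carry the complete JobSettings, including every task. A sketch, with the same endpoint-path caveat as above:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"  # assumption

# Reset replaces the entire JobSettings, so supply the full tasks array.
payload = {
    "job_id": 53,
    "new_settings": {
        "name": "A job with multiple tasks",
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "clean_data",
                "notebook_task": {"notebook_path": "/Users/user@databricks.com/clean-data"},
                "existing_cluster_id": "1201-my-cluster",
            },
            {
                "task_key": "analyze_data",
                "notebook_task": {"notebook_path": "/Users/user@databricks.com/analyze-data"},
                "depends_on": [{"task_key": "clean_data"}],
                "existing_cluster_id": "1201-my-cluster",
            },
        ],
    },
}

response = requests.post(
    f"{HOST}/api/2.1/jobs/reset",  # endpoint path is an assumption
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()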

List

For single-task format jobs, no client changes are required to process the response from the List API call.

For multi-task format jobs, cluster configuration and other settings are defined at the task level rather than the job level, and the Job structures returned by the List API call do not include them. To access cluster or task settings for a multi-task format job:

  • Parse the job_id field for the multi-task format job.
  • Pass the job_id to the Get API call to retrieve job details. See Get for an example response from the Get API call for a multi-task format job.

The following example shows a response containing single-task and multi-task format jobs:

{
  "jobs": [
    {
      "job_id": 36,
      "settings": {
        "name": "A job with a single task",
        "existing_cluster_id": "1201-my-cluster",
        "email_notifications": {},
        "timeout_seconds": 0,
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/example-notebook",
          "revision_timestamp": 0
        },
        "max_concurrent_runs": 1,
        "format": "SINGLE_TASK"
      },
      "created_time": 1505427148390,
      "creator_user_name": "user@databricks.com"
    },
    {
      "job_id": 53,
      "settings": {
        "name": "A job with multiple tasks",
        "email_notifications": {},
        "timeout_seconds": 0,
        "max_concurrent_runs": 1,
        "format": "MULTI_TASK"
      },
      "created_time": 1625841911296,
      "creator_user_name": "user@databricks.com"
    }
  ]
}
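A sketch of this two-step pattern follows, assuming the /api/2.1/jobs/list and /api/2.1/jobs/get endpoint paths:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"  # assumption
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

jobs = requests.get(f"{HOST}/api/2.1/jobs/list", headers=HEADERS).json().get("jobs", [])

for job in jobs:
    if job["settings"].get("format") == "MULTI_TASK":
        # Task settings are absent from the List response; fetch them with Get.
        detail = requests.get(
            f"{HOST}/api/2.1/jobs/get",
            headers=HEADERS,
            params={"job_id": job["job_id"]},
        ).json()
        tasks = detail["settings"].get("tasks", [])
    else:
        # Single-task format: settings remain at the top level of JobSettings.
        tasks = [job["settings"]]
    print(job["job_id"], len(tasks), "task(s)")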

Get

For single-task format jobs, no client changes are required to process the response from the Get API call.

Multi-task format jobs return an array of task data structures containing the task settings. If you require access to cluster or task-level details, modify your clients to iterate through the tasks array and extract the required fields.

The following shows an example response from the Get API call for a multi-task format job:

{
  "job_id": 53,
  "settings": {
    "name": "A job with multiple tasks",
    "email_notifications": {},
    "timeout_seconds": 0,
    "max_concurrent_runs": 1,
    "tasks": [
      {
        "task_key": "clean_data",
        "description": "Clean and prepare the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/clean-data"
        },
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      },
      {
        "task_key": "analyze_data",
        "description": "Perform an analysis of the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/analyze-data"
        },
        "depends_on": [
          {
            "task_key": "clean_data"
          }
        ],
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      }
    ],
    "format": "MULTI_TASK"
  },
  "created_time": 1625841911296,
  "creator_user_name": "user@databricks.com",
  "run_as_user_name": "user@databricks.com"
}
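For example, a client that previously read existing_cluster_id from the top level of JobSettings could be adapted along these lines; this is a sketch written against the response shape shown above:

def task_clusters(job):
    """Map each task_key to its cluster, given a parsed Get response."""
    clusters = {}
    for task in job["settings"].get("tasks", []):
        # Each task carries either an existing cluster ID or a new_cluster spec.
        clusters[task["task_key"]] = task.get("existing_cluster_id") or task.get("new_cluster")
    return clusters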

Runs get

For single-task format jobs, no client changes are required to process the response from the Runs get API call.

The response for a multi-task format job run contains an array of tasks. To retrieve run results for each task:

  • Iterate through each of the tasks.
  • Parse the run_id for each task.
  • Call Runs get output with the run_id to get details on the run for each task.

An example response from the Runs get request:

{
  "job_id": 53,
  "run_id": 759600,
  "number_in_job": 7,
  "original_attempt_run_id": 759600,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS",
    "state_message": ""
  },
  "cluster_spec": {},
  "start_time": 1595943854860,
  "setup_duration": 0,
  "execution_duration": 0,
  "cleanup_duration": 0,
  "trigger": "ONE_TIME",
  "creator_user_name": "user@databricks.com",
  "run_name": "Query logs",
  "run_type": "JOB_RUN",
  "tasks": [
    {
      "run_id": 759601,
      "task_key": "query-logs",
      "description": "Query session logs",
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/log-query"
      },
      "existing_cluster_id": "1201-my-cluster",
      "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "SUCCESS",
        "state_message": ""
      }
    },
    {
      "run_id": 759602,
      "task_key": "validate_output",
      "description": "Validate query output",
      "depends_on": [
        {
          "task_key": "query-logs"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/validate-query-results"
      },
      "existing_cluster_id": "1201-my-cluster",
      "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "SUCCESS",
        "state_message": ""
      }
    }
  ],
  "format": "MULTI_TASK"
}
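A small helper sketched against the response shape above can summarize per-task results before fetching any output:

def task_states(run):
    """Summarize per-task results, given a parsed Runs get response."""
    return {
        task["task_key"]: (
            task["state"]["life_cycle_state"],
            task["state"].get("result_state"),
        )
        for task in run.get("tasks", [])
    }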

Runs get output

For single-task format jobs, no client changes are required to process the response from the Runs get output API call.

For multi-task format jobs, calling Runs get output on a parent run results in an error since run output is available only for individual tasks. To get the output and metadata for a multi-task format job:

  • Call Runs get.
  • Iterate over the child run_id fields in the response.
  • Use the child run_id values to call Runs get output.
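The full flow might look like the following sketch; the /api/2.1/jobs/runs/get and /api/2.1/jobs/runs/get-output endpoint paths, and the example run_id, are assumptions:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"  # assumption
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the parent run, then request output for each child task run.
run = requests.get(
    f"{HOST}/api/2.1/jobs/runs/get",
    headers=HEADERS,
    params={"run_id": 759600},  # parent run_id from the Runs get example
).json()

for task in run.get("tasks", []):
    output = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get-output",
        headers=HEADERS,
        params={"run_id": task["run_id"]},  # child run_id, not the parent
    ).json()
    print(task["task_key"], output.get("notebook_output"))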

Runs list

For single-task format jobs, no client changes are required to process the response from the Runs list API call.

For multi-task format jobs, an empty tasks array is returned. Pass the run_id to the Runs get API call to retrieve the tasks. The following shows an example response from the Runs list API call for a multi-task format job:

{
  "runs": [
    {
      "job_id": 53,
      "run_id": 759600,
      "number_in_job": 7,
      "original_attempt_run_id": 759600,
      "state": {
          "life_cycle_state": "TERMINATED",
          "result_state": "SUCCESS",
          "state_message": ""
      },
      "cluster_spec": {},
      "start_time": 1595943854860,
      "setup_duration": 0,
      "execution_duration": 0,
      "cleanup_duration": 0,
      "trigger": "ONE_TIME",
      "creator_user_name": "user@databricks.com",
      "run_name": "Query logs",
      "run_type": "JOB_RUN",
      "tasks": [],
      "format": "MULTI_TASK"
    }
  ],
  "has_more": false
}
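A sketch of the follow-up call, with the usual endpoint-path assumptions:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"  # assumption
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

runs = requests.get(f"{HOST}/api/2.1/jobs/runs/list", headers=HEADERS).json().get("runs", [])

for summary in runs:
    if summary.get("format") == "MULTI_TASK":
        # The tasks array is empty in the Runs list response; fetch the full run.
        run = requests.get(
            f"{HOST}/api/2.1/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": summary["run_id"]},
        ).json()
        print(summary["run_id"], [t["task_key"] for t in run.get("tasks", [])])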

Data structures

TaskSettings

Settings for a task.

task_key (STRING)

A unique name for the task. This field is used to refer to this task from other tasks.

This field is required and must be unique within its parent job.

On Update or Reset, this field is used to reference the tasks to be updated or reset.

The maximum length is 100 characters.

description (STRING)

An optional description for this task.

The maximum length is 4096 bytes.

depends_on (Array of task_key)

An optional array of objects specifying the dependency graph of the task. All tasks specified in this field must complete successfully before this task is executed.

The key is task_key, and the value is the name assigned to the dependent task.

This field is required when a job consists of more than one task.

existing_cluster_id OR new_cluster (STRING OR NewCluster)

The cluster to use for execution of this task. This is either the ID of an existing cluster, or a description of a cluster to create for each run.

This field is required.

notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task OR pipeline_task (NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask OR PipelineTask)

Definition of the task to execute.

This field is required.

timeout_seconds (INT32)

An optional timeout for each run of this task.

The default is to have no timeout.

max_retries (INT32)

An optional maximum number of times to retry an unsuccessful run.

The default behavior is never to retry.

min_retry_interval_millis (INT32)

An optional minimal interval in milliseconds between attempts.

The default behavior is to retry unsuccessful runs immediately.

retry_on_timeout (BOOL)

An optional flag indicating whether to retry a task when it times out.

The default behavior is to not retry on timeout.

email_notifications (JobEmailNotifications)

An optional set of email addresses to notify on task events.

PipelineTask

Settings for a pipeline task.

pipeline_id (STRING)

The full name of the pipeline to execute.
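For illustration, a TaskSettings entry that runs a pipeline might look like the following sketch; the task_key and pipeline ID are hypothetical placeholders.

# A hypothetical task that triggers a pipeline; replace the placeholder pipeline ID.
pipeline_step = {
    "task_key": "refresh_pipeline",
    "pipeline_task": {
        "pipeline_id": "<pipeline-id>"
    },
}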