Jobs API

The Jobs API allows you to create, edit, and delete jobs via the API. For cost information, see the Databricks pricing page. Note that if your plan has a discount for jobs-only clusters, clusters created via the Jobs API will be eligible for that discount.


Create

Endpoint HTTP Method
2.0/jobs/create POST

Creates a new job with the provided settings.

An example request for a job that runs at 10:15pm each night:

{
  "name": "Nightly model training",
  "new_cluster": {
    "spark_version": "2.0.x-scala2.10",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 10
  },
  "libraries": [
    {
      "jar": "dbfs:/my-jar.jar"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2"
      }
    }
  ],
  "email_notifications": {
    "on_start": [],
    "on_success": [],
    "on_failure": []
  },
  "timeout_seconds": 3600,
  "max_retries": 1,
  "schedule": {
    "quartz_cron_expression": "0 15 22 ? * *",
    "timezone_id": "America/Los_Angeles"
  },
  "spark_jar_task": {
    "main_class_name": "com.databricks.ComputeModels"
  }
}

And response:

{
  "job_id": 1
}

Request Structure

Creates a new job with the provided settings.

Field Name Type Description
existing_cluster_id OR new_cluster STRING OR NewCluster

If existing_cluster_id, the id of an existing cluster that will be used for all runs of this job. Please note that when running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability.

Tip: Instances will be kept until their next hour of EC2 billing ends, without incurring additional EC2 costs. New clusters will reuse such instances.

If new_cluster, a description of a cluster that will be created for each run.

notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask

If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task.

If spark_jar_task, indicates that this job should run a jar.

If spark_python_task, indicates that this job should run a Python file.

If spark_submit_task, indicates that this job should run a spark-submit script.

name STRING An optional name for the job. The default value is Untitled.
libraries An array of Library An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list.
email_notifications JobEmailNotifications An optional set of email addresses that will be notified when runs of this job begin or complete as well as when this job is deleted. The default behavior is to not send any emails.
timeout_seconds INT32 An optional timeout applied to each run of this job. The default behavior is to have no timeout.
max_retries INT32 An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a FAILED result_state or INTERNAL_ERROR life_cycle_state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry.
min_retry_interval_millis INT32 An optional minimal interval in milliseconds between attempts. The default behavior is that unsuccessful runs are immediately retried.
retry_on_timeout BOOL An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
schedule CronSchedule An optional periodic schedule for this job. The default behavior is that the job will only run when triggered by clicking “Run Now” in the Jobs UI or sending an API request to runNow.
max_concurrent_runs INT32

An optional maximum allowed number of concurrent runs of the job.

Set this value if you want to be able to execute multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs that differ by their input parameters.

This setting affects only new runs. For example, suppose the job’s concurrency is 4 and there are 4 concurrent active runs. Then setting the concurrency to 3 won’t kill any of the active runs. However, from then on, new runs will be skipped unless there are fewer than 3 active runs.

This value cannot exceed 1000. Setting this value to 0 will cause all new runs to be skipped. The default behavior is to allow only 1 concurrent run.
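
For illustration, a minimal create request that runs a notebook on an existing cluster and allows up to three concurrent runs might look like this (the name, cluster id, and notebook path are placeholders):

{
  "name": "Hourly report",
  "existing_cluster_id": "1201-my-cluster",
  "notebook_task": {
    "notebook_path": "/Users/donald@duck.com/my-notebook"
  },
  "max_concurrent_runs": 3
}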

Response Structure

Field Name Type Description
job_id INT64 The canonical identifier for the newly created job.

List

Endpoint HTTP Method
2.0/jobs/list GET

Lists all jobs. An example response:

{
  "jobs": [
    {
      "job_id": 1,
      "settings": {
        "name": "Nightly model training",
        "new_cluster": {
          "spark_version": "2.0.x-scala2.10",
          "node_type_id": "r3.xlarge",
          "aws_attributes": {
            "availability": "ON_DEMAND"
          },
          "num_workers": 10
        },
        "libraries": [
          {
            "jar": "dbfs:/my-jar.jar"
          },
          {
            "maven": {
              "coordinates": "org.jsoup:jsoup:1.7.2"
            }
          }
        ],
        "email_notifications": {
          "on_start": [],
          "on_success": [],
          "on_failure": []
        },
        "timeout_seconds": 100000000,
        "max_retries": 1,
        "schedule": {
          "quartz_cron_expression": "0 15 22 ? * *",
          "timezone_id": "America/Los_Angeles"
        },
        "spark_jar_task": {
          "main_class_name": "com.databricks.ComputeModels"
        }
      },
      "created_time": 1457570074236
    }
  ]
}

Response Structure

Field Name Type Description
jobs An array of Job The list of jobs.

Delete

Endpoint HTTP Method
2.0/jobs/delete POST

Deletes the job and sends an email to the addresses specified in JobSettings.email_notifications. No action will occur if the job has already been removed. After the job is removed, neither its details nor its run history will be visible via the Jobs UI or API. The job is guaranteed to be removed upon completion of this request. However, runs that were active before the receipt of this request may still be active. They will be terminated asynchronously.

An example request:

{
  "job_id": 1
}

Request Structure

Deletes a job and sends an email to the addresses specified in JobSettings.email_notifications. No action will occur if the job has already been removed. After the job is removed, neither its details nor its run history will be visible via the Jobs UI or API. The job is guaranteed to be removed upon completion of this request. However, runs that were active before the receipt of this request may still be active. They will be terminated asynchronously.

Field Name Type Description
job_id INT64 The canonical identifier of the job to delete. This field is required.

Get

Endpoint HTTP Method
2.0/jobs/get GET

Retrieves information about a single job. An example request:

/jobs/get?job_id=1

An example response:

{
  "job_id": 1,
  "settings": {
    "name": "Nightly model training",
    "new_cluster": {
      "spark_version": "2.0.x-scala2.10",
      "node_type_id": "r3.xlarge",
      "aws_attributes": {
        "availability": "ON_DEMAND"
      },
      "num_workers": 10
    },
    "libraries": [
      {
        "jar": "dbfs:/my-jar.jar"
      },
      {
        "maven": {
          "coordinates": "org.jsoup:jsoup:1.7.2"
        }
      }
    ],
    "email_notifications": {
      "on_start": [],
      "on_success": [],
      "on_failure": []
    },
    "timeout_seconds": 100000000,
    "max_retries": 1,
    "schedule": {
      "quartz_cron_expression": "0 15 22 ? * *",
      "timezone_id": "America/Los_Angeles"
    },
    "spark_jar_task": {
      "main_class_name": "com.databricks.ComputeModels"
    }
  },
  "created_time": 1457570074236
}

Request Structure

Retrieves information about a single job.

Field Name Type Description
job_id INT64 The canonical identifier of the job to retrieve information about. This field is required.

Response Structure

Field Name Type Description
job_id INT64 The canonical identifier for this job.
creator_user_name STRING The creator user name. This field won’t be included in the response if the user has already been deleted.
settings JobSettings Settings for this job and all of its runs. These settings can be updated using the resetJob method.
created_time INT64 The time at which this job was created in epoch milliseconds (milliseconds since 1/1/1970).

Reset

Endpoint HTTP Method
2.0/jobs/reset POST

Overwrites the settings of a job with the provided settings.

An example request that makes job 2 look like job 1 (from the Create example):

{
  "job_id": 2,
  "new_settings": {
    "name": "Nightly model training",
    "new_cluster": {
      "spark_version": "2.0.x-scala2.10",
      "node_type_id": "r3.xlarge",
      "aws_attributes": {
        "availability": "ON_DEMAND"
      },
      "num_workers": 10
    },
    "libraries": [
      {
        "jar": "dbfs:/my-jar.jar"
      },
      {
        "maven": {
          "coordinates": "org.jsoup:jsoup:1.7.2"
        }
      }
    ],
    "email_notifications": {
      "on_start": [],
      "on_success": [],
      "on_failure": []
    },
    "timeout_seconds": 100000000,
    "max_retries": 1,
    "schedule": {
      "quartz_cron_expression": "0 15 22 ? * *",
      "timezone_id": "America/Los_Angeles"
    },
    "spark_jar_task": {
      "main_class_name": "com.databricks.ComputeModels"
    }
  }
}

Request Structure

Overwrites the settings of the job with the provided settings.

Field Name Type Description
job_id INT64 The canonical identifier of the job to reset. This field is required.
new_settings JobSettings

The new settings of the job. These new settings will replace the old settings entirely.

Changes to the following fields will not be applied to active runs: JobSettings.cluster_spec or JobSettings.task.

Changes to the following fields will be applied to active runs as well as future runs: JobSettings.timeout_seconds, JobSettings.email_notifications, or JobSettings.retry_policy. This field is required.


Run Now

Endpoint HTTP Method
2.0/jobs/run-now POST

Runs the job now, and returns the run_id of the triggered run.

Note

If you find yourself using Create together with Run Now frequently, you may be interested in the Runs Submit API, which allows you to submit your workload directly without having to create a job in Databricks.

An example request for a notebook job:

{
  "job_id": 1,
  "notebook_params": {
    "dry-run": "true",
    "oldest-time-to-consider": "1457570074236"
  }
}

An example request for a jar job:

{
  "job_id": 2,
  "jar_params": ["param1", "param2"]
}

Request Structure

Runs the job now, and returns the run_id of the triggered run.

Field Name Type Description
job_id INT64 The canonical identifier of the job to run. This field is required.
jar_params An array of STRING A list of parameters for jobs with jar tasks, e.g. "jar_params": ["john doe", "35"]. The parameters will be used to invoke the main function of the main class specified in the spark jar task. If not specified upon run-now, it will default to an empty list. jar_params cannot be specified in conjunction with notebook_params. The json representation of this field (i.e. {"jar_params":["john doe","35"]}) cannot exceed 10,000 bytes.
notebook_params An array of ParamPair

A map from keys to values for jobs with notebook task, e.g. "notebook_params": {"name": "john doe", "age":  "35"}. The map is passed to the notebook and will be accessible through the dbutils.widgets.get function. See Widgets for more information.

If not specified upon run-now, the triggered run will use the job’s base parameters.

notebook_params cannot be specified in conjunction with jar_params.

The json representation of this field (i.e. {"notebook_params":{"name":"john doe","age":"35"}}) cannot exceed 10,000 bytes.

Response Structure

Field Name Type Description
run_id INT64 The globally unique id of the newly triggered run.
number_in_job INT64 The sequence number of this run among all runs of the job.
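
An example response (the values are illustrative):

{
  "run_id": 453,
  "number_in_job": 5
}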

Runs Submit

Endpoint HTTP Method
2.0/jobs/runs/submit POST

Submits a one-time run with the provided settings. This endpoint doesn't require a Databricks job to be created; you can submit your workload directly. Runs submitted via this endpoint don't show up in the UI. Once the run is submitted, use the jobs/runs/get API to check the run state.

An example request:

{
  "run_name": "my spark task",
  "new_cluster": {
    "spark_version": "2.0.x-scala2.10",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 10
  },
  "libraries": [
    {
      "jar": "dbfs:/my-jar.jar"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2"
      }
    }
  ],
  "timeout_seconds": 3600,
  "spark_jar_task": {
    "main_class_name": "com.databricks.ComputeModels"
  }
}

And response:

{
  "run_id": 123
}

Request Structure

Submit a new run with the provided settings.

Field Name Type Description
existing_cluster_id OR new_cluster STRING OR NewCluster

If existing_cluster_id, the id of an existing cluster that will be used for all runs of this job. Please note that when running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability.

Tip: Instances will be kept until their next hour of EC2 billing ends, without incurring additional EC2 costs. New clusters will reuse such instances.

If new_cluster, a description of a cluster that will be created for each run.

notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask

If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task.

If spark_jar_task, indicates that this job should run a jar.

If spark_python_task, indicates that this job should run a Python file.

If spark_submit_task, indicates that this job should run a spark-submit script.

run_name STRING An optional name for the run. The default value is Untitled.
libraries An array of Library An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list.
timeout_seconds INT32 An optional timeout applied to each run of this job. The default behavior is to have no timeout.

Response Structure

Field Name Type Description
run_id INT64 The canonical identifier for the newly submitted run.

Runs List

Endpoint HTTP Method
2.0/jobs/runs/list GET

Lists runs from most recently started to least recently started.

Note

Runs are automatically removed after 60 days. We recommend that you save old run results through the UI before they expire if you need to reference them in the future. See Exporting Job Run Results for details.

An example request:

/jobs/runs/list?job_id=1&active_only=false&offset=1&limit=1

And response:

{
  "runs": [
    {
      "job_id": 1,
      "run_id": 452,
      "number_in_job": 5,
      "state": {
        "life_cycle_state": "RUNNING",
        "state_message": "Performing action"
      },
      "task": {
        "notebook_task": {
          "notebook_path": "/Users/donald@duck.com/my-notebook"
        }
      },
      "cluster_spec": {
        "existing_cluster_id": "1201-my-cluster"
      },
      "cluster_instance": {
        "cluster_id": "1201-my-cluster",
        "spark_context_id": "1102398-spark-context-id"
      },
      "overriding_parameters": {
        "jar_params": ["param1", "param2"]
      },
      "start_time": 1457570074236,
      "setup_duration": 259754,
      "execution_duration": 3589020,
      "cleanup_duration": 31038,
      "trigger": "PERIODIC"
    }
  ],
  "has_more": true
}

Request Structure

Lists runs from most recently started to least recently started.

Field Name Type Description
active_only OR completed_only BOOL OR BOOL

If active_only is true, only active runs will be included in the results; otherwise, both active and completed runs are listed.

Note: This field cannot be true when completed_only is true.

If completed_only is true, only completed runs will be included in the results; otherwise, both active and completed runs are listed.

Note: This field cannot be true when active_only is true.

job_id INT64 The job for which to list runs. If omitted, the Jobs service will list runs from all jobs.
offset INT32 The offset of the first run to return, relative to the most recent run.
limit INT32 The number of runs to return. This value should be greater than 0 and less than 1000. The default value is 20. If a request specifies a limit of 0, the service will instead use the maximum limit.

Response Structure

Field Name Type Description
runs An array of Run A list of runs, from most recently started to least recently started.
has_more BOOL If true, additional runs matching the provided filter are available for listing.

Runs Get

Endpoint HTTP Method
2.0/jobs/runs/get GET

Retrieves the metadata of a run.

Note

Runs are automatically removed after 60 days. We recommend that you save old run results through the UI before they expire if you need to reference them in the future. See Exporting Job Run Results for details.

An example request:

/jobs/runs/get?run_id=452

An example response:

{
  "job_id": 1,
  "run_id": 452,
  "number_in_job": 5,
  "state": {
    "life_cycle_state": "RUNNING",
    "state_message": "Performing action"
  },
  "task": {
    "notebook_task": {
      "notebook_path": "/Users/donald@duck.com/my-notebook"
    }
  },
  "cluster_spec": {
    "existing_cluster_id": "1201-my-cluster"
  },
  "cluster_instance": {
    "cluster_id": "1201-my-cluster",
    "spark_context_id": "1102398-spark-context-id"
  },
  "overriding_parameters": {
    "jar_params": ["param1", "param2"]
  },
  "start_time": 1457570074236,
  "setup_duration": 259754,
  "execution_duration": 3589020,
  "cleanup_duration": 31038,
  "trigger": "PERIODIC"
}

Request Structure

Retrieves the metadata of a run without any output.

Field Name Type Description
run_id INT64 The canonical identifier of the run for which to retrieve the metadata. This field is required.

Response Structure

Field Name Type Description
job_id INT64 The canonical identifier of the job that contains this run.
run_id INT64 The canonical identifier of the run. This id is unique across all runs of all jobs.
creator_user_name STRING The creator user name. This field won’t be included in the response if the user has already been deleted.
number_in_job INT64 The sequence number of this run among all runs of the job. This value starts at 1.
original_attempt_run_id INT64 If this run is a retry of a prior run attempt, this field contains the run_id of the original attempt; otherwise, it is the same as the run_id.
state RunState The result and lifecycle states of the run.
schedule CronSchedule The cron schedule that triggered this run if it was triggered by the periodic scheduler.
task JobTask The task performed by the run, if any.
cluster_spec ClusterSpec A snapshot of the job’s cluster specification when this run was created.
cluster_instance ClusterInstance The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run.
overriding_parameters RunParameters The parameters used for this run.
start_time INT64 The time at which this run was started in epoch milliseconds (milliseconds since 1/1/1970). Note that this may not be the time when the job task starts executing; for example, if the job is scheduled to run on a new cluster, this is the time the cluster creation call is issued.
setup_duration INT64 The time it took to set up the cluster in milliseconds. For runs that run on new clusters this is the cluster creation time; for runs that run on existing clusters this time should be very short.
execution_duration INT64 The time in milliseconds it took to execute the commands in the jar or notebook until they completed, failed, timed out, were canceled, or encountered an unexpected error.
cleanup_duration INT64 The time in milliseconds it took to terminate the cluster and clean up any intermediary results, etc. Note that the total duration of the run is the sum of the setup_duration, the execution_duration and the cleanup_duration.
trigger TriggerType The type of trigger that fired this run, e.g., a periodic schedule or a one time run.

Runs Cancel

Endpoint HTTP Method
2.0/jobs/runs/cancel POST

Cancels a run. The run is canceled asynchronously, so when this request completes, the run may still be running. The run will be terminated shortly. If the run is already in a terminal life_cycle_state, this method is a no-op.

An example request:

{
  "run_id": 453
}

Request Structure

Cancels a run. The run is canceled asynchronously, so when this request completes, the run may still be active. The run will be terminated as soon as possible.

Field Name Type Description
run_id INT64 The canonical identifier of the run to cancel. This field is required.

Data Structures

ClusterInstance

Identifiers for the cluster and Spark context used by a run. These two values together identify an execution context across all time.

Field Name Type Description
cluster_id STRING

The canonical identifier for the cluster used by a run. This field is always available for runs on existing clusters. For runs on new clusters, it becomes available once the cluster is created. This value can be used to view logs by browsing to /#setting/sparkui/$cluster_id/driver-logs. The logs will continue to be available after the run completes.

If this identifier is not yet available, the response won’t include this field.

spark_context_id STRING

The canonical identifier for the Spark context used by a run. This field will be filled in once the run begins execution. This value can be used to view the Spark UI by browsing to /#setting/sparkui/$cluster_id/$spark_context_id. The Spark UI will continue to be available after the run has completed.

If this identifier is not yet available, the response won’t include this field.
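
For example, with the cluster_instance values shown in the Runs Get response above, the driver logs and the Spark UI would be available at paths of the form:

/#setting/sparkui/1201-my-cluster/driver-logs
/#setting/sparkui/1201-my-cluster/1102398-spark-context-id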

ClusterSpec

Field Name Type Description
existing_cluster_id OR new_cluster STRING OR NewCluster

If existing_cluster_id, the id of an existing cluster that will be used for all runs of this job. Please note that when running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability.

Tip: Instances will be kept until their next hour of EC2 billing ends, without incurring additional EC2 costs. New clusters will reuse such instances.

If new_cluster, a description of a cluster that will be created for each run.

libraries An array of Library An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list.

CronSchedule

Field Name Type Description
quartz_cron_expression STRING A cron expression using quartz syntax that describes the schedule for a job. See Quartz for details. This field is required.
timezone_id STRING A Java timezone id. The schedule for a job will be resolved with respect to this timezone. See Java TimeZone for details. This field is required.
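
For illustration, a schedule that fires every 30 minutes (Quartz cron fields are seconds, minutes, hours, day-of-month, month, and day-of-week):

{
  "quartz_cron_expression": "0 0/30 * * * ?",
  "timezone_id": "America/Los_Angeles"
}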

Job

Field Name Type Description
job_id INT64 The canonical identifier for this job.
creator_user_name STRING The creator user name. This field won’t be included in the response if the user has already been deleted.
settings JobSettings Settings for this job and all of its runs. These settings can be updated using the resetJob method.
created_time INT64 The time at which this job was created in epoch milliseconds (milliseconds since 1/1/1970).

JobEmailNotifications

Field Name Type Description
on_start An array of STRING A list of email addresses to be notified when a run begins. If not specified upon job creation or reset, the list will be empty, i.e., no address will be notified.
on_success An array of STRING A list of email addresses to be notified when a run successfully completes. A run is considered to have completed successfully if it ends with a TERMINATED life_cycle_state and a SUCCESSFUL result_state. If not specified upon job creation or reset, the list will be empty, i.e., no address will be notified.
on_failure An array of STRING A list of email addresses to be notified when a run unsuccessfully completes. A run is considered to have completed unsuccessfully if it ends with an INTERNAL_ERROR life_cycle_state or a SKIPPED, FAILED, or TIMED_OUT result_state. If not specified upon job creation or reset, the list will be empty, i.e., no address will be notified.
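
An illustrative email_notifications object that alerts a single address on failure only (the address is a placeholder):

{
  "on_start": [],
  "on_success": [],
  "on_failure": ["donald@duck.com"]
}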

JobSettings

Settings for a job. These settings can be updated using the resetJob method.

Field Name Type Description
existing_cluster_id OR new_cluster STRING OR NewCluster

If existing_cluster_id, the id of an existing cluster that will be used for all runs of this job. Please note that when running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability.

Tip: Instances will be kept until their next hour of EC2 billing ends, without incurring additional EC2 costs. New clusters will reuse such instances.

If new_cluster, a description of a cluster that will be created for each run.

notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask

If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task.

If spark_jar_task, indicates that this job should run a jar.

If spark_python_task, indicates that this job should run a Python file.

If spark_submit_task, indicates that this job should run a spark-submit script.

name STRING An optional name for the job. The default value is Untitled.
libraries An array of Library An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list.
email_notifications JobEmailNotifications An optional set of email addresses that will be notified when runs of this job begin or complete as well as when this job is deleted. The default behavior is to not send any emails.
timeout_seconds INT32 An optional timeout applied to each run of this job. The default behavior is to have no timeout.
max_retries INT32 An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a FAILED result_state or INTERNAL_ERROR life_cycle_state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry.
min_retry_interval_millis INT32 An optional minimal interval in milliseconds between attempts. The default behavior is that unsuccessful runs are immediately retried.
retry_on_timeout BOOL An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
schedule CronSchedule An optional periodic schedule for this job. The default behavior is that the job will only run when triggered by clicking “Run Now” in the Jobs UI or sending an API request to runNow.
max_concurrent_runs INT32

An optional maximum allowed number of concurrent runs of the job.

Set this value if you want to be able to execute multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs that differ by their input parameters.

This setting affects only new runs. For example, suppose the job’s concurrency is 4 and there are 4 concurrent active runs. Then setting the concurrency to 3 won’t kill any of the active runs. However, from then on, new runs will be skipped unless there are fewer than 3 active runs.

This value cannot exceed 1000. Setting this value to 0 will cause all new runs to be skipped. The default behavior is to allow only 1 concurrent run.

JobTask

Field Name Type Description
notebook_task OR spark_jar_task OR spark_python_task OR spark_submit_task NotebookTask OR SparkJarTask OR SparkPythonTask OR SparkSubmitTask

If notebook_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark_jar_task.

If spark_jar_task, indicates that this job should run a jar.

If spark_python_task, indicates that this job should run a Python file.

If spark_submit_task, indicates that this job should run a spark-submit script.

NewCluster

Field Name Type Description
num_workers OR autoscale INT32 OR AutoScale

If num_workers, number of worker nodes that this cluster should have. A cluster has one Spark Driver and num_workers Executors for a total of num_workers + 1 Spark nodes.

Note: When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual current number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field will immediately be updated to reflect the target size of 10 workers, whereas the workers listed in spark_info will gradually increase from 5 to 10 as the new nodes are provisioned.

If autoscale, parameters needed in order to automatically scale clusters up and down based on load; see the example after this table. Note: autoscaling works best with Databricks Runtime 3.0 or later.

cluster_name STRING Cluster name requested by the user. This doesn’t have to be unique. If not specified at creation, the cluster name will be an empty string.
spark_version STRING The Spark version of the cluster, e.g. “1.4.x-ubuntu15.10”. This is an optional parameter at cluster creation. If not specified, a default version of Spark will be used. A list of available Spark versions and the default value can be retrieved by using the Spark Versions API call.
spark_conf An array of SparkConfPair

An object containing a set of optional, user-specified Spark configuration key-value pairs. Users can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively.

Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}

aws_attributes AwsAttributes Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used.
node_type_id STRING This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. If not specified at cluster creation, a default value will be used. A list of available node types and the default value can be retrieved by using the List Node Types API call.
driver_node_type_id STRING The node type of the Spark driver. Note that this field is optional; if unset, the driver node type will be set as the same value as node_type_id defined above.
ssh_public_keys An array of STRING SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to log in with the user name ubuntu on port 2200. Up to 10 keys can be specified.
custom_tags An array of ClusterTag

Additional tags for cluster resources. Databricks will tag all cluster resources (e.g., AWS instances and EBS volumes) with these tags in addition to default_tags. Notes:

  • Tags are not supported on legacy node types such as compute-optimized and memory-optimized
  • Currently, Databricks allows at most 45 custom tags
  • Clusters can only reuse cloud resources if the resources’ tags are a subset of the cluster tags
cluster_log_conf ClusterLogConf The configuration for delivering Spark logs to a long-term storage destination. Two kinds of destinations (dbfs and s3) are supported. Only one destination can be specified for one cluster. If the conf is given, the logs will be delivered to the destination every 5 minutes. The destination of driver logs is $destination/$clusterId/driver, while the destination of executor logs is $destination/$clusterId/executor.
spark_env_vars An array of SparkEnvPair

An object containing a set of optional, user-specified environment variable key-value pairs. Please note that a key-value pair of the form (X, Y) will be exported as is (i.e., export X='Y') while launching the driver and workers.

In order to specify an additional set of SPARK_DAEMON_JAVA_OPTS, we recommend appending them to $SPARK_DAEMON_JAVA_OPTS as shown in the example below. This ensures that all default Databricks-managed environment variables are included as well.

Example Spark environment variables: {"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"} or {"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}

autotermination_minutes INT32 Automatically terminates the cluster after it is inactive for this time in minutes. If not set, this cluster will not be automatically terminated. If specified, the threshold must be between 10 and 10000 minutes. Users can also set this value to 0 to explicitly disable automatic termination.
enable_elastic_disk BOOL Autoscaling Local Storage: when enabled, this cluster will dynamically acquire additional disk space when its Spark workers are running low on disk space. This feature requires specific AWS permissions to function correctly; refer to the User Guide for more details.
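
As a sketch, a new_cluster specification that uses autoscale instead of num_workers might look like the following; it assumes the AutoScale structure takes min_workers and max_workers, as defined in the Clusters API:

{
  "spark_version": "2.0.x-scala2.10",
  "node_type_id": "r3.xlarge",
  "aws_attributes": {
    "availability": "ON_DEMAND"
  },
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  }
}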

NotebookTask

Field Name Type Description
notebook_path STRING The absolute path of the notebook to be run in the Databricks Workspace. This path must begin with a slash. Relative paths will be supported in the future. This field is required.
base_parameters An array of ParamPair

Base parameters to be used for each run of this job. If the run is initiated by a call to run-now with parameters specified, the two parameters maps will be merged. If the same key is specified in base_parameters and in run-now, the value from run-now will be used.

If the notebook takes a parameter that is not specified in the job’s base_parameters or the run-now override parameters, the default value from the notebook will be used.

These parameters can be retrieved in a notebook by using dbutils.widgets.get().
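
To illustrate the merge behavior with hypothetical parameters: given these base_parameters in the job settings,

{
  "base_parameters": {"name": "john doe", "age": "35"}
}

and this run-now override,

{
  "notebook_params": {"age": "40"}
}

the notebook will see name = "john doe" and age = "40".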

ParamPair

Name-based parameters for jobs running notebook tasks.

Field Name Type Description
key STRING Named parameter, can be passed to dbutils.widgets.get() to retrieve the corresponding value.
value STRING Value of named parameter, returned by calls to dbutils.widgets.get() in notebooks.

Run

All the information about a run except for its output. The output can be retrieved separately with the getRunOutput method.

Field Name Type Description
job_id INT64 The canonical identifier of the job that contains this run.
run_id INT64 The canonical identifier of the run. This id is unique across all runs of all jobs.
creator_user_name STRING The creator user name. This field won’t be included in the response if the user has already been deleted.
number_in_job INT64 The sequence number of this run among all runs of the job. This value starts at 1.
original_attempt_run_id INT64 If this run is a retry of a prior run attempt, this field contains the run_id of the original attempt; otherwise, it is the same as the run_id.
state RunState The result and lifecycle states of the run.
schedule CronSchedule The cron schedule that triggered this run if it was triggered by the periodic scheduler.
task JobTask The task performed by the run, if any.
cluster_spec ClusterSpec A snapshot of the job’s cluster specification when this run was created.
cluster_instance ClusterInstance The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run.
overriding_parameters RunParameters The parameters used for this run.
start_time INT64 The time at which this run was started in epoch milliseconds (milliseconds since 1/1/1970). Note that this may not be the time when the job task starts executing; for example, if the job is scheduled to run on a new cluster, this is the time the cluster creation call is issued.
setup_duration INT64 The time it took to set up the cluster in milliseconds. For runs that run on new clusters this is the cluster creation time; for runs that run on existing clusters this time should be very short.
execution_duration INT64 The time in milliseconds it took to execute the commands in the jar or notebook until they completed, failed, timed out, were canceled, or encountered an unexpected error.
cleanup_duration INT64 The time in milliseconds it took to terminate the cluster and clean up any intermediary results, etc. Note that the total duration of the run is the sum of the setup_duration, the execution_duration and the cleanup_duration.
trigger TriggerType The type of trigger that fired this run, e.g., a periodic schedule or a one time run.

RunParameters

Parameters for this run. Only one of jar_params, python_params, or notebook_params should be specified in the run-now request, depending on the type of job task. Jobs with a jar or python task take a list of position-based parameters, and jobs with a notebook task take a key-value map.

Field Name Type Description
jar_params An array of STRING A list of parameters for jobs with jar tasks, e.g. "jar_params": ["john doe", "35"]. The parameters will be used to invoke the main function of the main class specified in the spark jar task. If not specified upon run-now, it will default to an empty list. jar_params cannot be specified in conjunction with notebook_params. The json representation of this field (i.e. {"jar_params":["john doe","35"]}) cannot exceed 10,000 bytes.
notebook_params An array of ParamPair

A map from keys to values for jobs with notebook task, e.g. "notebook_params": {"name": "john doe", "age":  "35"}. The map is passed to the notebook and will be accessible through the dbutils.widgets.get function. See Widgets for more information.

If not specified upon run-now, the triggered run will use the job’s base parameters.

notebook_params cannot be specified in conjunction with jar_params.

The json representation of this field (i.e. {"notebook_params":{"name":"john doe","age":"35"}}) cannot exceed 10,000 bytes.

RunState

Field Name Type Description
life_cycle_state RunLifeCycleState A description of a run’s current location in the run lifecycle. This field is always available in the response.
result_state RunResultState The result state of a run. If it is not available, the response won’t include this field. See RunResultState for details about the availability of result_state.
state_message STRING A descriptive message for the current state.
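
For example, an illustrative state object for a run that completed successfully:

{
  "life_cycle_state": "TERMINATED",
  "result_state": "SUCCESS",
  "state_message": ""
}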

SparkJarTask

Field Name Type Description
jar_uri STRING Deprecated since 04/2016. Please provide a jar through the libraries field instead. For an example, see Create.
main_class_name STRING

The full name of the class containing the main method to be executed. This class must be contained in a jar provided as a library.

Note that the code should use SparkContext.getOrCreate to obtain a Spark context; otherwise, runs of the job will fail.

parameters An array of STRING Parameters that will be passed to the main method.
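
For illustration, a spark_jar_task paired with the library that provides its main class (the jar path, class name, and parameters are taken from the examples above):

"libraries": [
  {
    "jar": "dbfs:/my-jar.jar"
  }
],
"spark_jar_task": {
  "main_class_name": "com.databricks.ComputeModels",
  "parameters": ["param1", "param2"]
}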

SparkPythonTask

Field Name Type Description
python_file STRING The URI of the Python file to be executed. Currently, only DBFS and S3 paths are supported. This field is required.
parameters An array of STRING Command line parameters that will be passed to the python file.
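
An illustrative spark_python_task (the DBFS path is a placeholder):

{
  "spark_python_task": {
    "python_file": "dbfs:/path/to/my-script.py",
    "parameters": ["param1", "param2"]
  }
}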

SparkSubmitTask

Here are some important things to know:

  • Spark submit tasks can be run only on new clusters.
  • master, deploy-mode, and executor-cores are configured by Databricks automatically; you cannot specify them in parameters.
  • By default, the spark submit job uses all available memory (excluding reserved memory for Databricks services). You can set --driver-memory and --executor-memory to smaller values to leave some room for off-heap usage.
  • The libraries and spark conf in the new_cluster specification are not supported yet. Use --jars and --py-files to add Java and Python libraries, and --conf to set the Spark configuration.
  • S3 and DBFS paths are supported in --jars, --py-files, --files, and in the application jar/Python arguments.
  • Only Spark versions greater than or equal to 2.1.1-db5 (e.g., 2.1.1-db5-scala2.10) are supported.

For example, you can run SparkPi by setting the following parameters, assuming the jar is uploaded to DBFS already.

{
  "parameters": [
    "--class",
    "org.apache.spark.examples.SparkPi",
    "dbfs:/path/to/examples.jar",
    "10"
  ]
}

Field Name Type Description
parameters An array of STRING Command line parameters that will be passed to spark submit.

RunLifeCycleState

The life cycle state of a run. Allowed state transitions are:

  • PENDING -> RUNNING -> TERMINATING -> TERMINATED
  • PENDING -> SKIPPED
  • PENDING -> INTERNAL_ERROR
  • RUNNING -> INTERNAL_ERROR
  • TERMINATING -> INTERNAL_ERROR
PENDING The run has been triggered. If there is not already an active run of the same job, the cluster and execution context are being prepared. If there is already an active run of the same job, the run will immediately transition into a SKIPPED state without preparing any resources.
RUNNING The task of this run is currently being executed.
TERMINATING The task of this run has completed, and the cluster and execution context are being cleaned up.
TERMINATED The task of this run has completed, and the cluster and execution context have been cleaned up. This state is terminal.
SKIPPED This run was aborted because a previous run of the same job was already active. This state is terminal.
INTERNAL_ERROR An exceptional state that indicates a failure in the Jobs service, such as network failure over a long period. If a run on a new cluster ends in an INTERNAL_ERROR state, the Jobs service will terminate the cluster as soon as possible. This state is terminal.

RunResultState

The result state of the run.

  • If life_cycle_state = TERMINATED: if the run had a task, the result is guaranteed to be available, and it indicates the result of the task.
  • If life_cycle_state = PENDING, RUNNING, or SKIPPED, the result state is not available.
  • If life_cycle_state = TERMINATING or life_cycle_state = INTERNAL_ERROR: the result state is available if the run had a task and managed to start it.

Once available, the result state will never change.

SUCCESS The task completed successfully.
FAILED The task completed with an error.
TIMEDOUT The run was stopped after reaching the timeout.
CANCELED The run was canceled at user request.

TriggerType

These are the types of triggers that can fire a run.

PERIODIC These are schedules that periodically trigger runs, such as a cron scheduler.
ONE_TIME These are one-time triggers that fire only a single run. This means the user triggered a single run on demand through the UI or the API.
RETRY This indicates a run that is triggered as a retry of a previously failed run. This occurs when the user requests to re-run the job in case of failures.