Create, run, and manage Databricks Jobs

A job is a way to run non-interactive code in a Databricks cluster. For example, you can run an extract, transform, and load (ETL) workload immediately or on a schedule. You can also run jobs interactively in the notebook UI.

You can create and run a job using the UI, the CLI, or by invoking the Jobs API. You can repair and re-run a failed or canceled job using the UI or API. You can monitor job run results using the UI, CLI, API, and notifications (for example, email, webhook destination, or Slack notifications). This article focuses on performing job tasks using the UI. For the other methods, see Jobs CLI and Jobs API 2.1.
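For example, a run can be triggered programmatically by POSTing to the Jobs API 2.1 run-now endpoint. A minimal sketch using only the standard library; the workspace URL, token, and job ID are hypothetical placeholders:

```python
import json
import urllib.request

HOST = "https://<your-workspace>.cloud.databricks.com"  # hypothetical workspace URL
TOKEN = "<personal-access-token>"                       # hypothetical access token

def run_now_payload(job_id):
    """Build the request body for the Jobs API 2.1 run-now endpoint."""
    return {"job_id": job_id}

def trigger_run(job_id):
    """POST /api/2.1/jobs/run-now and return the new run's ID."""
    req = urllib.request.Request(
        f"{HOST}/api/2.1/jobs/run-now",
        data=json.dumps(run_now_payload(job_id)).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["run_id"]
```

The returned run ID can then be used with the runs/get endpoint to monitor the run's state.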

Your job can consist of a single task or can be a large, multi-task workflow with complex dependencies. Databricks manages the task orchestration, cluster management, monitoring, and error reporting for all of your jobs. You can run your jobs immediately or periodically through an easy-to-use scheduling system.

You can implement a task in a JAR, a Databricks notebook, a Delta Live Tables pipeline, or an application written in Scala, Java, or Python. Legacy Spark Submit applications are also supported. You control the execution order of tasks by specifying dependencies between the tasks. You can configure tasks to run in sequence or parallel. The following diagram illustrates a workflow that:

  1. Ingests raw clickstream data and performs processing to sessionize the records.

  2. Ingests order data and joins it with the sessionized clickstream data to create a prepared data set for analysis.

  3. Extracts features from the prepared data.

  4. Performs tasks in parallel to persist the features and train a machine learning model.

    Example multitask workflow

To create your first workflow with a Databricks job, see the quickstart.

Important

  • You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.

  • A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately.

  • The number of jobs a workspace can create in an hour is limited to 10000 (this count includes runs submitted with the Runs submit API). This limit also affects jobs created by the REST API and notebook workflows.

Create a job

  1. Do one of the following:

    • Click Jobs Icon Workflows in the sidebar and click Create Job Button.

    • In the sidebar, click New Icon New and select Job.

    The Tasks tab appears with the create task dialog.

    Create task screen

  2. Replace Add a name for your job… with your job name.

  3. Enter a name for the task in the Task name field.

  4. In the Type dropdown menu, select the type of task to run.

    • Notebook: In the Source dropdown menu, select a location for the notebook: either Workspace for a notebook located in a Databricks workspace folder, or Git provider for a notebook located in a remote Git repository.

      Workspace: Use the file browser to find the notebook, click the notebook name, and click Confirm.

      Git provider: Click Edit and enter the Git repository information. See Run jobs using notebooks in a remote Git repository.

    • JAR: Specify the Main class. Use the fully qualified name of the class containing the main method, for example, org.apache.spark.examples.SparkPi. Then click Add under Dependent Libraries to add libraries required to run the task. One of these libraries must contain the main class.

      To learn more about JAR tasks, see JAR jobs.

    • Spark Submit: In the Parameters text box, specify the main class, the path to the library JAR, and all arguments, formatted as a JSON array of strings. The following example configures a spark-submit task to run the DFSReadWriteTest from the Apache Spark examples:

      ["--class","org.apache.spark.examples.DFSReadWriteTest","dbfs:/FileStore/libraries/spark_examples_2_12_3_1_1.jar","/dbfs/databricks-datasets/README.md","/FileStore/examples/output/"]
      

      Important

      There are several limitations for spark-submit tasks:

      • You can run spark-submit tasks only on new clusters.

      • Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster autoscaling.

      • Spark-submit does not support Databricks Utilities. To use Databricks Utilities, use JAR tasks instead.

    • Python script: In the Source dropdown menu, select a location for the Python script: either Workspace for a script in the local workspace, or DBFS for a script located on DBFS or cloud storage. In the Path textbox, enter the path to the Python script:

      Workspace: In the Select Python File dialog, browse to the Python script and click Confirm. Your script must be in a Databricks repo.

      DBFS: Enter the URI of a Python script on DBFS or cloud storage; for example, dbfs:/FileStore/myscript.py.

    • Delta Live Tables Pipeline: In the Pipeline dropdown menu, select an existing Delta Live Tables pipeline.

      Important

      You can use only triggered pipelines with the Pipeline task. Continuous pipelines are not supported as a job task. To learn more about triggered and continuous pipelines, see Continuous and triggered pipelines.

    • Python Wheel: In the Package name text box, enter the package to import, for example, myWheel-1.0-py2.py3-none-any.whl. In the Entry Point text box, enter the function to call when starting the wheel. Click Add under Dependent Libraries to add libraries required to run the task.

    • SQL: In the SQL task dropdown menu, select Query, Dashboard, or Alert.

      Query: In the SQL query dropdown menu, select the query to execute when the task runs. In the SQL warehouse dropdown menu, select a serverless or pro SQL warehouse to run the task.

      Dashboard: In the SQL dashboard dropdown menu, select a dashboard to be updated when the task runs. In the SQL warehouse dropdown menu, select a serverless or pro SQL warehouse to run the task.

      Alert: In the SQL alert dropdown menu, select an alert to trigger for evaluation. In the SQL warehouse dropdown menu, select a serverless or pro SQL warehouse to run the task.

    • dbt: See Use dbt in a Databricks job for a detailed example of how to configure a dbt task.

      Preview

      The dbt task is in Public Preview.

  5. Configure the cluster where the task runs. In the Cluster dropdown menu, select either New Job Cluster or Existing All-Purpose Cluster.

    • New Job Cluster: Click Edit in the Cluster dropdown menu and complete the cluster configuration.

    • Existing All-Purpose Cluster: Select an existing cluster in the Cluster dropdown menu. To open the cluster in a new page, click the External Link icon to the right of the cluster name and description.

    To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.

  6. You can pass parameters for your task. Each task type has different requirements for formatting and passing the parameters.

    • Notebook: Click Add and specify the key and value of each parameter to pass to the task. You can override existing parameters or add new ones when you manually run a task using the Run a job with different parameters option. Parameters set the value of the notebook widget specified by the key of the parameter. Use task parameter variables to pass a limited set of dynamic values as part of a parameter value.

    • JAR: Use a JSON-formatted array of strings to specify parameters. These strings are passed as arguments to the main method of the main class. See Configure JAR job parameters.

    • Spark Submit task: Parameters are specified as a JSON-formatted array of strings. Conforming to the Apache Spark spark-submit convention, parameters after the JAR path are passed to the main method of the main class.

    • Python script: Use a JSON-formatted array of strings to specify parameters. These strings are passed as arguments which can be parsed using the argparse module in Python.

    • Python Wheel: In the Parameters dropdown menu, select Positional arguments to enter parameters as a JSON-formatted array of strings, or select Keyword arguments > Add to enter the key and value of each parameter. Both positional and keyword arguments are passed to the Python wheel task as command-line arguments.

  7. To access additional options, including Dependent Libraries, Retry Policy, and Timeouts, click Advanced Options. See Edit a task.

  8. Click Create.

  9. To optionally set the job’s schedule, click Edit schedule in the Job details panel. See Schedule a job.

  10. To optionally allow multiple concurrent runs of the same job, click Edit concurrent runs in the Job details panel. See Maximum concurrent runs.

  11. To receive notifications on job events, click Edit email notifications or Edit system notifications in the Job details panel. See Notifications.

  12. To optionally control permission levels on the job, click Edit permissions in the Job details panel. See Control access to jobs.

To add another task, click Add Task Button below the task you just created. A shared cluster option is provided if you have configured a New Job Cluster for a previous task. You can also configure a cluster for each task when you create or edit a task. To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
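The multi-task job that the steps above build through the UI can equivalently be described as a Jobs API 2.1 create request body. The following is a sketch of a two-task job with a dependency; the job name, task keys, notebook paths, and cluster sizing are hypothetical:

```python
# Hypothetical request body for POST /api/2.1/jobs/create.
job_spec = {
    "name": "my_etl_job",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Shared/ingest"},
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        },
        {
            "task_key": "transform",
            # transform starts only after ingest completes successfully
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Shared/transform"},
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        },
    ],
    "max_concurrent_runs": 1,
}
```

Each task's type maps to one field on the task object (notebook_task, spark_jar_task, spark_python_task, and so on), mirroring the Type dropdown in the UI.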

Run a job

  1. Click Jobs Icon Workflows in the sidebar.

  2. Select a job and click the Runs tab. You can run a job immediately or schedule the job to run later.

If one or more tasks in a multi-task job fail, you can re-run the subset of unsuccessful tasks. See Repair an unsuccessful job run.

Run a job immediately

To run the job immediately, click Run Now Button.

Tip

You can perform a test run of a job with a notebook task by clicking Run Now. If you need to make changes to the notebook, clicking Run Now again after editing the notebook will automatically run the new version of the notebook.

Run a job with different parameters

You can use Run Now with Different Parameters to re-run a job with different parameters or different values for existing parameters.

  1. Click Blue Down Caret next to Run Now and select Run Now with Different Parameters or, in the Active Runs table, click Run Now with Different Parameters. Enter the new parameters depending on the type of task.

    • Notebook: You can enter parameters as key-value pairs or a JSON object. The provided parameters are merged with the default parameters for the triggered run. You can use this dialog to set the values of widgets.

    • JAR and spark-submit: You can enter a list of parameters or a JSON document. If you delete keys, the default parameters are used. You can also add task parameter variables for the run.

  2. Click Run.
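Through the Jobs API 2.1, the same override is expressed in the run-now request body. A sketch with a hypothetical job ID and parameter names:

```python
# Hypothetical request body for POST /api/2.1/jobs/run-now.
# Notebook tasks take key-value pairs; they are merged with the job's
# default parameters and set the matching notebook widgets.
run_now_body = {
    "job_id": 42,
    "notebook_params": {"input_date": "2021-02-15"},
}

# JAR and spark-submit tasks instead take a JSON array of strings:
jar_run_body = {
    "job_id": 42,
    "jar_params": ["--input", "/mnt/raw/2021-02-15"],
}
```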

Repair an unsuccessful job run

You can repair failed or canceled multi-task jobs by running only the subset of unsuccessful tasks and any dependent tasks. Because successful tasks and any tasks that depend on them are not re-run, this feature reduces the time and resources required to recover from unsuccessful job runs.

You can change job or task settings before repairing the job run. Unsuccessful tasks are re-run with the current job and task settings. For example, if you change the path to a notebook or a cluster setting, the task is re-run with the updated notebook or cluster settings.

You can view the history of all task runs on the Task run details page.

Note

  • If one or more tasks share a job cluster, a repair run creates a new job cluster; for example, if the original run used the job cluster my_job_cluster, the first repair run uses the new job cluster my_job_cluster_v1, allowing you to easily see the cluster and cluster settings used by the initial run and any repair runs. The settings for my_job_cluster_v1 are the same as the current settings for my_job_cluster.

  • Repair is supported only with jobs that orchestrate two or more tasks.

  • The Duration value displayed in the Runs tab includes the time the first run started until the time when the latest repair run finished. For example, if a run failed twice and succeeded on the third run, the duration includes the time for all three runs.

To repair an unsuccessful job run:

  1. Click Jobs Icon Workflows in the sidebar.

  2. In the Name column, click a job name. The Runs tab shows active runs and completed runs, including any unsuccessful runs.

  3. Click the link for the unsuccessful run in the Start time column of the Completed Runs (past 60 days) table. The Job run details page appears.

  4. Click Repair run. The Repair job run dialog appears, listing all unsuccessful tasks and any dependent tasks that will be re-run.

  5. To add or edit parameters for the tasks to repair, enter the parameters in the Repair job run dialog. Parameters you enter in the Repair job run dialog override existing values. On subsequent repair runs, you can return a parameter to its original value by clearing the key and value in the Repair job run dialog.

  6. Click Repair run in the Repair job run dialog.
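A repair can also be requested through the Jobs API 2.1 runs/repair endpoint. A sketch with hypothetical IDs, task keys, and parameters:

```python
# Hypothetical request body for POST /api/2.1/jobs/runs/repair.
repair_body = {
    "run_id": 455644833,
    # Re-run only these unsuccessful tasks; dependent tasks are re-run too.
    "rerun_tasks": ["transform"],
    # Parameters supplied here override the existing values, as in the dialog.
    "notebook_params": {"input_date": "2021-02-16"},
}

# On a second or later repair of the same run, include the repair ID
# returned by the previous repair request:
subsequent_repair = dict(repair_body, latest_repair_id=734650698524280)
```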

View task run history

To view the run history of a task, including successful and unsuccessful runs:

  1. Click on a task on the Job run details page. The Task run details page appears.

  2. Select the task run in the run history dropdown menu.

Schedule a job

To define a schedule for the job:

  1. Click Edit schedule in the Job details panel and set the Schedule Type to Scheduled.

  2. Specify the period, starting time, and time zone. Optionally select the Show Cron Syntax checkbox to display and edit the schedule in Quartz Cron Syntax.

    Note

    • Databricks enforces a minimum interval of 10 seconds between subsequent runs triggered by the schedule of a job regardless of the seconds configuration in the cron expression.

    • You can choose a time zone that observes daylight saving time or UTC. If you select a zone that observes daylight saving time, an hourly job will be skipped or may appear to not fire for an hour or two when daylight saving time begins or ends. To run at every hour (absolute time), choose UTC.

    • The job scheduler is not intended for low latency jobs. Due to network or cloud issues, job runs may occasionally be delayed up to several minutes. In these situations, scheduled jobs will run immediately upon service availability.

  3. Click Save.

You can also schedule a notebook job directly in the notebook UI.
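In the Jobs API, the same schedule is expressed as a schedule object in the job settings. A sketch; the cron expression (daily at 6:00 AM) is only an example:

```python
# Hypothetical schedule object in a Jobs API 2.1 job settings payload.
schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",  # Quartz syntax: sec min hour day month day-of-week
    "timezone_id": "UTC",                     # UTC avoids daylight-saving skips
    "pause_status": "UNPAUSED",               # "PAUSED" matches Manual (Paused) in the UI
}
```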

Pause and resume a job schedule

To pause a job, you can either:

  • Click Pause in the Job details panel.

  • Click Edit schedule in the Job details panel and set the Schedule Type to Manual (Paused)

To resume a paused job schedule, set the Schedule Type to Scheduled.

View jobs

Click Jobs Icon Workflows in the sidebar. The Jobs list appears. The Jobs page lists all defined jobs, the cluster definition, the schedule, if any, and the result of the last run.

Note

If you have the increased jobs limit enabled for this workspace, only 25 jobs are displayed in the Jobs list to improve the page loading time. Use the left and right arrows to page through the full list of jobs.

You can filter jobs in the Jobs list:

  • Using keywords. If you have the increased jobs limit feature enabled for this workspace, searching by keywords is supported only for the name, job ID, and job tag fields.

  • Selecting only the jobs you own.

  • Selecting all jobs you have permissions to access. Access to this filter requires that Jobs access control is enabled.

  • Using tags. To search for a tag created with only a key, type the key into the search box. To search for a tag created with a key and value, you can search by the key, the value, or both the key and value. For example, for a tag with the key department and the value finance, you can search for department or finance to find matching jobs. To search by both the key and value, enter the key and value separated by a colon; for example, department:finance.

You can also click any column header to sort the list of jobs (either descending or ascending) by that column. When the increased jobs limit feature is enabled, you can sort only by Name, Job ID, or Created by. The default sorting is by Name in ascending order.

View runs for a job

  1. Click Jobs Icon Workflows in the sidebar.

  2. In the Name column, click a job name. The Runs tab appears with a table of active runs and completed runs.

Job runs table

To view details for a job run, click the link for the run in the Start time column of the Completed Runs (past 60 days) table. To view details for the most recent successful run of this job, click Latest successful run (refreshes automatically).

To switch to a matrix view, click Matrix. The matrix view shows a history of runs for the job, including each job task.

The Job Runs row of the matrix displays the total duration of the run and the state of the run. To view details of the run, including the start time, duration, and status, hover over the bar in the Job Runs row.

Each cell in the Tasks row represents a task and the corresponding status of the task. To view details of each task, including the start time, duration, cluster, and status, hover over the cell for that task.

The job run and task run bars are color-coded to indicate the status of the run. Successful runs are green, unsuccessful runs are red, and skipped runs are pink. The height of the individual job run and task run bars provides a visual indication of the run duration.

Job runs matrix

Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, Databricks recommends that you export results before they expire. For more information, see Export job run results.

View job run details

The job run details page contains job output and links to logs, including information about the success or failure of each task in the job run. You can access job run details from the Runs tab for the job. To view job run details from the Runs tab, click the link for the run in the Start time column of the Completed Runs (past 60 days) table. To return to the Runs tab for the job, click on the Job ID value.

Click on a task to view task run details, including:

  • the cluster that ran the task

  • the Spark UI for the task

  • logs for the task

  • metrics for the task

Click the Job ID value to return to the Runs tab for the job. Click the Job run ID value to return to the job run details.

View recent job runs

You can view a list of currently running and recently completed runs for all jobs in a workspace you have access to, including runs started by external orchestration tools such as Apache Airflow or Azure Data Factory. To view the list of recent job runs:

  1. Click Jobs Icon Workflows in the sidebar. The Jobs list appears.

  2. Click the Job runs tab. The Job runs list appears.

The Job runs list displays:

  • The start time for the run.

  • The name of the job associated with the run.

  • The user name that the job runs as.

  • Whether the run was triggered by a job schedule or an API request, or was manually started.

  • The time elapsed for a currently running job, or the total running time for a completed run.

  • The status of the run, one of Pending, Running, Skipped, Succeeded, Failed, Terminating, Terminated, Internal Error, Timed Out, Canceled, Canceling, or Waiting for Retry.

To view job run details, click the link in the Start time column for the run. To view job details, click the job name in the Job column.

Export job run results

You can export notebook run results and job run logs for all job types.

Export notebook run results

You can persist job runs by exporting their results. For notebook job runs, you can export a rendered notebook that can later be imported into your Databricks workspace.

To export notebook run results for a job with a single task:

  1. On the job detail page, click the View Details link for the run in the Run column of the Completed Runs (past 60 days) table.

  2. Click Export to HTML.

To export notebook run results for a job with multiple tasks:

  1. On the job detail page, click the View Details link for the run in the Run column of the Completed Runs (past 60 days) table.

  2. Click the notebook task to export.

  3. Click Export to HTML.

Export job run logs

You can also export the logs for your job run. You can set up your job to automatically deliver logs to DBFS or S3 through the Job API. See the new_cluster.cluster_log_conf object in the request body passed to the Create a new job operation (POST /jobs/create) in the Jobs API.
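For example, the following is a sketch of a new_cluster object with cluster_log_conf set to deliver logs to a hypothetical DBFS path (on AWS, an s3 destination object can be used instead):

```python
# Hypothetical new_cluster object for POST /api/2.1/jobs/create.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "cluster_log_conf": {
        # Driver and executor logs are delivered to this DBFS path.
        "dbfs": {"destination": "dbfs:/cluster-logs/my-etl-job"},
    },
}
```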

Edit a job

Some configuration options are available on the job, and other options are available on individual tasks. For example, the maximum concurrent runs can be set on the job only, while parameters must be defined for each task.

To change the configuration for a job:

  1. Click Jobs Icon Workflows in the sidebar.

  2. In the Name column, click the job name.

The side panel displays the Job details. You can change the schedule, cluster configuration, notifications, maximum number of concurrent runs, and add or change tags. If job access control is enabled, you can also edit job permissions.

Tags

To add labels or key:value attributes to your job, you can add tags when you edit the job. You can use tags to filter jobs in the Jobs list; for example, you can use a department tag to filter all jobs that belong to a specific department.

Note

Because job tags are not designed to store sensitive information such as personally identifiable information or passwords, Databricks recommends using tags for non-sensitive values only.

Tags also propagate to job clusters created when a job is run, allowing you to use tags with your existing cluster monitoring.

To add or edit tags, click + Tag in the Job details side panel. You can add the tag as a key and value, or a label. To add a label, enter the label in the Key field and leave the Value field empty.

Clusters

To see tasks associated with a cluster, hover over the cluster in the side panel. To change the cluster configuration for all associated tasks, click Configure under the cluster. To configure a new cluster for all associated tasks, click Swap under the cluster.

Maximum concurrent runs

The maximum number of parallel runs for this job. Databricks skips the run if the job has already reached its maximum number of active runs when attempting to start a new run. Set this value higher than the default of 1 to perform multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or you want to trigger multiple runs that differ by their input parameters.

Notifications

To notify when runs of this job begin, complete, or fail, you can add one or more email addresses or system destinations (for example, webhook destinations or Slack).

Note

  • Notifications you set at the job level are not sent when failed tasks are retried. To receive a failure notification after every failed task (including every failed retry), use task notifications instead.

  • System destinations are in Public Preview.

  • System destinations must be configured by an administrator. System destinations are configured by selecting Create new destination in the Edit system notifications dialog or in the admin console.

To add email notifications:

  1. Click Edit email notifications.

  2. Click Add.

  3. Enter an email address and click the check box for each notification type to send to that address.

  4. To enter another email address for notification, click Add.

  5. If you do not want to receive notifications for skipped job runs, click the check box.

  6. Click Confirm.
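These settings map to the email_notifications object in the job's API definition. A sketch with hypothetical addresses:

```python
# Hypothetical email_notifications object in a job's API settings.
email_notifications = {
    "on_start": ["team@example.com"],
    "on_success": ["team@example.com"],
    "on_failure": ["oncall@example.com"],
    # Equivalent of the skipped-runs check box in the dialog:
    "no_alert_for_skipped_runs": True,
}
```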

You can integrate these email notifications with your favorite notification tools.

To add system notifications:

Note

There is a limit of three system destinations for each notification type.

  1. Click Edit system notifications.

  2. In Select a system destination, select a destination and click the check box for each notification type to send to that destination.

  3. To add another destination, click Select a system destination again and select a destination.

  4. Click Confirm.

Control access to jobs

Job access control enables job owners and administrators to grant fine-grained permissions on their jobs. Job owners can choose which other users or groups can view the results of the job. Owners can also choose who can manage their job runs (Run now and Cancel run permissions).

See Jobs access control for details.

Edit a task

To set task configuration options:

  1. Click Jobs Icon Workflows in the sidebar.

  2. In the Name column, click the job name.

  3. Click the Tasks tab.

Task dependencies

You can define the order of execution of tasks in a job using the Depends on dropdown menu. You can set this field to one or more tasks in the job.

Edit task dependencies

Note

Depends on is not visible if the job consists of only a single task.

Configuring task dependencies creates a Directed Acyclic Graph (DAG) of task execution, a common way of representing execution order in job schedulers. For example, consider the following job consisting of four tasks:

Task dependencies example diagram

  • Task 1 is the root task and does not depend on any other task.

  • Task 2 and Task 3 depend on Task 1 completing first.

  • Finally, Task 4 depends on Task 2 and Task 3 completing successfully.

Databricks runs upstream tasks before running downstream tasks, running as many of them in parallel as possible. The following diagram illustrates the order of processing for these tasks:

Task dependencies example flow

Individual task configuration options

Individual tasks have the following configuration options:

Cluster

To configure the cluster where a task runs, click the Cluster dropdown menu. You can edit a shared job cluster, but you cannot delete a shared cluster if it is still used by other tasks.

To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.

Dependent libraries

Dependent libraries are installed on the cluster before the task runs. Define all of a task's library dependencies so that they are installed before the run starts.

To add a dependent library, click Advanced options and select Add Dependent Libraries to open the Add Dependent Library chooser. Follow the recommendations in Library dependencies for specifying dependencies.

Task notifications

To add email notifications for task success or failure, click Advanced options and select Edit notifications. Failure notifications are sent on initial task failure and any subsequent retries.

Task parameter variables

You can pass templated variables into a job task as part of the task’s parameters. These variables are replaced with the appropriate values when the job task runs. You can use task parameter values to pass the context about a job run, such as the run ID or the job’s start time.

When a job runs, the task parameter variable surrounded by double curly braces is replaced and appended to an optional string value included as part of the value. For example, to pass a parameter named MyJobId with a value of my-job-6 for any run of job ID 6, add the following task parameter:

{
  "MyJobId": "my-job-{{job_id}}"
}

The contents of the double curly braces are not evaluated as expressions, so you cannot do operations or functions within double-curly braces. Whitespace is not stripped inside the curly braces, so {{  job_id  }} will not be evaluated.

The following task parameter variables are supported:

  • {{job_id}}: The unique identifier assigned to a job. Example value: 1276862

  • {{run_id}}: The unique identifier assigned to a task run. Example value: 3447843

  • {{start_date}}: The date a task run started. The format is yyyy-MM-dd in UTC timezone. Example value: 2021-02-15

  • {{start_time}}: The timestamp of the run’s start of execution after the cluster is created and ready. The format is milliseconds since UNIX epoch in UTC timezone, as returned by System.currentTimeMillis(). Example value: 1551622063030

  • {{task_retry_count}}: The number of retries that have been attempted to run a task if the first attempt fails. The value is 0 for the first attempt and increments with each retry. Example value: 0

  • {{parent_run_id}}: The unique identifier assigned to the run of a job with multiple tasks. Example value: 3447835

  • {{task_key}}: The unique name assigned to a task that’s part of a job with multiple tasks. Example value: clean_raw_data

You can set these variables with any task when you Create a job, Edit a job, or Run a job with different parameters.

You can also use arbitrary parameters in your Python tasks with task values. You set and get task values using the taskValues subutility in Databricks Utilities. Using task values, you can set a variable in a task and then consume that variable in subsequent tasks in the same job run. The following example sets and gets a name value and an age value:

# In the upstream task, set values to share with other tasks in the run:
dbutils.jobs.taskValues.set('name', 'Some User')
dbutils.jobs.taskValues.set('age', 42)

# In a downstream task, read the values set by the task with key 'my-task':
dbutils.jobs.taskValues.get('my-task', 'name', default='John Doe')
dbutils.jobs.taskValues.get('my-task', 'age')

For more information on using the taskValues subutility, see Jobs utility (dbutils.jobs).

Timeout

The maximum completion time for a job. If the job does not complete in this time, Databricks sets its status to “Timed Out”.

Retries

A policy that determines when and how many times failed runs are retried. To set the retries for the task, click Advanced options and select Edit Retry Policy. The retry interval is calculated in milliseconds between the start of the failed run and the subsequent retry run.

Note

If you configure both Timeout and Retries, the timeout applies to each retry.
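In the Jobs API 2.1, the timeout and retry options map to per-task fields. A sketch with hypothetical values:

```python
# Hypothetical per-task retry and timeout settings in a Jobs API 2.1 task.
task_settings = {
    "task_key": "transform",             # hypothetical task name
    "timeout_seconds": 3600,             # each attempt times out after 1 hour
    "max_retries": 3,                    # retry a failed run up to 3 times
    "min_retry_interval_millis": 60000,  # at least 60s between the start of
                                         # the failed run and the retry run
    "retry_on_timeout": True,            # also retry runs that hit the timeout
}
```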

Clone a job

You can quickly create a new job by cloning an existing job. Cloning a job creates an identical copy of the job, except for the job ID. On the job’s page, click More … next to the job’s name and select Clone from the dropdown menu.

Clone a task

You can quickly create a new task by cloning an existing task:

  1. On the job’s page, click the Tasks tab.

  2. Select the task to clone.

  3. Click Jobs Vertical Ellipsis and select Clone task.

Delete a job

To delete a job, on the job’s page, click More … next to the job’s name and select Delete from the dropdown menu.

Delete a task

To delete a task:

  1. Click the Tasks tab.

  2. Select the task to be deleted.

  3. Click Jobs Vertical Ellipsis and select Remove task.

Copy a task path

To copy the path to a task, for example, a notebook path:

  1. Click the Tasks tab.

  2. Select the task containing the path to copy.

  3. Click Jobs Copy Icon next to the task path to copy the path to the clipboard.

Run jobs using notebooks in a remote Git repository

You can run jobs with notebooks located in a remote Git repository. This feature simplifies creation and management of production jobs and automates continuous deployment:

  • You don’t need to create a separate production repo in Databricks, manage permissions for it, and keep it updated.

  • You can prevent unintentional changes to a production job, such as local edits in the production repo or changes from switching a branch.

  • The job definition process has a single source of truth in the remote repository.

To use notebooks in a remote Git repository, you must Set up Databricks Repos.

To create a task with a notebook located in a remote Git repository:

  1. In the Type dropdown menu, select Notebook.

  2. In the Source dropdown menu, select Git provider. The Git information dialog appears.

  3. In the Git Information dialog, enter details for the repository.

    For Path, enter a relative path to the notebook location, such as etl/notebooks/.

    When you enter the relative path, don’t begin it with / or ./ and don’t include the notebook file extension, such as .py.

Additional notebook tasks in a multitask job can reference the same commit in the remote repository in one of the following ways:

  • sha of $branch/head when git_branch is set

  • sha of $tag when git_tag is set

  • the value of git_commit
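As a sketch, a job definition submitted to the Jobs API can point its notebook tasks at a remote repository with a `git_source` block; the repository URL, branch, and paths below are placeholders:

```json
{
  "name": "Nightly ETL",
  "git_source": {
    "git_url": "https://github.com/example-org/etl-repo",
    "git_provider": "gitHub",
    "git_branch": "main"
  },
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {
        "notebook_path": "etl/notebooks/ingest",
        "source": "GIT"
      }
    }
  ]
}
```

All notebook tasks in the job that use `"source": "GIT"` resolve against the same commit of the repository for a given run.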

Best practices

Cluster configuration tips

Cluster configuration is important when you operationalize a job. The following provides general guidance on choosing and configuring job clusters, followed by recommendations for specific job types.

Use shared job clusters

To optimize resource usage with jobs that orchestrate multiple tasks, use shared job clusters. A shared job cluster allows multiple tasks in the same job run to reuse the cluster. You can use a single job cluster to run all tasks that are part of the job, or multiple job clusters optimized for specific workloads. To use a shared job cluster:

  1. Select New Job Clusters when you create a task and complete the cluster configuration.

  2. Select the new cluster when adding a task to the job, or create a new job cluster. Any cluster you configure when you select New Job Clusters is available to any task in the job.

A shared job cluster is scoped to a single job run, and cannot be used by other jobs or runs of the same job.

Libraries cannot be declared in a shared job cluster configuration. You must add dependent libraries in task settings.
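In Jobs API terms, a shared job cluster is declared once in a `job_clusters` array and referenced by `job_cluster_key` from each task; note that libraries are attached to the tasks, not to the shared cluster. A minimal sketch (Spark version, node type, worker count, notebook paths, and the package name are illustrative):

```json
{
  "job_clusters": [
    {
      "job_cluster_key": "shared_etl_cluster",
      "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 4
      }
    }
  ],
  "tasks": [
    {
      "task_key": "ingest",
      "job_cluster_key": "shared_etl_cluster",
      "libraries": [{ "pypi": { "package": "requests" } }],
      "notebook_task": { "notebook_path": "/Jobs/ingest" }
    },
    {
      "task_key": "transform",
      "depends_on": [{ "task_key": "ingest" }],
      "job_cluster_key": "shared_etl_cluster",
      "notebook_task": { "notebook_path": "/Jobs/transform" }
    }
  ]
}
```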

Choose the correct cluster type for your job

  • New Job Clusters are dedicated clusters for a job or task run. A shared job cluster is created and started when the first task using the cluster starts, and terminates after the last task using the cluster completes; it is not terminated when idle, but only after every task using it has completed. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created. A cluster scoped to a single task is created and started when the task starts and terminates when the task completes. In production, Databricks recommends using new shared or task-scoped clusters so that each job or task runs in a fully isolated environment.

  • When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to the task workload pricing. When you run a task on an existing all-purpose cluster, the task is treated as a data analytics (all-purpose) workload, subject to all-purpose workload pricing.

  • If you select a terminated existing cluster and the job owner has Can Restart permission, Databricks starts the cluster when the job is scheduled to run.

  • Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.

Use a pool to reduce cluster start times

To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.

Automatic availability zones

To take advantage of automatic availability zones (Auto-AZ), enable it with the Clusters API by setting aws_attributes.zone_id = "auto". See Availability zones.
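For reference, the corresponding fragment of a cluster specification in a Clusters API request looks like this (only the relevant attribute is shown):

```json
{
  "aws_attributes": {
    "zone_id": "auto"
  }
}
```

With `"auto"`, Databricks selects an availability zone for the cluster rather than pinning it to a specific zone ID.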

Notebook job tips

Total notebook cell output (the combined output of all notebook cells) is subject to a 20MB size limit. Additionally, individual cell output is subject to an 8MB size limit. If total cell output exceeds 20MB in size, or if the output of an individual cell is larger than 8MB, the run is canceled and marked as failed.

If you need help finding cells near or beyond the limit, run the notebook against an all-purpose cluster and use this notebook autosave technique.

Streaming tasks

Spark Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should be set to run using the cron expression "* * * * * ?" (every minute).

Since a streaming task runs continuously, it should always be the final task in a job.
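The two settings above can be sketched as a fragment of the job definition in the Jobs API (`timezone_id` is an illustrative choice):

```json
{
  "max_concurrent_runs": 1,
  "schedule": {
    "quartz_cron_expression": "* * * * * ?",
    "timezone_id": "UTC"
  }
}
```

The every-minute schedule combined with `max_concurrent_runs: 1` means a new run starts promptly if the streaming job ever stops, but a second run is never started while one is already active.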

JAR jobs

When running a JAR job, keep in mind the following:

Output size limits

Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run is canceled and marked as failed.

To avoid encountering this limit, you can prevent stdout from being returned from the driver to Databricks by setting the spark.databricks.driver.disableScalaOutput Spark configuration to true. By default, the flag value is false. The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written in the cluster’s log files. Setting this flag is recommended only for job clusters for JAR jobs because it will disable notebook results.
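As a sketch, the flag is set in the cluster's Spark configuration, for example in the `new_cluster` specification of the job cluster:

```json
{
  "spark_conf": {
    "spark.databricks.driver.disableScalaOutput": "true"
  }
}
```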

Use the shared SparkContext

Because Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext. Because Databricks initializes the SparkContext, programs that invoke new SparkContext() will fail. To get the SparkContext, use only the shared SparkContext created by Databricks:

val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()

There are also several methods you should avoid when using the shared SparkContext.

  • Do not call SparkContext.stop().

  • Do not call System.exit(0) or sc.stop() at the end of your main program. This can cause undefined behavior.

Use try-finally blocks for job cleanup

Consider a JAR that consists of two parts:

  • jobBody() which contains the main part of the job.

  • jobCleanup() which has to be executed after jobBody() whether that function succeeded or threw an exception.

As an example, jobBody() may create tables, and you can use jobCleanup() to drop these tables.

The safe way to ensure that the cleanup method is called is to put a try-finally block in the code:

try {
  jobBody()
} finally {
  jobCleanup()
}

You should not try to clean up using sys.addShutdownHook(jobCleanup) or the following code:

val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)

Due to the way the lifetime of Spark containers is managed in Databricks, the shutdown hooks are not run reliably.

Configure JAR job parameters

You pass parameters to JAR jobs with a JSON string array. See the spark_jar_task object in the request body passed to the Create a new job operation (POST /jobs/create) in the Jobs API. To access these parameters, inspect the String array passed into your main function.
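A hedged sketch of the `spark_jar_task` object in a job creation request (the class name and parameter values are placeholders); the `parameters` array arrives as the `String` array passed to your `main` function:

```json
{
  "spark_jar_task": {
    "main_class_name": "com.example.etl.Main",
    "parameters": ["--date", "2023-01-01", "--env", "prod"]
  }
}
```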

Library dependencies

The Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your libraries that conflict with them.

To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).

%sh
ls /databricks/jars

Manage library dependencies

A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and Hadoop as provided dependencies. On Maven, add Spark and Hadoop as provided dependencies, as shown in the following example:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>

In sbt, add Spark and Hadoop as provided dependencies, as shown in the following example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"

Tip

Specify the correct Scala version for your dependencies based on the Scala version of the Databricks Runtime your cluster runs.
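For example, in sbt you can pin the Scala version so that `%%` resolves artifacts with the suffix matching your cluster (version numbers below are illustrative, not a recommendation):

```scala
// build.sbt: match scalaVersion to the Scala version of your Databricks Runtime
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0" % "provided"
```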