Troubleshoot and repair job failures

Suppose you’ve been notified (for example, through an email notification, a monitoring solution, or the Databricks Jobs UI) that a task has failed in a run of your Databricks job. This article helps you identify why the run failed, suggests fixes for common causes, and explains how to repair the failed run.

Identify the cause of failure

To find the failed task in the Databricks Jobs UI:

  1. Click Jobs in the sidebar.

  2. In the Name column, click a job name. The Runs tab shows active runs and completed runs, including any unsuccessful runs.

  3. To switch to a matrix view, click Matrix. The matrix view shows a history of runs for the job, including successful and failed runs for each job task. Using the matrix view, you can quickly identify the task failures for your job run.

  4. Hover over a failed task to see associated metadata. This metadata includes the start and end dates, status, duration, cluster details, and, in some cases, an error message.

  5. To help identify the cause of the failure, click the failed task. The Task run details page appears, displaying the output, error message, and associated metadata for the task.
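
The same triage can also be scripted. The following is a minimal sketch, not part of the UI workflow above. It assumes the run details have already been retrieved as JSON (for example, from the Jobs REST API endpoint `GET /api/2.1/jobs/runs/get`) and that each task entry carries a `state` object with `result_state` and `state_message` fields. The helper name `failed_tasks`, the set of states treated as unsuccessful, and the sample payload are all hypothetical.

```python
from typing import Any, Dict, List

# Result states this sketch treats as unsuccessful (an assumption, not an exhaustive list).
UNSUCCESSFUL_STATES = {"FAILED", "TIMEDOUT", "CANCELED"}


def failed_tasks(run: Dict[str, Any]) -> List[Dict[str, str]]:
    """Collect the key, result state, and message of each unsuccessful task."""
    failures = []
    for task in run.get("tasks", []):
        state = task.get("state", {})
        result = state.get("result_state")
        if result in UNSUCCESSFUL_STATES:
            failures.append(
                {
                    "task_key": task.get("task_key", ""),
                    "result_state": result,
                    "message": state.get("state_message", ""),
                }
            )
    return failures


# Hypothetical run payload: one successful task, one failed task,
# and one downstream task that was skipped.
sample_run = {
    "run_id": 1001,
    "tasks": [
        {"task_key": "ingest", "state": {"result_state": "SUCCESS"}},
        {
            "task_key": "transform",
            "state": {"result_state": "FAILED", "state_message": "Table not found"},
        },
        {"task_key": "report", "state": {"life_cycle_state": "SKIPPED"}},
    ],
}

for failure in failed_tasks(sample_run):
    print(f"{failure['task_key']}: {failure['result_state']} ({failure['message']})")
```

Keeping the filter as a pure function over the response body makes it easy to test without a workspace. How the payload is fetched and authenticated is left out deliberately.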

Fix the cause of failure

A task can fail for many reasons, for example, a data quality issue, a misconfiguration, or insufficient compute resources. The following are suggested steps to fix some common causes of task failures:

  • If the failure is related to the task configuration, click Edit task. The task configuration opens in a new tab. Update the task configuration as required and click Save task.

  • If the issue is related to cluster resources, for example, insufficient instances, there are several options:

    • If your job is configured to use a job cluster, consider using a shared all-purpose cluster.

    • Change the cluster configuration. Click Edit task. In the Job details panel, under Compute, click Configure to configure the cluster. You can change the number of workers, the instance types, or other cluster configuration options. You can also click Swap to switch to another available cluster. To ensure you’re making optimal use of available resources, review best practices for cluster configuration.

    • If necessary, ask an administrator to increase resource quotas in the cloud account and region where your workspace is deployed.

  • If the failure is caused by exceeding the maximum concurrent runs, either:

    • Wait for other runs to complete.

    • Click Edit task. In the Job details panel, click Edit concurrent runs, enter a new value for Maximum concurrent runs, and click Confirm.
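
If you manage job settings programmatically, the concurrency limit can be raised the same way. The following is a minimal sketch that only builds a request body. It assumes the Jobs REST API endpoint `POST /api/2.1/jobs/update` accepts a partial `new_settings` object containing `max_concurrent_runs`. The helper name `concurrency_update` and the job ID are hypothetical.

```python
import json
from typing import Any, Dict


def concurrency_update(job_id: int, max_concurrent_runs: int) -> Dict[str, Any]:
    """Build a partial job update that changes only the concurrent-run limit."""
    if max_concurrent_runs < 0:
        raise ValueError("max_concurrent_runs must not be negative")
    return {
        "job_id": job_id,
        "new_settings": {"max_concurrent_runs": max_concurrent_runs},
    }


# Hypothetical job ID. Under the assumption above, this body would be
# sent with POST /api/2.1/jobs/update.
body = concurrency_update(job_id=123, max_concurrent_runs=3)
print(json.dumps(body, indent=2))
```

Sending only the changed field, rather than the full job definition, leaves the rest of the job configuration untouched.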

In some cases, the cause of a failure is upstream of your job, for example, when an external data source is unavailable. After the external issue is resolved, you can still use the repair run feature covered in the next section.

Re-run failed and skipped tasks

The repair and rerun feature in Databricks Jobs allows you to re-run failed tasks and any downstream tasks that were skipped. After fixing the cause of the failure, you can repair the run with the following steps:

  1. Click the link for the unsuccessful run in the Start time column of the Completed Runs (past 60 days) table or click the failed task in the matrix view. The Job run details page appears.

  2. Click Repair run. The Repair job run dialog appears, listing all unsuccessful tasks and any dependent tasks that will be re-run.

  3. To add or edit parameters for the tasks to repair, enter the parameters in the Repair job run dialog. Parameters you enter in the Repair job run dialog override existing values. On subsequent repair runs, you can return a parameter to its original value by clearing the key and value in the Repair job run dialog.

  4. Click Repair run in the Repair job run dialog.

  5. After the repair run finishes, the matrix view is updated with a new column for the repaired run. Any failed tasks that were red should now be green, indicating a successful run for your entire job.
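
A repair can also be requested outside the UI. The following is a minimal sketch that builds the request body only. It assumes the Jobs REST API endpoint `POST /api/2.1/jobs/runs/repair` accepts `run_id`, `rerun_tasks`, `notebook_params`, and `latest_repair_id` fields, and that the tasks being repaired are notebook tasks. The helper name `repair_request`, the run ID, the repair ID, the task keys, and the parameter values are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def repair_request(
    run_id: int,
    task_keys: List[str],
    notebook_params: Optional[Dict[str, str]] = None,
    latest_repair_id: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a repair request for the given run and task keys."""
    body: Dict[str, Any] = {"run_id": run_id, "rerun_tasks": list(task_keys)}
    if notebook_params:
        # Parameters supplied here override the values used by the original run.
        body["notebook_params"] = dict(notebook_params)
    if latest_repair_id is not None:
        # Assumed to be needed when repairing a run that was already repaired once.
        body["latest_repair_id"] = latest_repair_id
    return body


# First repair: re-run the failed task and the skipped downstream task
# with an overridden parameter (all values are hypothetical).
first_repair = repair_request(
    run_id=1001,
    task_keys=["transform", "report"],
    notebook_params={"input_date": "2024-01-01"},
)

# Later repair of the same run: omit the override so the parameter
# falls back to its original value.
second_repair = repair_request(
    run_id=1001,
    task_keys=["report"],
    latest_repair_id=5001,
)

print(json.dumps(first_repair, indent=2))
print(json.dumps(second_repair, indent=2))
```

Omitting `notebook_params` in the second request mirrors clearing the key and value in the Repair job run dialog.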