Failing jobs or executors removed

So you’re seeing failed jobs or removed executors:

[Screenshot: Failing Jobs]

The most common reasons for executors being removed are:

  • Autoscaling: In this case it’s expected and not an error. See Enable autoscaling.

  • Spot instance losses: The cloud provider is reclaiming your VMs. You can learn more about Spot instances here.

  • Executors running out of memory
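
If you’re not sure whether the first two causes apply, the cluster’s own configuration will tell you. Below is a minimal sketch that pulls the cluster spec through the Databricks Clusters API and prints the autoscaling and spot-related settings; the workspace URL, token, and cluster ID are placeholders, and the aws_attributes field assumes an AWS workspace (Azure and GCP use their own attribute blocks).

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder token
CLUSTER_ID = "<cluster-id>"                              # placeholder cluster ID

resp = requests.get(
    f"{HOST}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()
cluster = resp.json()

# Present only when autoscaling is enabled; workers scale between min and max.
print("autoscale:", cluster.get("autoscale"))

# On AWS, an availability of SPOT or SPOT_WITH_FALLBACK means the cloud
# provider is allowed to reclaim worker nodes.
print("aws_attributes:", cluster.get("aws_attributes"))
```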

Failing jobs

If you see any failing jobs, click on them to get to their pages. Then scroll down to see the failed stage and a failure reason:

[Screenshot: Failure Reason]

You may get a generic error. Click on the link in the description to see if you can get more info:

[Screenshot: Failure Description]

If you scroll down on this page, you will be able to see why each task failed. In this case, it becomes clear there’s a memory issue:

[Screenshot: Failed Tasks]
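
If you’d rather pull the same information programmatically, Spark’s monitoring REST API exposes failed stages along with the recorded failure reason. A minimal sketch, assuming you can reach the application’s UI endpoint directly (for example http://<driver-host>:4040 on a generic Spark deployment; on Databricks the Spark UI is proxied, so the URL differs) and that you know the application ID:

```python
import requests

SPARK_UI = "http://<driver-host>:4040"   # placeholder Spark UI endpoint
APP_ID = "<application-id>"              # placeholder application ID

# List only the stages that ended in failure.
stages = requests.get(
    f"{SPARK_UI}/api/v1/applications/{APP_ID}/stages",
    params={"status": "failed"},
).json()

for stage in stages:
    print(f"Stage {stage['stageId']}: {stage['name']}")
    # failureReason is recorded for failed stages; an OutOfMemoryError here is a
    # strong hint that executors are running out of memory.
    print("  reason:", stage.get("failureReason", "<none recorded>"))
```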

Failing executors

To find out why your executors are failing, first check the compute’s Event log to see if there’s an explanation. For example, it’s possible you’re using spot instances and the cloud provider is reclaiming them.

[Screenshot: Event Log]

See if there are any events explaining the loss of executors. For example, you may see messages indicating that the cluster is resizing or spot instances are being lost.
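
The same events can also be pulled programmatically. A minimal sketch, assuming a Databricks workspace and the Clusters API events endpoint (the workspace URL, token, and cluster ID are placeholders):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder token
CLUSTER_ID = "<cluster-id>"                              # placeholder cluster ID

resp = requests.post(
    f"{HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID, "limit": 50},
)
resp.raise_for_status()

for event in resp.json().get("events", []):
    # Event types such as RESIZING or NODES_LOST suggest the executor removals
    # are expected (autoscaling) or caused by the cloud provider, not your code.
    print(event["timestamp"], event["type"], event.get("details", {}))
```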

If you don’t see any information in the event log, navigate back to the Spark UI, then click the Executors tab:

[Screenshot: Executors]

Here you can get the logs from the failed executors:

[Screenshot: Failed Executors]
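
As an alternative to clicking through the UI, the monitoring REST API can list every executor the application has had, including removed ones, together with links to their logs where the cluster manager exposes them. A minimal sketch, with the same placeholder endpoint assumptions as above; note that the removeReason field is only populated on reasonably recent Spark versions:

```python
import requests

SPARK_UI = "http://<driver-host>:4040"   # placeholder Spark UI endpoint
APP_ID = "<application-id>"              # placeholder application ID

# Unlike /executors, /allexecutors also returns executors that are gone.
executors = requests.get(
    f"{SPARK_UI}/api/v1/applications/{APP_ID}/allexecutors"
).json()

for ex in executors:
    if not ex["isActive"]:
        print(f"Executor {ex['id']} removed")
        print("  reason:", ex.get("removeReason", "<no reason recorded>"))
        # Links to stdout/stderr, when the cluster manager provides them.
        print("  logs:", ex.get("executorLogs", {}))
```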

Next step

If you’ve gotten this far, the likeliest explanation is a memory issue. The next step is to dig into that. See Spark memory issues.
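
As a preview of that investigation, the knobs you’ll most often end up adjusting are the executor memory settings. A minimal, hypothetical sketch of where they live when an application is launched; the values are placeholders, not recommendations, and on Databricks these are normally set in the cluster’s Spark config rather than in notebook code:

```python
from pyspark.sql import SparkSession

# Placeholder sizes -- choose them based on what the memory investigation shows.
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.executor.memory", "8g")           # JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap / overhead allowance
    .getOrCreate()
)
```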