This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs.
Driver failed to start in time
INTERNAL_ERROR: The Spark driver failed to start within 300 seconds
Cluster failed to be healthy within 200 seconds
The cluster can fail to launch if it connects to an external Hive metastore and tries to download all of the Hive metastore libraries from a Maven repository. A cluster downloads almost 200 JAR files, including dependencies. If the Databricks cluster manager cannot confirm that the driver is ready within 5 minutes (300 seconds), the cluster launch fails. This can happen when the JAR downloads take too long.
Store the Hive libraries in DBFS and access them locally from the DBFS location. See Spark Options with the External Hive Metastore.
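As a rough illustration of this fix, the sketch below builds the two Spark options that point the metastore client at pre-downloaded JARs instead of a Maven download. The keys `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` are standard Spark options. The helper names, the version `2.3.7`, and the directory `/dbfs/hive_metastore_jars` are assumptions chosen for the example, not values from this article.

```python
from typing import Dict


def hive_metastore_spark_conf(version: str, jars_dir: str) -> Dict[str, str]:
    """Build Spark options that load Hive metastore JARs from a local directory.

    jars_dir is a hypothetical directory, visible on the driver's file system,
    into which the JARs were copied ahead of time.
    """
    if not jars_dir.startswith("/"):
        raise ValueError("jars_dir must be an absolute local path")
    # A trailing "/*" is the JVM classpath wildcard for every JAR in a directory.
    classpath = jars_dir.rstrip("/") + "/*"
    return {
        "spark.sql.hive.metastore.version": version,
        "spark.sql.hive.metastore.jars": classpath,
    }


def to_spark_config_lines(conf: Dict[str, str]) -> str:
    """Render the options as one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


# Example values only: use the Hive version and path that match your metastore.
conf = hive_metastore_spark_conf("2.3.7", "/dbfs/hive_metastore_jars")
print(to_spark_config_lines(conf))
```

The printed lines can be pasted into the cluster's Spark configuration. Because the JARs are read from a local path, the driver no longer has to download them during startup.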
The cluster could not be started in 50 minutes. Cause: Timed out with exception after <xxx> attempts
Init scripts that run during the cluster spin-up stage send an RPC (remote procedure call) to each worker machine to run the scripts locally. All RPCs must return their status before the process continues. If any RPC hits an issue and doesn’t respond back (due to a transient networking issue, for example), then the 1-hour timeout can be hit, causing the cluster setup job to fail.
Use a cluster-scoped init script instead of global or cluster-named init scripts. With cluster-scoped init scripts, Databricks does not use synchronous blocking of RPCs to fetch init script execution status.
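A cluster-scoped init script is attached to the cluster definition itself. The sketch below shows one way to add such scripts to a cluster specification before submitting it. The `init_scripts` field and its `dbfs.destination` layout are assumed from the Clusters API and should be checked against the API reference for your workspace. The function name, cluster name, and script path are hypothetical.

```python
import json
from typing import Any, Dict, List


def with_cluster_scoped_init_scripts(
    cluster_spec: Dict[str, Any], script_paths: List[str]
) -> Dict[str, Any]:
    """Return a copy of cluster_spec with init scripts attached to the cluster."""
    for path in script_paths:
        if not path.startswith("dbfs:/"):
            raise ValueError(f"expected a dbfs:/ path, got {path!r}")
    updated = dict(cluster_spec)
    # Assumed field layout: a list of {"dbfs": {"destination": ...}} entries.
    updated["init_scripts"] = [{"dbfs": {"destination": p}} for p in script_paths]
    return updated


# Hypothetical cluster name and script location, for illustration only.
spec = with_cluster_scoped_init_scripts(
    {"cluster_name": "example-cluster"},
    ["dbfs:/databricks/scripts/example-init.sh"],
)
print(json.dumps(spec, indent=2))
```

The input dictionary is left unmodified, so the same base specification can be reused for clusters that need different scripts.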
Library installation timed out after 1800 seconds. Libraries that are not yet installed:
Usually you can fix this problem by re-running the job or restarting the cluster.
The library installer is configured to time out after 30 minutes (the 1800 seconds shown in the error message). While fetching and installing JARs, a timeout can occur due to network problems. To mitigate this issue, download the libraries from Maven to a DBFS location and install them from there.
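The mitigation above can be sketched as a small download helper. It builds the download URL from Maven coordinates using the standard Maven repository layout and copies the JAR into a target directory. The function names are hypothetical, Maven Central is assumed as the repository, and the commented-out target path `/dbfs/FileStore/jars/maven` is an example that assumes DBFS is mounted at `/dbfs`.

```python
import shutil
import urllib.request
from pathlib import Path

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(
    group_id: str, artifact_id: str, version: str, repo: str = MAVEN_CENTRAL
) -> str:
    """Build a JAR URL from Maven coordinates using the standard repository layout."""
    group_path = group_id.replace(".", "/")
    return (
        f"{repo.rstrip('/')}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


def download_jar(url: str, target_dir: str, timeout: float = 60.0) -> Path:
    """Download one JAR into target_dir and return the path of the saved file."""
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


url = maven_jar_url("org.apache.commons", "commons-lang3", "3.12.0")
print(url)
# Hypothetical target directory; uncomment to perform the download.
# download_jar(url, "/dbfs/FileStore/jars/maven")
```

Once the JAR is in DBFS, install it on the cluster from that location rather than by Maven coordinates, so the installer does not have to resolve and fetch it over the network at launch.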
Cluster terminated. Reason: Cloud Provider Limit
See the cloud provider error information in Unexpected Cluster Termination.