Classic compute termination error codes

This article provides troubleshooting guidance for common cluster termination error codes. Use the error code from your cluster event log to find the relevant cause and recommended fix.

AWS_INSUFFICIENT_FREE_ADDRESSES_IN_SUBNET_FAILURE

The AWS subnet has insufficient free IP addresses to launch the requested instances.

Example error message

Not enough free addresses in subnet subnet-[REDACTED] (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterValue; Request ID: [REDACTED]; Proxy: null)

Troubleshooting steps

  1. Check the subnet CIDR range and available IP addresses in the AWS Console.
  2. Review the number of instances currently running in the subnet.
  3. Check for unused elastic network interfaces that may be consuming IP addresses.
  4. Verify whether there are IP address reservations in the subnet.
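To sanity-check step 1, the usable address count for a subnet can be computed locally. This sketch assumes a hypothetical /26 CIDR and accounts for the five addresses AWS reserves in every subnet (network address, router, DNS, one reserved for future use, and broadcast):

```python
import ipaddress

# Hypothetical subnet CIDR; substitute your own from the AWS console.
subnet = ipaddress.ip_network("10.0.1.0/26")
total = subnet.num_addresses        # 64 addresses in a /26
usable = total - 5                  # AWS reserves 5 addresses in every subnet
print(usable)                       # → 59
```

Compare this number against the running instance count from step 2 to see how close the subnet is to exhaustion.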

Recommended fix

Update your cluster to use a different availability zone with sufficient IP addresses, use the auto availability zone setting, expand the subnet CIDR range, or clean up unused network resources. If the issue persists, contact Databricks support.

AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE

AWS does not have sufficient capacity for the requested instance type in the selected availability zone.

Example error messages

We currently do not have sufficient c4.8xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get c4.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1e, us-east-1f.
There is no Spot capacity available that matches your request. (Service: AmazonEC2; Status Code: 500; Error Code: InsufficientInstanceCapacity; Request ID: [REDACTED]; Proxy: null)

Troubleshooting steps

  1. Verify the instance type and availability zone in your cluster configuration.
  2. Check whether the issue affects spot instances only or also affects on-demand instances.
  3. Review the AWS Service Health Dashboard for known capacity issues.
  4. Test with different instance types in the same family.

Recommended fix

Try launching in a different availability zone, use the auto availability zone setting, switch to a different instance type, or use on-demand instances instead of spot. For persistent capacity issues, contact AWS support.
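The auto availability zone setting mentioned above can be expressed in a cluster definition. The following is an illustrative Clusters API payload written as a Python dict; the runtime version and node type are placeholders, and `"auto"` asks Databricks to pick an availability zone with available capacity:

```python
# Hypothetical request body for the Databricks Clusters API.
cluster_spec = {
    "cluster_name": "capacity-fallback-example",
    "spark_version": "15.4.x-scala2.12",   # placeholder runtime version
    "node_type_id": "c5.4xlarge",          # alternative to a constrained instance type
    "num_workers": 4,
    "aws_attributes": {"zone_id": "auto"}, # let Databricks choose the AZ
}
print(cluster_spec["aws_attributes"]["zone_id"])
```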

AWS_RESOURCE_QUOTA_EXCEEDED

The cluster launch would exceed the AWS account's quota for the requested resource type.

Troubleshooting steps

  1. Check the AWS Service Quotas console for current limits and usage.
  2. Identify which specific quota is exceeded (instances, volumes, IPs, and so on).
  3. Review resource usage across all regions.
  4. Check for resources that can be cleaned up.

Recommended fix

Request a quota increase through the AWS Service Quotas console, clean up unused resources, distribute workloads across regions, or use different instance types. Contact AWS support for quota increase requests.

BOOTSTRAP_TIMEOUT_DUE_TO_MISCONFIG

The VM bootstrap process timed out due to network connectivity issues, slow artifact downloads, or issues with the cloud provider. The bootstrap timeout is 700 seconds.

Example error message

[id: InstanceId([REDACTED]), status: INSTANCE_INITIALIZING, ...] with threshold 700 seconds timed out after 703891 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason

Troubleshooting steps

  1. Check connectivity to Databricks artifact storage.
  2. Verify connectivity to the Databricks control plane.
  3. Check DNS resolution for Databricks endpoints.
  4. Verify firewall and security group rules.
  5. Test whether the issue is consistent or intermittent.
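Steps 1 through 4 can be approximated with a small probe run from a VM in the same network. The helper below is a sketch using only the standard library; the host you pass it should be the actual storage or control plane endpoint from your workspace configuration:

```python
import socket

def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Resolve a hostname and attempt a TCP connection to it."""
    result = {"host": host, "resolved": None, "tcp_ok": False}
    try:
        result["resolved"] = socket.gethostbyname(host)
    except socket.gaierror:
        return result  # DNS failure: check custom DNS servers and VPC DNS settings
    try:
        with socket.create_connection((result["resolved"], port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError:
        pass  # TCP failure: check security groups, firewall rules, and routes
    return result
```

A DNS failure and a TCP failure point at different fixes: the former at resolver configuration, the latter at firewall or routing rules.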

Recommended fix

Ensure network connectivity to the Databricks storage and control plane endpoints. Configure service endpoints or VPC endpoints for better network performance. Review firewall, DNS, and routing configuration. Contact Databricks support if the network configuration is verified but timeouts persist.

CONTROL_PLANE_REQUEST_FAILURE_DUE_TO_MISCONFIG

VMs cannot reach the Databricks control plane due to DNS resolution failures, firewall rules, or network misconfiguration.

Example error message

Network health check reported that instance is unable to reach Databricks Control Plane. Please check that instances have connectivity to the Databricks Control Plane. Instance bootstrap inferred timeout reason: NetworkHealthCheck_CP_Failed

Troubleshooting steps

  1. Decode any Base64-encoded error messages in the cluster event log.
  2. Check DNS settings in your network configuration.
  3. Review firewall rules and network security settings.
  4. Test control plane connectivity from a VM in the same network.
  5. Verify custom DNS servers are functional and reachable.
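For step 1, a Base64 payload from the event log can be decoded in a few lines; the encoded string here is a made-up example, not a real log entry:

```python
import base64

# Hypothetical encoded payload; real ones appear in the cluster event log.
encoded = "Tm90IGVub3VnaCBmcmVlIGFkZHJlc3Nlcw=="
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # → Not enough free addresses
```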

Recommended fix

Verify DNS server configuration and reachability. Ensure firewall rules allow outbound traffic to the Databricks control plane.

Contact Databricks support if the network configuration appears correct but the issue persists.

DOCKER_IMAGE_PULL_FAILURE

The cluster failed to download the Docker image from the container registry due to network, authentication, or configuration issues.

Example error message

Failed to pull docker image: authentication required

Troubleshooting steps

  1. Verify the Docker image name and tag are correct in the cluster configuration.
  2. Check network connectivity to the container registry from the workspace.
  3. Test registry access from a VM in the same network.
  4. Verify authentication credentials for private registries.
  5. Review node daemon logs for detailed error messages.

Recommended fix

Correct the Docker image configuration and verify authentication credentials. Ensure network rules allow access to the container registry.

For AWS ECR, configure VPC endpoints so that image pulls don't route through the public internet.

Contact Databricks support if the configuration appears correct but the issue persists.

DOCKER_IMAGE_TOO_LARGE_FOR_INSTANCE_EXCEPTION

The Docker image size exceeds the available disk space on the selected instance type.

Example error message

Failed to launch container as the docker image is too large for the instance.

Troubleshooting steps

  1. Check the Docker image size.
  2. Review the instance type's disk capacity.
  3. Identify unnecessary layers or files in the Docker image.
  4. Check whether multiple large images are being used.

Recommended fix

Use an instance type with a larger disk capacity, optimize the Docker image by removing unnecessary files and layers, use multi-stage builds to reduce image size, or split functionality across multiple smaller images. Contact Databricks support for assistance with image optimization.

EOS_SPARK_IMAGE

The Databricks Runtime (DBR) version configured for the cluster has reached end of support (EOS).

Example error message

Spark image release__11.0.x-snapshot-cpu-ml-scala2.12__databricks-universe__head__[REDACTED]__format-2 does not exist with exit code 2

Troubleshooting steps

  1. Check the DBR version in the cluster configuration.
  2. Review the Databricks Runtime release notes for EOS dates.
  3. Identify which DBR versions are currently supported.
  4. Check whether notebooks or jobs have DBR version dependencies.

Recommended fix

Update the cluster configuration to use a supported Databricks Runtime version. Review compatibility requirements for libraries and code before deploying to production. Contact Databricks support if you need assistance with DBR migration.

INSTANCE_POOL_MAX_CAPACITY_REACHED

The instance pool has reached its configured maximum capacity limit and cannot provide additional instances.

Example error message

Instance pool is full, please consider increasing the pool size

Troubleshooting steps

  1. Check the instance pool configuration for the maximum capacity setting.
  2. Review how many instances are currently in use from the pool.
  3. Identify which clusters are using the pool.
  4. Check whether there are idle instances that can be freed.

Recommended fix

Increase the instance pool maximum capacity, create additional instance pools to distribute load, terminate idle clusters using the pool, or configure clusters to use different pools. Review pool sizing based on concurrent workload requirements.
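Pool sizing for concurrent workloads comes down to simple arithmetic. This sketch assumes each cluster draws both its driver and its workers from the same pool:

```python
def required_pool_capacity(max_workers_per_cluster):
    """Each cluster needs its maximum worker count plus one driver node,
    assuming drivers are drawn from the same pool as workers."""
    return sum(n + 1 for n in max_workers_per_cluster)

# Three clusters autoscaling up to 8, 4, and 2 workers respectively:
print(required_pool_capacity([8, 4, 2]))  # → 17
```

If the result exceeds the pool's maximum capacity, either raise the capacity or move some clusters to another pool.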

INSTANCE_UNREACHABLE_DUE_TO_MISCONFIG

Instances are unreachable due to network misconfiguration, firewall rules, or connectivity issues.

Example error message

Bootstrap completes in the VM but control plane failed to reach the node. Please review your network configuration or firewall settings to allow Databricks to reach the node.

Troubleshooting steps

  1. Review firewall rules and network security settings for required inbound ports.
  2. Test connectivity from the control plane to the instance network.
  3. Check for asymmetric routing issues.
  4. Review firewall logs for dropped connections.
  5. Verify that instances have the correct security group assignments.

Recommended fix

Ensure security groups or NSGs allow required inbound traffic from the Databricks control plane. Verify that route tables enable bidirectional communication. Contact Databricks support for assistance with network connectivity troubleshooting.

INVALID_ARGUMENT

Invalid configuration parameters, missing secrets, incorrect permissions, or misconfigured cluster settings prevented the cluster from starting.

Example error message

com.databricks.backend.manager.secret.SecretPermissionDeniedException: User does not have permission with scope: [REDACTED] and key: [REDACTED]

Troubleshooting steps

  1. Review the error message to identify the specific invalid parameter.
  2. For secret errors, verify the secret scope and key exist using the Databricks Secrets API.
  3. Check user or service principal permissions for accessing secrets.
  4. Review the cluster configuration for syntax errors.
  5. Check init scripts for configuration errors.

Recommended fix

Correct the invalid parameter based on the error message. For secrets, verify scope and key existence, check permissions, and ensure network connectivity to secret providers. Validate all cluster configuration against the documentation. Contact Databricks support if the configuration appears correct.
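Step 2's scope check can be scripted against the Secrets API. The following is a minimal stdlib-only sketch; the workspace host and access token are values you supply:

```python
import json
import urllib.request

def scopes_request(host: str, token: str) -> urllib.request.Request:
    """Build the Secrets API call used to confirm a scope exists."""
    return urllib.request.Request(
        f"{host}/api/2.0/secrets/scopes/list",
        headers={"Authorization": f"Bearer {token}"},
    )

def list_secret_scopes(host: str, token: str) -> list:
    """Send the request; requires network access to the workspace."""
    with urllib.request.urlopen(scopes_request(host, token)) as resp:
        return [s["name"] for s in json.load(resp).get("scopes", [])]
```

If the scope referenced by the cluster is missing from the returned list, the error is a configuration problem rather than a permissions one.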

NETWORK_CHECK_CONTROL_PLANE_FAILURE

A pre-bootstrap network health check failed when attempting to reach the Databricks control plane.

Example error message

Instance failed network health check before bootstrapping with fatal error: X_NHC_CONTROL_PLANE_UNREACHABLE
1 failed component(s): control_plane
Retryable: true

Troubleshooting steps

  1. Review cluster event logs for specific connection failure details.
  2. Test control plane connectivity from a VM in the same network.
  3. Check whether a firewall is intercepting or blocking traffic.

Recommended fix

Verify that security group or NSG rules allow outbound traffic to the Databricks control plane. If you use user-defined routes (UDR) with a firewall, ensure that traffic to Databricks service tags is routed to the internet. Contact Databricks support if the network configuration has been verified as correct.

NETWORK_CONFIGURATION_FAILURE

A network configuration error is preventing proper VM or cluster network setup.

Troubleshooting steps

  1. Review firewall and security group or NSG rules.
  2. Check route tables and routing configuration.
  3. Verify subnet configuration.
  4. Check for IP address conflicts.
  5. Verify DNS settings.

Recommended fix

Correct the network configuration based on the specific error. Ensure security group or NSG rules allow required traffic, verify that subnet CIDR ranges don't overlap, check that route tables are properly configured, and ensure DNS is functional. Contact Databricks support for network configuration review.
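Overlapping CIDR ranges (the second check above) can be detected with the standard library; the two subnets below are hypothetical:

```python
import ipaddress

# The /22 spans 10.0.0.0-10.0.3.255, so it overlaps the /24 carved from it.
a = ipaddress.ip_network("10.0.0.0/22")
b = ipaddress.ip_network("10.0.2.0/24")
print(a.overlaps(b))  # → True
```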

REQUEST_THROTTLED

API requests to the cloud provider are being throttled due to rate limiting.

Example error message

TEMPORARILY_UNAVAILABLE: Too many requests from workspace [REDACTED]

Troubleshooting steps

  1. Check whether multiple clusters are launching simultaneously.
  2. Review API request rate limits for your account.
  3. Identify whether other services are making concurrent API calls.
  4. Check for automated systems making frequent requests.

Recommended fix

Reduce concurrent cluster launches, request an API rate limit increase from your cloud provider, implement exponential backoff in automation scripts, or stagger cluster launch times.
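Exponential backoff in automation scripts can be as small as the sketch below; in real code you would catch the provider's specific throttling exception rather than a bare `Exception`:

```python
import random
import time

def call_with_backoff(call, max_retries=5, base=1.0, cap=60.0):
    """Retry a throttled call, doubling the delay each attempt with jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch only the throttling error type
            if attempt == max_retries - 1:
                raise
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
            time.sleep(delay)
```

The jitter factor spreads retries from concurrent callers so they don't all hit the rate limit again at the same instant.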

SPOT_INSTANCE_TERMINATION

Spot or preemptible instances were terminated by the cloud provider due to capacity needs or pricing changes.

Example error message

Server.SpotInstanceTermination: Spot instance termination

Troubleshooting steps

  1. Check the cluster event logs for the termination timestamp.
  2. Review spot pricing history in your region.
  3. Identify whether terminations occur at specific times.
  4. Check whether multiple instances were terminated simultaneously.

Recommended fix

Switch to on-demand instances for production workloads, implement job retry logic to handle interruptions, or use a mix of on-demand and spot instances. Spot instances are best for fault-tolerant workloads.
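A mixed on-demand/spot setup can be declared through the cluster's `aws_attributes`. The values below are illustrative, not a recommendation for any particular workload:

```python
# Hypothetical aws_attributes block for a cluster mixing on-demand and spot.
aws_attributes = {
    "first_on_demand": 1,                  # keep the driver on an on-demand node
    "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if spot is unavailable
    "spot_bid_price_percent": 100,         # bid up to the on-demand price
}
print(aws_attributes["availability"])
```

Keeping the first node on-demand protects the driver, so a spot reclaim interrupts only worker tasks, which Spark can rerun.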

STORAGE_DOWNLOAD_FAILURE_SLOW

Artifact downloads from Databricks storage are failing or running too slowly due to network connectivity, firewall, or DNS issues.

Example error message

Instance bootstrap failed command: Command_UpdateWorker
Failure message: Trying DNS probe for: https://[REDACTED].blob.core.windows.net/update/worker-artifacts/...

Troubleshooting steps

  1. Check firewall rules for Databricks storage endpoints.
  2. Verify DNS resolution for storage URLs.
  3. Test download speed from a VM in the same network.
  4. Review network bandwidth utilization.
  5. Check for proxy or network inspection devices.
  6. Verify routes to storage endpoints.

Recommended fix

Ensure firewall rules allow access to Databricks storage endpoints.

Configure VPC endpoints for S3 to avoid routing artifact downloads through the public internet.

Review and tune any network inspection devices in the path. Contact Databricks support if connectivity to the storage endpoints is verified but downloads still fail.

WORKSPACE_CONFIGURATION_ERROR

Workspace-level misconfiguration is preventing cluster launch, including issues with IAM roles or service principal permissions.

Example error message

User: arn:aws:iam::[REDACTED]:user/ConsolidatedManagerIAMUser is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::[REDACTED]:role/databricks-workspace-stack-role

Troubleshooting steps

  1. Review recent changes to workspace configuration.
  2. Check the cloud provider console for policy or permission changes.
  3. Verify the cross-account IAM role trust relationship configuration and instance profile permissions to assume required roles.

Recommended fix

Verify IAM role trust relationships and instance profile permissions. Review workspace security configuration.

Contact Databricks support if the workspace configuration appears correct or if the cross-account role setup needs verification.
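For reference, a cross-account trust relationship generally takes the shape below, written here as a Python dict for readability. The principal account ID and external ID are placeholders you must replace with your deployment's actual values:

```python
import json

# Illustrative trust policy for the cross-account role; the account ID and
# external ID are placeholders, not real Databricks values.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::<databricks-account-id>:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "<your-external-id>"}},
    }],
}
print(json.dumps(trust_policy, indent=2))
```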