Classic compute termination error codes

This article provides troubleshooting guidance for common cluster termination error codes. Use the error code from your cluster event log to find the relevant cause and recommended fix.

AWS_INSUFFICIENT_FREE_ADDRESSES_IN_SUBNET_FAILURE

The AWS subnet has insufficient free IP addresses to launch the requested instances.

Example error message

Not enough free addresses in subnet subnet-[REDACTED] (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterValue; Request ID: [REDACTED]; Proxy: null)

Troubleshooting steps

  1. Check the subnet CIDR range and available IP addresses in the AWS Console.
  2. Review the number of instances currently running in the subnet.
  3. Check for unused elastic network interfaces that may be consuming IP addresses.
  4. Verify whether there are IP address reservations in the subnet.
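To sanity-check step 1, the usable address count for a subnet can be computed locally. This sketch assumes a hypothetical /26 CIDR and accounts for the five addresses AWS reserves in every subnet (network address, router, DNS, one reserved for future use, and broadcast):

```python
import ipaddress

# Hypothetical subnet CIDR; substitute your own from the AWS console.
subnet = ipaddress.ip_network("10.0.1.0/26")
total = subnet.num_addresses        # 64 addresses in a /26
usable = total - 5                  # AWS reserves 5 addresses in every subnet
print(usable)                       # → 59
```

Compare this number against the running instance count from step 2 to see how close the subnet is to exhaustion.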

Recommended fix

Update your cluster to use a different availability zone with sufficient IP addresses, use the auto availability zone setting, expand the subnet CIDR range, or clean up unused network resources. If the issue persists, contact Databricks support.

AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE

AWS does not have sufficient capacity for the requested instance type in the selected availability zone.

Example error messages

We currently do not have sufficient c4.8xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get c4.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1e, us-east-1f.
There is no Spot capacity available that matches your request. (Service: AmazonEC2; Status Code: 500; Error Code: InsufficientInstanceCapacity; Request ID: [REDACTED]; Proxy: null)

Troubleshooting steps

  1. Verify the instance type and availability zone in your cluster configuration.
  2. Check whether the issue affects spot instances only or also affects on-demand instances.
  3. Review the AWS Service Health Dashboard for known capacity issues.
  4. Test with different instance types in the same family.

Recommended fix

Try launching in a different availability zone, use the auto availability zone setting, switch to a different instance type, or use on-demand instances instead of spot. For persistent capacity issues, contact AWS support.
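The auto availability zone setting mentioned above can be expressed in a cluster definition. The following is an illustrative Clusters API payload written as a Python dict; the runtime version and node type are placeholders, and `"auto"` asks Databricks to pick an availability zone with available capacity:

```python
# Hypothetical request body for the Databricks Clusters API.
cluster_spec = {
    "cluster_name": "capacity-fallback-example",
    "spark_version": "15.4.x-scala2.12",   # placeholder runtime version
    "node_type_id": "c5.4xlarge",          # alternative to a constrained instance type
    "num_workers": 4,
    "aws_attributes": {"zone_id": "auto"}, # let Databricks choose the AZ
}
print(cluster_spec["aws_attributes"]["zone_id"])
```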

AWS_RESOURCE_QUOTA_EXCEEDED

The cluster launch would exceed the AWS account's quota for the requested resource type.

Troubleshooting steps

  1. Check the AWS Service Quotas console for current limits and usage.
  2. Identify which specific quota is exceeded (instances, volumes, IPs, and so on).
  3. Review resource usage across all regions.
  4. Check for resources that can be cleaned up.

Recommended fix

Request a quota increase through the AWS Service Quotas console, clean up unused resources, distribute workloads across regions, or use different instance types. Contact AWS support for quota increase requests.

BOOTSTRAP_TIMEOUT_DUE_TO_MISCONFIG

The VM bootstrap process timed out due to network connectivity issues, slow artifact downloads, or issues with the cloud provider. The bootstrap timeout is 700 seconds.

Example error message

[id: InstanceId([REDACTED]), status: INSTANCE_INITIALIZING, ...] with threshold 700 seconds timed out after 703891 milliseconds. Instance bootstrap inferred timeout reason: UnknownReason

Troubleshooting steps

  1. Check connectivity to Databricks artifact storage.
  2. Verify connectivity to the Databricks control plane.
  3. Check DNS resolution for Databricks endpoints.
  4. Verify firewall and security group rules.
  5. Test whether the issue is consistent or intermittent.
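Steps 1 through 4 can be approximated with a small probe run from a VM in the same network. The helper below is a sketch using only the standard library; the host you pass it should be the actual storage or control plane endpoint from your workspace configuration:

```python
import socket

def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Resolve a hostname and attempt a TCP connection to it."""
    result = {"host": host, "resolved": None, "tcp_ok": False}
    try:
        result["resolved"] = socket.gethostbyname(host)
    except socket.gaierror:
        return result  # DNS failure: check custom DNS servers and VPC DNS settings
    try:
        with socket.create_connection((result["resolved"], port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError:
        pass  # TCP failure: check security groups, firewall rules, and routes
    return result
```

A DNS failure and a TCP failure point at different fixes: the former at resolver configuration, the latter at firewall or routing rules.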

Recommended fix

Ensure network connectivity to the Databricks storage and control plane endpoints. Configure service endpoints or VPC endpoints for better network performance. Review firewall, DNS, and routing configuration. Contact Databricks support if the network configuration is verified but timeouts persist.

CONTROL_PLANE_REQUEST_FAILURE_DUE_TO_MISCONFIG

VMs cannot reach the Databricks control plane due to DNS resolution failures, firewall rules, or network misconfiguration.

Example error message

Network health check reported that instance is unable to reach Databricks Control Plane. Please check that instances have connectivity to the Databricks Control Plane. Instance bootstrap inferred timeout reason: NetworkHealthCheck_CP_Failed

Troubleshooting steps

  1. Decode any Base64-encoded error messages in the cluster event log.
  2. Check DNS settings in your network configuration.
  3. Review firewall rules and network security settings.
  4. Test control plane connectivity from a VM in the same network.
  5. Verify custom DNS servers are functional and reachable.
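For step 1, a Base64 payload from the event log can be decoded in a few lines; the encoded string here is a made-up example, not a real log entry:

```python
import base64

# Hypothetical encoded payload; real ones appear in the cluster event log.
encoded = "Tm90IGVub3VnaCBmcmVlIGFkZHJlc3Nlcw=="
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # → Not enough free addresses
```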

Recommended fix

Verify DNS server configuration and reachability. Ensure firewall rules allow outbound traffic to the Databricks control plane.

Contact Databricks support if the network configuration appears correct but the issue persists.

DOCKER_IMAGE_PULL_FAILURE

The cluster failed to download the Docker image from the container registry due to network, authentication, or configuration issues.

Example error message

Failed to pull docker image: authentication required

Troubleshooting steps

  1. Verify the Docker image name and tag are correct in the cluster configuration.
  2. Check network connectivity to the container registry from the workspace.
  3. Test registry access from a VM in the same network.
  4. Verify authentication credentials for private registries.
  5. Review node daemon logs for detailed error messages.

Recommended fix

Correct the Docker image configuration and verify authentication credentials. Ensure network rules allow access to the container registry.

For AWS ECR, configure VPC endpoints so that image pulls don't route through the public internet.

Contact Databricks support if the configuration appears correct but the issue persists.

DOCKER_IMAGE_TOO_LARGE_FOR_INSTANCE_EXCEPTION

The Docker image size exceeds the available disk space on the selected instance type.

Example error message

Failed to launch container as the docker image is too large for the instance.

Troubleshooting steps

  1. Check the Docker image size.
  2. Review the instance type's disk capacity.
  3. Identify unnecessary layers or files in the Docker image.
  4. Check whether multiple large images are being used.

Recommended fix

Use an instance type with a larger disk capacity, optimize the Docker image by removing unnecessary files and layers, use multi-stage builds to reduce image size, or split functionality across multiple smaller images. Contact Databricks support for assistance with image optimization.

EOS_SPARK_IMAGE

The Databricks Runtime (DBR) version configured for the cluster has reached end of support (EOS).

Example error message

Spark image release__11.0.x-snapshot-cpu-ml-scala2.12__databricks-universe__head__[REDACTED]__format-2 does not exist with exit code 2

Troubleshooting steps

  1. Check the DBR version in the cluster configuration.
  2. Review the Databricks Runtime release notes for EOS dates.
  3. Identify which DBR versions are currently supported.
  4. Check whether notebooks or jobs have DBR version dependencies.

Recommended fix

Update the cluster configuration to use a supported Databricks Runtime version. Review compatibility requirements for libraries and code before deploying to production. Contact Databricks support if you need assistance with DBR migration.

INSTANCE_POOL_MAX_CAPACITY_REACHED

The instance pool has reached its configured maximum capacity limit and cannot provide additional instances.

Example error message

Instance pool is full, please consider increasing the pool size

Troubleshooting steps

  1. Check the instance pool configuration for the maximum capacity setting.
  2. Review how many instances are currently in use from the pool.
  3. Identify which clusters are using the pool.
  4. Check whether there are idle instances that can be freed.

Recommended fix

Increase the instance pool maximum capacity, create additional instance pools to distribute load, terminate idle clusters using the pool, or configure clusters to use different pools. Review pool sizing based on concurrent workload requirements.
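Pool sizing for concurrent workloads comes down to simple arithmetic. This sketch assumes each cluster draws both its driver and its workers from the same pool:

```python
def required_pool_capacity(max_workers_per_cluster):
    """Each cluster needs its maximum worker count plus one driver node,
    assuming drivers are drawn from the same pool as workers."""
    return sum(n + 1 for n in max_workers_per_cluster)

# Three clusters autoscaling up to 8, 4, and 2 workers respectively:
print(required_pool_capacity([8, 4, 2]))  # → 17
```

If the result exceeds the pool's maximum capacity, either raise the capacity or move some clusters to another pool.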

INSTANCE_UNREACHABLE_DUE_TO_MISCONFIG

Instances are unreachable due to network misconfiguration, firewall rules, or connectivity issues.

Example error message

Bootstrap completes in the VM but control plane failed to reach the node. Please review your network configuration or firewall settings to allow Databricks to reach the node.

Troubleshooting steps

  1. Review firewall rules and network security settings for required inbound ports.
  2. Test connectivity from the control plane to the instance network.
  3. Check for asymmetric routing issues.
  4. Review firewall logs for dropped connections.
  5. Verify that instances have the correct security group assignments.

Recommended fix

Ensure security groups or NSGs allow required inbound traffic from the Databricks control plane. Verify that route tables enable bidirectional communication. Contact Databricks support for assistance with network connectivity troubleshooting.

INVALID_ARGUMENT

Invalid configuration parameters, missing secrets, incorrect permissions, or misconfigured cluster settings prevented the cluster from starting.

Example error message

com.databricks.backend.manager.secret.SecretPermissionDeniedException: User does not have permission with scope: [REDACTED] and key: [REDACTED]

Troubleshooting steps

  1. Review the error message to identify the specific invalid parameter.
  2. For secret errors, verify the secret scope and key exist using the Databricks Secrets API.
  3. Check user or service principal permissions for accessing secrets.
  4. Review the cluster configuration for syntax errors.
  5. Check init scripts for configuration errors.

Recommended fix

Correct the invalid parameter based on the error message. For secrets, verify scope and key existence, check permissions, and ensure network connectivity to secret providers. Validate all cluster configuration against the documentation. Contact Databricks support if the configuration appears correct.
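Step 2's scope check can be scripted against the Secrets API. The following is a minimal stdlib-only sketch; the workspace host and access token are values you supply:

```python
import json
import urllib.request

def scopes_request(host: str, token: str) -> urllib.request.Request:
    """Build the Secrets API call used to confirm a scope exists."""
    return urllib.request.Request(
        f"{host}/api/2.0/secrets/scopes/list",
        headers={"Authorization": f"Bearer {token}"},
    )

def list_secret_scopes(host: str, token: str) -> list:
    """Send the request; requires network access to the workspace."""
    with urllib.request.urlopen(scopes_request(host, token)) as resp:
        return [s["name"] for s in json.load(resp).get("scopes", [])]
```

If the scope referenced by the cluster is missing from the returned list, the error is a configuration problem rather than a permissions one.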

NETWORK_CHECK_CONTROL_PLANE_FAILURE

A pre-bootstrap network health check failed when attempting to reach the Databricks control plane.

Example error message

Instance failed network health check before bootstrapping with fatal error: X_NHC_CONTROL_PLANE_UNREACHABLE
1 failed component(s): control_plane
Retryable: true

Troubleshooting steps

  1. Review cluster event logs for specific connection failure details.
  2. Test control plane connectivity from a VM in the same network.
  3. Check whether a firewall is intercepting or blocking traffic.

Recommended fix

Verify that security group or NSG rules allow outbound traffic to the Databricks control plane. If you use user-defined routes (UDR) with a firewall, ensure that traffic to Databricks service tags is routed to the internet. Contact Databricks support if the network configuration has been verified as correct.

NETWORK_CONFIGURATION_FAILURE

A network configuration error is preventing proper VM or cluster network setup.

Troubleshooting steps

  1. Review firewall and security group or NSG rules.
  2. Check route tables and routing configuration.
  3. Verify subnet configuration.
  4. Check for IP address conflicts.
  5. Verify DNS settings.

Recommended fix

Correct the network configuration based on the specific error. Ensure security group or NSG rules allow required traffic, verify that subnet CIDR ranges don't overlap, check that route tables are properly configured, and ensure DNS is functional. Contact Databricks support for network configuration review.
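Overlapping CIDR ranges (the second check above) can be detected with the standard library; the two subnets below are hypothetical:

```python
import ipaddress

# The /22 spans 10.0.0.0-10.0.3.255, so it overlaps the /24 carved from it.
a = ipaddress.ip_network("10.0.0.0/22")
b = ipaddress.ip_network("10.0.2.0/24")
print(a.overlaps(b))  # → True
```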

REQUEST_THROTTLED

API requests to the cloud provider are being throttled due to rate limiting.

Example error message

TEMPORARILY_UNAVAILABLE: Too many requests from workspace [REDACTED]

Troubleshooting steps

  1. Check whether multiple clusters are launching simultaneously.
  2. Review API request rate limits for your account.
  3. Identify whether other services are making concurrent API calls.
  4. Check for automated systems making frequent requests.

Recommended fix

Reduce concurrent cluster launches, request an API rate limit increase from your cloud provider, implement exponential backoff in automation scripts, or stagger cluster launch times.
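Exponential backoff in automation scripts can be as small as the sketch below; in real code you would catch the provider's specific throttling exception rather than a bare `Exception`:

```python
import random
import time

def call_with_backoff(call, max_retries=5, base=1.0, cap=60.0):
    """Retry a throttled call, doubling the delay each attempt with jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch only the throttling error type
            if attempt == max_retries - 1:
                raise
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
            time.sleep(delay)
```

The jitter factor spreads retries from concurrent callers so they don't all hit the rate limit again at the same instant.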

SPOT_INSTANCE_TERMINATION

Spot or preemptible instances were terminated by the cloud provider due to capacity needs or pricing changes.

Example error message

Server.SpotInstanceTermination: Spot instance termination

Troubleshooting steps

  1. Check the cluster event logs for the termination timestamp.
  2. Review spot pricing history in your region.
  3. Identify whether terminations occur at specific times.
  4. Check whether multiple instances were terminated simultaneously.

Recommended fix

Switch to on-demand instances for production workloads, implement job retry logic to handle interruptions, or use a mix of on-demand and spot instances. Spot instances are best for fault-tolerant workloads.
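A mixed on-demand/spot setup can be declared through the cluster's `aws_attributes`. The values below are illustrative, not a recommendation for any particular workload:

```python
# Hypothetical aws_attributes block for a cluster mixing on-demand and spot.
aws_attributes = {
    "first_on_demand": 1,                  # keep the driver on an on-demand node
    "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if spot is unavailable
    "spot_bid_price_percent": 100,         # bid up to the on-demand price
}
print(aws_attributes["availability"])
```

Keeping the first node on-demand protects the driver, so a spot reclaim interrupts only worker tasks, which Spark can rerun.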

STORAGE_DOWNLOAD_FAILURE_SLOW

Artifact downloads from Databricks storage are failing or running too slowly due to network connectivity, firewall, or DNS issues.

Example error message

Instance bootstrap failed command: Command_UpdateWorker
Failure message: Trying DNS probe for: https://[REDACTED].blob.core.windows.net/update/worker-artifacts/...

Troubleshooting steps

  1. Check firewall rules for Databricks storage endpoints.
  2. Verify DNS resolution for storage URLs.
  3. Test download speed from a VM in the same network.
  4. Review network bandwidth utilization.
  5. Check for proxy or network inspection devices.
  6. Verify routes to storage endpoints.

Recommended fix

Ensure firewall rules allow access to Databricks storage endpoints.

Configure VPC endpoints for S3 to avoid routing artifact downloads through the public internet.

Review and tune any network inspection devices in the path. Contact Databricks support if connectivity to the storage endpoints is verified but downloads still fail.

WORKSPACE_CONFIGURATION_ERROR

Workspace-level misconfiguration is preventing cluster launch, including issues with IAM roles or service principal permissions.

Example error message

User: arn:aws:iam::[REDACTED]:user/ConsolidatedManagerIAMUser is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::[REDACTED]:role/databricks-workspace-stack-role

Troubleshooting steps

  1. Review recent changes to workspace configuration.
  2. Check the cloud provider console for policy or permission changes.
  3. Verify the cross-account IAM role trust relationship configuration and instance profile permissions to assume required roles.

Recommended fix

Verify IAM role trust relationships and instance profile permissions. Review workspace security configuration.

Contact Databricks support if the workspace configuration appears correct or if the cross-account role setup needs verification.
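For reference, a cross-account trust relationship generally takes the shape below, written here as a Python dict for readability. The principal account ID and external ID are placeholders you must replace with your deployment's actual values:

```python
import json

# Illustrative trust policy for the cross-account role; the account ID and
# external ID are placeholders, not real Databricks values.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::<databricks-account-id>:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "<your-external-id>"}},
    }],
}
print(json.dumps(trust_policy, indent=2))
```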