Troubleshooting creating workspaces
Overview
The following sections describe configuration errors that can occur during workspace creation and how to fix them. Most issues apply to workspace creation using both the account console and the Account API, with exceptions as indicated.
Note
This article describes the process for accounts on the E2 version of the Databricks platform. All new Databricks accounts and most existing accounts are now E2. If you are unsure which account type you have, contact your Databricks representative.
Important
This article mentions the term data plane, which is the compute layer of the Databricks platform. In the context of this article, data plane refers to the Classic data plane in your AWS account. By contrast, the serverless data plane that supports serverless SQL warehouses runs in the Databricks AWS account. To learn more, see Serverless compute.
General errors
Maximum number of VPCs
If you get an error message that mentions the maximum number of VPCs, submit a service limit increase request for the number of VPCs allowed in the region. This error typically happens only if you are using a Databricks-managed VPC, not a customer-managed VPC.
Maximum number of VPC endpoints
If you get an error message that mentions a maximum number of VPC endpoints, submit a service limit increase request for the number of Gateway VPC Endpoints allowed in the region. This error typically happens only if you are using a Databricks-managed VPC, not a customer-managed VPC.
Maximum number of addresses
If you get an error message that mentions a maximum number of addresses, submit a service limit increase request for VPC Elastic IP Addresses allowed in the region. This error typically happens only if you are using a Databricks-managed VPC, not a customer-managed VPC.
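For any of these limit errors, it can help to compare your current usage against your quotas before filing the request. The following is a minimal sketch using the AWS CLI; the region is a placeholder, and listing quotas assumes the Service Quotas API is available in your account:
# Count existing VPCs, VPC endpoints, and Elastic IP addresses in the region
aws ec2 describe-vpcs --region eu-west-1 --query 'length(Vpcs)'
aws ec2 describe-vpc-endpoints --region eu-west-1 --query 'length(VpcEndpoints)'
aws ec2 describe-addresses --region eu-west-1 --query 'length(Addresses)'
# List the current VPC-related quotas for the region
aws service-quotas list-service-quotas --service-code vpc --region eu-west-1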
Not authorized to perform this operation
If you get an error that you are not authorized to perform this operation, check that your IAM role has all of the necessary policies, as defined in the IAM role article.
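To review which policies are actually attached to the role, you can read them back with the AWS CLI and compare them to the policy in that article. A sketch only; the role and policy names are placeholders:
# List managed and inline policies attached to the cross-account role
aws iam list-attached-role-policies --role-name <role-name>
aws iam list-role-policies --role-name <role-name>
# Show the contents of an inline policy
aws iam get-role-policy --role-name <role-name> --policy-name <policy-name>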
Storage configuration error messages
Malformed request: Failed storage configuration validation checks
If you get an error message that mentions failed storage configuration validation checks, your S3 bucket permissions are not properly set up. Follow the steps in the article Create an S3 bucket for workspace deployment to ensure that the S3 bucket permissions are correct.
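To confirm what is currently applied to the bucket, you can view the bucket policy with the AWS CLI and compare it to the policy in that article. A sketch; the bucket name is a placeholder:
# Show the bucket policy attached to the workspace root storage bucket
aws s3api get-bucket-policy --bucket <bucket-name> --output text
# Confirm the bucket's region matches the workspace region
aws s3api get-bucket-location --bucket <bucket-name>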
Credentials configuration error messages
Malformed request: Failed credential configuration validation checks
The list of permission checks in the error message indicates the likely cause of the problem.
If the credential configuration validation fails fewer than ten permission checks, it’s likely that your IAM policy is missing those specific permissions. Copy the correct policy from the article Create an IAM role for workspace deployment.
If the credential configuration validation fails ten or more checks, it’s more likely that the trust relationship of the IAM role is incorrectly set up. Verify that the trust relationship of your customer role is set up properly according to instructions in the article Create an IAM role for workspace deployment.
If both your policy and trust relationship appear to be correct, also check the following:
Confirm that you include the correct role ARN in the credentials object.
Confirm whether you have organization-level service control policies (SCPs) that deny the AssumeRole action or deny EC2/VPC access. If you are unsure, ask your AWS administrator about SCPs.
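One way to verify both the role ARN and the trust relationship at the same time is to read them back with the AWS CLI. A minimal sketch; the role name is a placeholder:
# Show the role ARN and the trust relationship (assume role policy document)
aws iam get-role --role-name <role-name> \
  --query '{Arn: Role.Arn, TrustPolicy: Role.AssumeRolePolicyDocument}'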
Network configuration
Subnet is already in use by another network
A subnet in use error generally looks like the following:
MALFORMED_REQUEST: Malformed parameters: subnet_id subnet-xxxxxxxx1 is already used by another Network, subnet_id subnet-xxxxxxxx2 is already used by another Network.
This means that you have a Databricks network configuration that uses these same subnets. To resolve, do one of the following:
Delete the previous configuration. If you are using the Account API, use the Delete network configuration API. You can also use the account console to delete the configuration.
If that previous configuration is not in use, you can use that previous configuration for your new workspace.
If that network configuration is already in use by a running workspace, create new subnets and a new network configuration for your new workspace.
Note that if a previous attempt at workspace creation failed, the related configuration components are not automatically deleted.
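If you use the Account API to delete the unused configuration (the first option above), the request might look like the following sketch. It assumes the standard accounts endpoint and basic authentication with an account admin user; the IDs are placeholders:
# Delete an unused Databricks network configuration by its ID
curl -X DELETE -u <account-admin-email>:<password> \
  https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/networks/<network-id>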
No network configuration errors during setup, but errors appear during workspace creation
A network configuration might show errors after you attempt to deploy a workspace even though it showed no errors when you set it up. This is because Databricks performs only basic validations when creating the network object. For example, it checks for unique subnets, unique security groups, and missing fields.
Most meaningful network configuration validation happens only after you attempt to create a new workspace with your new network configuration. If there were errors during workspace deployment, look closely at the network validation error message for details.
A new cluster does not respond or “data plane network is misconfigured” event log error
After what seems like a successful workspace deployment, you might notice that your first test cluster does not respond. After approximately 20-30 minutes, check your cluster event log. You might see an error message similar to:
The data plane network is misconfigured. Please verify that the network for your data plane is configured correctly. Error message: Node daemon ping timeout in 600000 ms ...
This message indicates that routing or firewall configuration is incorrect. Databricks requested EC2 instances for a new cluster, but encountered a long delay waiting for the EC2 instances to bootstrap and connect to the control plane. The cluster manager terminates the instances and reports this error.
Your network configuration must allow cluster node instances to successfully connect to the Databricks control plane. For a faster troubleshooting technique than using a cluster, you can deploy an EC2 instance into one of the workspace subnets and run typical network troubleshooting tools such as nc, ping, telnet, and traceroute. The Relay CNAME for each region is mentioned in the customer-managed VPC article. For artifact storage, ensure that there is a successful networking path to S3.
For access domains and IPs by region, see Required data plane addresses. For regional endpoints, see Configure regional endpoints (recommended). The following example uses the AWS region eu-west-1:
# Verify access to the web application
nc -zv ireland.cloud.databricks.com 443
# Verify access to the secure cluster connectivity relay
nc -zv tunnel.eu-west-1.cloud.databricks.com 443
# Verify S3 global and regional access
nc -zv s3.amazonaws.com 443
nc -zv s3.eu-west-1.amazonaws.com 443
# Verify STS global and regional access
nc -zv sts.amazonaws.com 443
nc -zv sts.eu-west-1.amazonaws.com 443
# Verify regional Kinesis access
nc -zv kinesis.eu-west-1.amazonaws.com 443
If these commands all succeed, your networking may be configured correctly, but a firewall can still cause problems. A firewall with deep packet inspection, SSL inspection, or similar features can cause Databricks commands to fail. Using an EC2 instance in a Databricks workspace subnet, try the following:
curl -X GET -H 'Authorization: Bearer <token>' \
https://<workspace-name>.cloud.databricks.com/api/2.0/clusters/spark-versions
Replace <token> with your own personal access token and use the correct URL for your workspace. See the Token management API.
If this request fails, try the -k option with your request to disable SSL verification. If the request works with the -k option, then the firewall is causing an issue with SSL certificates.
Look at the SSL certificates using the following command, replacing the domain name with the control plane web application domain for your region:
openssl s_client -showcerts -connect oregon.cloud.databricks.com:443
This command shows the return code and the Databricks certificates. If it returns an error, it’s a sign that your firewall is misconfigured and must be fixed.
Note that SSL issues occur above the networking layer, so viewing traffic at the firewall will not reveal them: source and destination requests appear to work as expected even though Databricks commands fail.
A workspace seems to work but its network configuration has status WARNED
Make sure that you can start a cluster, run a data job, and that you don’t have DBFS_DOWN or METASTORE_DOWN showing in your cluster event logs. If there are no such errors in the cluster event log, the WARNED status is not necessarily a problem.
For a new workspace, Databricks attempts a number of network checks. If you do not use a simple routing setup such as workspace subnets → NAT gateway → internet gateway, Databricks cannot verify that your network is correct. In such cases, Databricks displays a warning on the network configuration.
Check for subnet route table errors
In your cluster event log, you may see errors like:
subnet: Route Table with ID rtb-xxxxxxxx used for subnet with ID subnet-yyyyyyyyy is missing default route to direct all traffic to the NAT gateway nat-zzzzzzzzzzz.
This error could actually indicate a problem if you are trying to deploy a simple Databricks workspace configuration.
If you do your own egress setup, like routing through a firewall (optionally via a Transit Gateway in a hub-spoke fashion), this error will not necessarily be meaningful.
Another potential cause of this error is that you registered a NAT subnet as a Databricks cluster subnet. Remove the NAT subnet from the list of Databricks workspace subnets and re-create the workspace.
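To check whether a workspace subnet's route table includes the default route the error refers to, you can look it up with the AWS CLI. A sketch, with placeholder IDs:
# Find the routes in the route table associated with the subnet
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-yyyyyyyyy \
  --query 'RouteTables[].Routes'
# In a simple setup, the output includes a 0.0.0.0/0 route whose target is the NAT gateway (nat-...)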
Do not add your NAT subnet to the list of subnets in a network configuration
Do not add your NAT subnet to the list of Databricks workspace subnets. The NAT subnet is for the NAT gateway and is not intended as a subnet for deployment of Databricks cluster nodes. When creating a network configuration, only list the two subnets to use for Databricks nodes.
Problems using your metastore or cluster event log includes METASTORE_DOWN events
If your workspace seems to be up and you can set up clusters, but you have METASTORE_DOWN events in your cluster event logs, or if your metastore does not seem to work, confirm whether you use a web application firewall (WAF) or a proxy such as Squid. Cluster members must connect to several services that do not work through a WAF.
Cluster start error: failed to launch Spark container on instance
You might see a cluster log error such as:
Cluster start error: failed to launch spark container on instance ...
Exception: Could not add container for ... with address ....
Timed out with exception after 1 attempts
This cluster log error usually means that the instance cannot use STS to access the root S3 bucket. It typically occurs when you are implementing exfiltration protection, using VPC endpoints to lock down communication, or adding a firewall.
To fix, do one of the following:
Change the firewall to allow the global STS endpoint (sts.amazonaws.com) to pass, as documented in the VPC requirements docs.
Use a VPC endpoint to set up the regional endpoint (see the sketch following this list).
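If you choose the VPC endpoint approach, creating an interface endpoint for the regional STS service might look like the following sketch; the region, VPC, subnet, and security group IDs are placeholders:
# Create an interface VPC endpoint for the regional STS service
aws ec2 create-vpc-endpoint \
  --vpc-id <vpc-id> \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.eu-west-1.sts \
  --subnet-ids <subnet-id-1> <subnet-id-2> \
  --security-group-ids <security-group-id> \
  --private-dns-enabled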
To get more information about the error, call the decode-authorization-message AWS CLI command. For details, see the AWS article for decode-authorization-message. The command looks like:
aws sts decode-authorization-message --encoded-message <encoded-message>
You may see this error if you set up a VPC endpoint (VPCE) with a different security group for the STS VPCE than the workspaces. You can either update the security groups to allow resources in each security group to talk to each other or put the STS VPCE in the same security group as the workspace subnets.
Cluster nodes need to use STS to access the root S3 bucket using the customer S3 policy. A network path has to be available to the AWS STS service from Databricks cluster nodes.
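To confirm that cluster nodes can reach STS, you can run a quick check from an EC2 instance in a workspace subnet. A sketch, assuming the eu-west-1 region and, if you use an STS VPC endpoint, private DNS enabled on that endpoint:
# With private DNS enabled, the regional STS name should resolve to private IPs inside your VPC
nslookup sts.eu-west-1.amazonaws.com
# Verify that the endpoint accepts connections on port 443
nc -zv sts.eu-west-1.amazonaws.com 443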
Security Group could not be updated with the latest rules
You may see a cluster log error such as:
Security Group with ID sg-xxxx could not be updated with latest Security Group Rules
Update the IAM role to match the policy in the IAM role article. In some cases, the resource for AuthorizeSecurityGroupEgress and similar actions has comma-separated values. Update these to use separate resources instead of one comma-separated resource:
Correct
"Action": [
"ec2:AuthorizeSecurityGroupEgress",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:RevokeSecurityGroupEgress",
"ec2:RevokeSecurityGroupIngress"
],
"Resource": [
"arn:aws:ec2:us-east-1:444:security-group/sg-xxxx",
"arn:aws:ec2:us-east-1:444:security-group/sg-yyyy",
"arn:aws:ec2:us-east-1:444:security-group/sg-zzzz"
],
Incorrect
"Resource": ["arn:aws:ec2:us-east-1:444:security-group/sg-xxxx,sg-yyyy,sg-zzzz"],
If you have network setup issues, consider using a Databricks-managed VPC
If you are having network setup issues, you could choose to create the workspace with a Databricks-managed VPC rather than a customer-managed VPC.
Important
You must choose whether to provide a customer-managed VPC when creating your workspace. You cannot change this setting after you successfully create the workspace.
To switch a failed workspace to use a Databricks-managed VPC, you must also use a different cross-account IAM role:
Go to the cross-account IAM role article.
Select and copy the policy labeled Databricks VPC.
Use that policy when you create the workspace using the account console or the Account API.
In the account console, in the network configuration picker, select Databricks-managed.
For the Account API, be careful not to include the network_id element, for example:
{
  "workspace_name": "<workspace-name>",
  "deployment_name": "<deployment-name>",
  "aws_region": "<aws-region>",
  "credentials_id": "<credentials-id>",
  "storage_configuration_id": "<storage-configuration-id>"
}
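A workspace-creation request that omits network_id (and therefore uses a Databricks-managed VPC) might look like the following sketch. It assumes the standard accounts endpoint and basic authentication with an account admin user; all values are placeholders:
# Create a workspace with a Databricks-managed VPC (no network_id in the body)
curl -X POST -u <account-admin-email>:<password> \
  https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/workspaces \
  -d '{
    "workspace_name": "<workspace-name>",
    "deployment_name": "<deployment-name>",
    "aws_region": "<aws-region>",
    "credentials_id": "<credentials-id>",
    "storage_configuration_id": "<storage-configuration-id>"
  }'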
Diagnose VPC network issues with AWS Reachability Analyzer
AWS Reachability Analyzer is a configuration analysis tool that you can use to test connectivity between a source resource and a destination resource in your VPC. You can find it in the AWS Console as VPC Reachability Analyzer.
With Reachability Analyzer, you can use a test EC2 instance in a Databricks workspace subnet without logging in to it. Set the source to your EC2 instance and the destination to the Databricks control plane IP address and port. You can then test the connectivity to find the blocking component. For more information, see What is Reachability Analyzer.
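You can also drive Reachability Analyzer from the AWS CLI. The following is a sketch only and assumes an existing EC2 instance in a workspace subnet; the instance ID, destination IP, and returned path ID are placeholders:
# Define a path from the test instance to the control plane address on port 443
aws ec2 create-network-insights-path \
  --source <instance-id> \
  --destination-ip <control-plane-ip> \
  --destination-port 443 \
  --protocol tcp
# Run the analysis for the returned path ID and review the findings
aws ec2 start-network-insights-analysis --network-insights-path-id <network-insights-path-id>
aws ec2 describe-network-insights-analyses --network-insights-path-id <network-insights-path-id>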
Account API specific error messages
The following errors might be returned from an Account API request to create a workspace.
Malformed request: Invalid <config> in the HTTP request body
The JSON of your request body is incorrectly formatted. In this error message, <config> refers to credentials, storage configurations, or networks. Confirm that all special characters are escaped properly in the URL or use a REST API client application such as Postman.
Malformed request: Invalid <config> in body
The JSON of your request body is incorrectly formatted. In this error message, <config> refers to credentials, storage configurations, or networks. Confirm that all special characters are escaped properly in the URL or use a REST API client application such as Postman.