Troubleshooting creating workspaces

Overview

The following sections describe configuration errors during workspace creation and how to fix them. Most issues apply to workspace creation using either the account console or the Account API, with exceptions as indicated.

General errors

Maximum number of VPCs

If you get an error message that mentions the maximum number of VPCs, submit a service limit increase request for the number of VPCs allowed in the region. This error typically happens only if you are using a Databricks-managed VPC, not a customer-managed VPC.

Maximum number of VPC endpoints

If you get an error message that mentions a maximum number of VPC endpoints, submit a service limit increase request for the number of Gateway VPC Endpoints allowed in the region. This error typically happens only if you are using a Databricks-managed VPC, not a customer-managed VPC.

Maximum number of addresses

If you get an error message that mentions a maximum number of addresses, submit a service limit increase request for VPC Elastic IP Addresses allowed in the region. This error typically happens only if you are using a Databricks-managed VPC, not a customer-managed VPC.

Not authorized to perform this operation

If you get an error that you are not authorized to perform this operation, check that your IAM role has all of the necessary policies, as defined in the IAM role article.

Storage configuration error messages

Malformed request: Failed storage configuration validation checks

If you get an error message that mentions failed storage configuration validation checks, your S3 bucket permissions are not properly set up. Follow the steps in the article Create an S3 bucket for workspace deployment to ensure that the S3 bucket permissions are correct.
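As a quick local sanity check before re-running validation, you can sketch the general shape of the bucket policy the storage configuration expects. This is an illustrative sketch only: the principal account ID (414351767826 is the Databricks AWS account commonly shown in the documentation) and the exact action list must be confirmed against the article Create an S3 bucket for workspace deployment.

```python
import json

# Sketch of the S3 bucket policy shape expected by storage configuration
# validation. The Databricks principal account ID and the action list are
# assumptions -- confirm them against the linked article.
def make_bucket_policy(bucket_name: str, databricks_account: str = "414351767826") -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Grant Databricks Access",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{databricks_account}:root"},
                "Action": [
                    "s3:GetObject",
                    "s3:GetBucketLocation",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }

policy = make_bucket_policy("my-workspace-root-bucket")
print(json.dumps(policy, indent=2))
```

Compare the generated statement against the policy attached to your bucket; a missing action or a Resource entry without the `/*` object path is a common cause of failed validation checks.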

Credentials configuration error messages

Malformed request: Failed credential configuration validation checks

The list of permission checks in the error message indicates the likely cause of the problem.

  • If the credential configuration validation fails fewer than ten permission checks, it’s likely that your IAM policy is missing those specific permissions. Copy the correct policy from the article Create an IAM role for workspace deployment.

  • If the credential configuration validation fails ten or more checks, it’s more likely that the trust relationship of the IAM role is incorrectly set up. Verify that the trust relationship of your customer role is set up properly according to instructions in the article Create an IAM role for workspace deployment.
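The rule of thumb above can be sketched as a small triage helper. The message parsing (counting `ec2:` permission names in the error text) and the threshold of ten are illustrative only, not an official error-message contract.

```python
import re

# Illustrative triage of a failed credential configuration validation:
# a few failed permission checks usually mean a missing policy statement,
# while many failures usually mean a broken trust relationship. The regex
# and threshold mirror the guidance above and are assumptions.
def diagnose_credential_failure(error_message: str) -> str:
    failed_checks = re.findall(r"ec2:\w+", error_message)
    if len(failed_checks) >= 10:
        return "Check the IAM role trust relationship"
    return "Check for missing permissions in the IAM policy"

msg = "Failed checks: ec2:RunInstances, ec2:CreateVolume"
print(diagnose_credential_failure(msg))  # Check for missing permissions in the IAM policy
```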

If both your policy and trust relationship appear to be correct, also check the following:

  • Confirm that you include the correct role ARN in the credentials object.

  • Confirm whether you have organization-level service control policies (SCPs) that deny the AssumeRole action or deny EC2/VPC access. If you are unsure, ask your AWS administrator about SCPs.
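For the first check, a quick format test on the role ARN catches copy-paste mistakes such as pasting a user ARN or a truncated account ID. The pattern below covers the standard `arn:aws:iam::<12-digit-account>:role/<name>` form; other partitions such as `aws-us-gov` would need a variant.

```python
import re

# Sanity-check the role ARN supplied in the credentials object.
# This only validates the standard aws partition format; it does not
# verify that the role exists or that its trust relationship is correct.
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")

def is_valid_role_arn(arn: str) -> bool:
    return ROLE_ARN_RE.match(arn) is not None

print(is_valid_role_arn("arn:aws:iam::123456789012:role/databricks-cross-account"))  # True
print(is_valid_role_arn("arn:aws:iam::123456789012:user/someone"))  # False
```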

Network configuration

Subnet is already in use by another network

A subnet in use error generally looks like the following:

MALFORMED_REQUEST: Malformed parameters: subnet_id subnet-xxxxxxxx1 is already used by another Network, subnet_id subnet-xxxxxxxx2 is already used by another Network.

This means that you have a Databricks network configuration that uses these same subnets. To resolve, do one of the following:

  • Delete the previous configuration. If you are using the Account API, use the Delete network configuration API. You can also use the account console to delete the configuration.

  • If the previous configuration is not in use, you can reuse it for your new workspace.

  • If that network configuration is already in use by a running workspace, create new subnets and a new network configuration for your new workspace.

Note that if a previous attempt at workspace creation failed, the related configuration components are not automatically deleted.
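To find which network configuration to clean up, you first need the conflicting subnet IDs. A small sketch that pulls them out of the error message, assuming the message format shown above:

```python
import re

# Extract the conflicting subnet IDs from a "subnet ... is already used by
# another Network" error so you know which subnets to replace, or which
# existing network configuration to locate and delete.
def conflicting_subnets(error_message: str) -> list[str]:
    return re.findall(r"subnet_id (subnet-[0-9a-z]+) is already used", error_message)

msg = ("MALFORMED_REQUEST: Malformed parameters: "
       "subnet_id subnet-xxxxxxxx1 is already used by another Network, "
       "subnet_id subnet-xxxxxxxx2 is already used by another Network.")
print(conflicting_subnets(msg))  # ['subnet-xxxxxxxx1', 'subnet-xxxxxxxx2']
```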

No network configuration errors during setup, but errors appear during workspace creation

A network configuration might show errors after you attempt to deploy a workspace even though it showed no errors when you set it up. This is because Databricks performs only basic validations when creating the network object, such as checking for unique subnets, unique security groups, and missing fields.

Most meaningful network configuration validation happens only after you attempt to create a new workspace with your new network configuration. If there were errors during workspace deployment, look closely at the network validation error message for details.

A workspace seems to work but its network configuration has status WARNED

Make sure that you can start a cluster, run a data job, and that you don’t have DBFS_DOWN or METASTORE_DOWN showing in your Compute event logs. If there are no such errors in the cluster event log, the WARNED status is not necessarily a problem.

For a new workspace, Databricks attempts to check a number of things. If you do not use simple routing such as Workspace subnets → NAT Gateway → Internet Gateway, Databricks cannot verify that your network is correct. In such cases, Databricks displays a warning on the network configuration.

Check for subnet route table errors

In your cluster events log, you may see errors like:

subnet: Route Table with ID rtb-xxxxxxxx used for subnet with ID subnet-yyyyyyyyy is missing default route to direct all traffic to the NAT gateway nat-zzzzzzzzzzz.

This error indicates a real problem if you are deploying a simple Databricks workspace configuration.

If you manage your own egress, such as routing through a firewall (optionally via a Transit Gateway in a hub-and-spoke design), this error is not necessarily meaningful.

Another potential reason for this error is that you registered a NAT subnet as a Databricks cluster subnet. Remove the NAT subnet from the list of Databricks workspace subnets and re-create the workspace.
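The route-table check can be sketched locally. The route dictionaries below mirror the shape returned by the EC2 DescribeRouteTables API; the check itself (a `0.0.0.0/0` default route pointing at a NAT gateway) is the simple-routing case that Databricks validates.

```python
# Sketch of the check behind the error above: a workspace subnet's route
# table should have a default route (0.0.0.0/0) that targets a NAT gateway.
# Route dicts follow the EC2 DescribeRouteTables response shape.
def has_default_nat_route(routes: list[dict]) -> bool:
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("NatGatewayId", "").startswith("nat-")
        for route in routes
    )

workspace_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc123"},
]
print(has_default_nat_route(workspace_routes))  # True
```

Note that with a firewall or Transit Gateway egress design, the default route targets a transit gateway or network interface instead of a NAT gateway, which is why this check can produce a false-positive warning for such setups.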

Do not add your NAT subnet to the list of subnets in a network configuration

Do not add your NAT subnet to the list of Databricks workspace subnets. The NAT subnet is for the NAT gateway and is not intended as a subnet for deployment of Databricks cluster nodes. When creating a network configuration, only list the two subnets to use for Databricks nodes.

Security Group could not be updated with the latest rules

You may see a cluster log error such as:

Security Group with ID sg-xxxx could not be updated with latest Security Group Rules

Update the IAM role to match the policy in the IAM role article. In some cases, the Resource element for AuthorizeSecurityGroupEgress and similar actions contains comma-separated values in a single resource string. Split these into separate resources:

Correct

"Action": [
    "ec2:AuthorizeSecurityGroupEgress",
    "ec2:AuthorizeSecurityGroupIngress",
    "ec2:RevokeSecurityGroupEgress",
    "ec2:RevokeSecurityGroupIngress"
],
"Resource": [
    "arn:aws:ec2:us-east-1:444:security-group/sg-xxxx",
    "arn:aws:ec2:us-east-1:444:security-group/sg-yyyy",
    "arn:aws:ec2:us-east-1:444:security-group/sg-zzzz"
],

Incorrect

"Resource": ["arn:aws:ec2:us-east-1:444:security-group/sg-xxxx,sg-yyyy,sg-zzzz"],

If you have network setup issues, consider using a Databricks-managed VPC

If you are having network setup issues, you could choose to create the workspace with a Databricks-managed VPC rather than a customer-managed VPC.

Important

You must choose whether to provide a customer-managed VPC when creating your workspace. You cannot change this setting after you successfully create the workspace.

To switch a failed workspace to use a Databricks-managed VPC, you must also use a different cross-account IAM role:

  1. Go to the cross-account IAM role article.

  2. Select and copy the policy labeled Databricks VPC.

  3. Use that policy when you create the workspace using the account console or the Account API:

    • In the account console, in the network configuration picker, select Databricks-managed.

    • For the Account API, be careful not to include the network_id element, for example:

      {
        "workspace_name": "<workspace-name>",
        "deployment_name": "<deployment-name>",
        "aws_region": "<aws-region>",
        "credentials_id": "<credentials-id>",
        "storage_configuration_id": "<storage-configuration-id>"
      }
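The key point of the request body above is what it omits. A minimal sketch that builds the payload and makes the omission explicit (the endpoint path shown in the comment follows the Account API convention; authentication is omitted):

```python
import json

# Build a workspace-creation request body for a Databricks-managed VPC.
# The important detail is that no network_id field is present; including
# one would make Databricks treat the VPC as customer-managed.
def managed_vpc_payload(name, deployment, region, creds_id, storage_id):
    payload = {
        "workspace_name": name,
        "deployment_name": deployment,
        "aws_region": region,
        "credentials_id": creds_id,
        "storage_configuration_id": storage_id,
    }
    assert "network_id" not in payload  # Databricks-managed VPC: omit network_id
    return payload

body = managed_vpc_payload(
    "demo", "demo-deploy", "us-east-1",
    "<credentials-id>", "<storage-configuration-id>",
)
print(json.dumps(body, indent=2))
# POST this body to the Account API workspaces endpoint, for example:
# https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/workspaces
```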
      

Diagnose VPC network issues with AWS Reachability Analyzer

AWS Reachability Analyzer is a configuration analysis tool that you can use to test connectivity between a source resource and a destination resource in your VPC. You can find it in the AWS Console as VPC Reachability Analyzer.

With Reachability Analyzer, you can test from an EC2 instance in the Databricks private subnet without logging in to it. Add the source as your EC2 instance and the destination as the Databricks control plane IP address and port, then test the connectivity to find the blocking component. For more information, see What is Reachability Analyzer.

Account API specific error messages

The following errors may be returned from an Account API request to create a workspace.

Malformed request: Invalid <config> in the HTTP request body

The JSON of your request body is incorrectly formatted. In this error message, <config> refers to credentials, storage configurations, or networks. Confirm that all special characters in the request are escaped properly, or use a REST API client application such as Postman.

Malformed request: Invalid <config> in body

The JSON of your request body is incorrectly formatted. In this error message, <config> refers to credentials, storage configurations, or networks. Confirm that all special characters in the request are escaped properly, or use a REST API client application such as Postman.
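Rather than escaping special characters by hand, let a JSON library build the request body. The field names below follow the Account API credential configuration request; the role ARN is a placeholder.

```python
import json

# Build the request body with json.dumps so that quotes and other special
# characters are escaped automatically, instead of hand-writing the JSON.
credentials = {
    "credentials_name": 'role with "quotes" in the name',
    "aws_credentials": {
        "sts_role": {"role_arn": "arn:aws:iam::123456789012:role/my-role"}
    },
}
body = json.dumps(credentials)
print(body)
assert json.loads(body) == credentials  # round-trips cleanly
```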