Configure a customer-managed VPC

Important

This feature requires that your account is on the E2 version of the Databricks platform. All new Databricks accounts and most existing accounts are now E2.

Important

  • This article mentions the term compute plane, which is the compute layer of the Databricks platform. In the context of this article, compute plane refers to the classic compute plane in your AWS account. By contrast, the serverless compute plane that supports serverless SQL warehouses runs in your Databricks account. To learn more, see Serverless compute.

  • Previously, Databricks referred to the compute plane as the data plane.

Overview

By default, clusters are created in a single AWS VPC (Virtual Private Cloud) that Databricks creates and configures in your AWS account. You can optionally create your Databricks workspaces in your own VPC, a feature known as customer-managed VPC. You can use a customer-managed VPC to exercise more control over your network configurations to comply with specific cloud security and governance standards your organization may require. To configure your workspace to use AWS PrivateLink for any type of connection, your workspace must use a customer-managed VPC.

A customer-managed VPC is a good solution if you have:

  • Security policies that prevent PaaS providers from creating VPCs in your own AWS account.

  • An approval process to create a new VPC, in which the VPC is configured and secured in a well-documented way by internal information security or cloud engineering teams.

Benefits include:

  • Lower privilege level: You maintain more control of your own AWS account. And you don’t need to grant Databricks as many permissions via cross-account IAM role as you do for a Databricks-managed VPC. For example, there is no need for permission to create VPCs. This limited set of permissions can make it easier to get approval to use Databricks in your platform stack.

  • Simplified network operations: Better network space utilization. You can optionally configure smaller subnets for a workspace than the default /16 CIDR, and there is no need for the complex VPC peering configurations that might be necessary with other solutions.

  • Consolidation of VPCs: Multiple Databricks workspaces can share a single compute plane VPC, which is often preferred for billing and instance management.

  • Limit outgoing connections: By default, the compute plane does not limit outgoing connections from Databricks Runtime workers. For workspaces that are configured to use a customer-managed VPC, you can use an egress firewall or proxy appliance to limit outbound traffic to a list of allowed internal or external data sources.

Customer-managed VPC

To take advantage of a customer-managed VPC, you must specify a VPC when you first create the Databricks workspace. You cannot move an existing workspace with a Databricks-managed VPC to use a customer-managed VPC. You can, however, move an existing workspace with a customer-managed VPC from one VPC to another VPC by updating the workspace configuration’s network configuration object. See Update a running or failed workspace.

To deploy a workspace in your own VPC, you must:

  1. Create the VPC following the requirements enumerated in VPC requirements.

  2. Reference your VPC network configuration with Databricks when you create the workspace.

    You must provide the VPC ID, subnet IDs, and security group ID when you register the VPC with Databricks.
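
    For illustration, here is a minimal sketch, in Python with the requests library, of registering a network configuration through the Account API. The field names follow the Account API networks resource; the account ID, credentials, and AWS resource IDs are placeholders, and your authentication method (for example, OAuth instead of basic auth) may differ.

      import requests

      # Placeholders: supply your own account ID, credentials, and AWS resource IDs.
      ACCOUNT_ID = "<databricks-account-id>"
      BASE_URL = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"

      payload = {
          "network_name": "my-workspace-network",               # any descriptive name
          "vpc_id": "<vpc-id>",                                  # the customer-managed VPC
          "subnet_ids": ["<subnet-id-az1>", "<subnet-id-az2>"],  # one subnet per availability zone
          "security_group_ids": ["<security-group-id>"],         # one to five security groups
      }

      # Assumes basic auth with an account admin; adjust for your auth method.
      resp = requests.post(f"{BASE_URL}/networks", json=payload,
                           auth=("<account-admin-username>", "<password>"))
      resp.raise_for_status()
      print(resp.json()["network_id"])  # reference this ID when you create the workspace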

VPC requirements

Your VPC must meet the requirements described in this section in order to host a Databricks workspace.

VPC region

For a list of AWS regions that support customer-managed VPC, see Databricks clouds and regions.

VPC sizing

You can share one VPC with multiple workspaces in a single AWS account. However, Databricks recommends using unique subnets and security groups for each workspace. Be sure to size your VPC and subnets accordingly. Databricks assigns two IP addresses per node, one for management traffic and one for Apache Spark applications. The total number of instances for each subnet is equal to half the number of IP addresses that are available. Learn more in Subnets.

VPC IP address ranges

Databricks doesn’t limit netmasks for the workspace VPC, but each workspace subnet must have a netmask between /17 and /26. This means that if your workspace has two subnets and both have a netmask of /26, then the netmask for your workspace VPC must be /25 or smaller.

Important

If you have configured secondary CIDR blocks for your VPC, make sure that the subnets for the Databricks workspace are configured with the same VPC CIDR block.

DNS

The VPC must have DNS hostnames and DNS resolution enabled.
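
If you script your VPC setup, both attributes can be enabled with boto3. This is a minimal sketch with a placeholder VPC ID; it assumes your AWS credentials and region are already configured.

import boto3

ec2 = boto3.client("ec2")

# Each VPC attribute must be set in its own call.
ec2.modify_vpc_attribute(VpcId="<vpc-id>", EnableDnsSupport={"Value": True})
ec2.modify_vpc_attribute(VpcId="<vpc-id>", EnableDnsHostnames={"Value": True})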

Subnets

Databricks must have access to at least two subnets for each workspace, with each subnet in a different availability zone. You cannot specify more than one Databricks workspace subnet per Availability Zone in the Create network configuration API call. You can have more than one subnet per availability zone as part of your network setup, but you can choose only one subnet per Availability Zone for the Databricks workspace.

You can choose to share one subnet across multiple workspaces or both subnets across workspaces. For example, you can have two workspaces that share the same VPC. One workspace can use subnets A and B, and another workspace can use subnets A and C. If you plan to share subnets across multiple workspaces, be sure to size your VPC and subnets to be large enough to scale with usage.

Databricks assigns two IP addresses per node, one for management traffic and one for Spark applications. The total number of instances for each subnet is equal to half of the number of IP addresses that are available.

Each subnet must have a netmask between /17 and /26.
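
As a rough sizing sketch: a subnet with netmask /n contains 2^(32-n) addresses, AWS reserves five addresses in every subnet (standard AWS behavior, not a Databricks setting), and Databricks assigns two addresses per node. The following illustrates the resulting node capacity per subnet.

def max_nodes(netmask_bits: int, aws_reserved: int = 5) -> int:
    """Approximate node capacity of one workspace subnet."""
    total_ips = 2 ** (32 - netmask_bits)
    usable_ips = total_ips - aws_reserved  # AWS reserves 5 addresses per subnet
    return usable_ips // 2                 # Databricks assigns 2 IP addresses per node

print(max_nodes(26))  # /26 -> 64 addresses -> 29 nodes
print(max_nodes(17))  # /17 -> 32,768 addresses -> 16,381 nodes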

Additional subnet requirements

  • Subnets must be private.

  • Subnets must have outbound access to the public network using a NAT gateway and internet gateway, or other similar customer-managed appliance infrastructure.

  • The NAT gateway must be set up in its own subnet that routes quad-zero (0.0.0.0/0) traffic to an internet gateway or other customer-managed appliance infrastructure.

Important

A workspace using secure cluster connectivity (the default after September 1, 2020) must have outbound access from the VPC to the public network.

Subnet route table

The route table for workspace subnets must have quad-zero (0.0.0.0/0) traffic that targets the appropriate network device. If the workspace uses secure cluster connectivity (which is the default for new workspaces after September 1, 2020), quad-zero traffic must target a NAT Gateway or your own managed NAT device or proxy appliance.

Important

Databricks requires subnets to add 0.0.0.0/0 to your allow list. This rule must be prioritized. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See (Optional) Configure a firewall and outbound access.

This is a base guideline only. Your configuration requirements may differ. For questions, contact your Databricks account team.
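
As an illustration of the quad-zero route using boto3, assuming you already have the route table associated with the workspace subnets and a NAT gateway (both IDs are placeholders):

import boto3

ec2 = boto3.client("ec2")

# Send all non-local traffic from the workspace subnets through the NAT gateway.
ec2.create_route(
    RouteTableId="<workspace-subnet-route-table-id>",
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="<nat-gateway-id>",
)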

Security groups

A Databricks workspace must have access to at least one AWS security group and no more than five security groups. You can reuse existing security groups rather than create new ones. However, Databricks recommends using unique subnets and security groups for each workspace.

Security groups must have the following rules:

Egress (outbound):

  • Allow all TCP and UDP access to the workspace security group (for internal traffic)

  • Allow TCP access to 0.0.0.0/0 for these ports:

    • 443: for Databricks infrastructure, cloud data sources, and library repositories

    • 3306: for the metastore

    • 6666: for secure cluster connectivity. This is only required if you use PrivateLink.

    • 2443: Supports FIPS encryption. Only required if you enable the compliance security profile.

    • 8443 through 8451: Future extendability. Ensure these ports are open by January 31, 2024.

Ingress (inbound): Required for all workspaces (these can be separate rules or combined into one):

  • Allow TCP on all ports when traffic source uses the same security group

  • Allow UDP on all ports when traffic source uses the same security group
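
The rules above can be scripted. The following boto3 sketch creates one workspace security group with self-referencing ingress rules and the listed egress ports. The group name and IDs are placeholders, and the exact egress ports you need depend on which features (PrivateLink, compliance security profile) you enable.

import boto3

ec2 = boto3.client("ec2")

sg = ec2.create_security_group(
    GroupName="databricks-workspace-sg",  # placeholder name
    Description="Databricks workspace security group",
    VpcId="<vpc-id>",
)
sg_id = sg["GroupId"]
self_ref = [{"GroupId": sg_id}]  # traffic whose source is this same security group

# Ingress: allow all TCP and UDP from the same security group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535, "UserIdGroupPairs": self_ref},
        {"IpProtocol": "udp", "FromPort": 0, "ToPort": 65535, "UserIdGroupPairs": self_ref},
    ],
)

# Egress: a new security group allows all outbound traffic by default; remove that
# rule first if you want to restrict egress to the ports listed above.
ec2.revoke_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
)
egress = [
    {"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535, "UserIdGroupPairs": self_ref},
    {"IpProtocol": "udp", "FromPort": 0, "ToPort": 65535, "UserIdGroupPairs": self_ref},
]
for port in [443, 3306, 6666, 2443] + list(range(8443, 8452)):
    egress.append({"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
                   "IpRanges": [{"CidrIp": "0.0.0.0/0"}]})
ec2.authorize_security_group_egress(GroupId=sg_id, IpPermissions=egress)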

Subnet-level network ACLs

Subnet-level network ACLs must not deny ingress or egress of any traffic. Databricks validates the following rules when creating the workspace:

Egress (outbound):

  • Allow all traffic to the workspace VPC CIDR, for internal traffic

  • Allow TCP access to 0.0.0.0/0 for these ports:

    • 443: for Databricks infrastructure, cloud data sources, and library repositories

    • 3306: for the metastore

    • 6666: only required if you use PrivateLink

Important

If you configure additional ALLOW or DENY rules for outbound traffic, set the rules required by Databricks to the highest priority (the lowest rule numbers), so that they take precedence.

Ingress (inbound):

  • ALLOW ALL from Source 0.0.0.0/0. This rule must be prioritized.

Note

Databricks requires subnet-level network ACLs to add 0.0.0.0/0 to your allow list. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See (Optional) Configure a firewall and outbound access.

Create a VPC

You can create a VPC using various tools. The basic instructions below describe how to create and configure a VPC and related objects using the AWS console. For complete instructions, see the AWS documentation.

Note

These basic instructions might not apply to all organizations. Your configuration requirements may differ. This section does not cover all possible ways to configure NATs, firewalls, or other network infrastructure. If you have questions, contact your Databricks account team before proceeding.

  1. Go to the VPCs page in AWS.

  2. See the region picker in the upper-right. If needed, switch to the region for your workspace.

  3. In the upper-right corner, click the orange button Create VPC.

  4. Click VPC and more.

  5. In the Name tag auto-generation field, type a name for your workspace. Databricks recommends including the region in the name.

  6. Optionally, change the VPC address range.

  7. For public subnets, click 2. Those subnets aren’t used directly by your Databricks workspace, but they are required to enable NATs in this editor.

  8. For private subnets, click 2 for the minimum for workspace subnets. You can add more if desired.

    Your Databricks workspace needs at least two private subnets. To resize them, click Customize subnet CIDR blocks.

  9. For NAT gateways, click In 1 AZ.

  10. Ensure the following fields at the bottom are enabled: Enable DNS hostnames and Enable DNS resolution.

  11. Click Create VPC.

  12. When viewing your new VPC, click on the left navigation items to update related settings on the VPC. To make it easier to find related objects, in the Filter by VPC field, select your new VPC.

  13. Click Subnets and select what AWS calls the private subnets, labeled 1 and 2, which you will use as your main workspace subnets. Modify them as specified in VPC requirements.

    If you created an extra private subnet for use with PrivateLink, configure private subnet 3 as specified in Enable AWS PrivateLink.

  14. Click Security groups and modify the security group as specified in Security groups.

    If you will use back-end PrivateLink connectivity, create an additional security group with inbound and outbound rules as specified in the PrivateLink article in the section Step 1: Configure AWS network objects.

  15. Click Network ACLs and modify the network ACLs as specified in Subnet-level network ACLs.

  16. Choose whether to perform the optional configurations that are specified later in this article.

  17. Register your VPC with Databricks to create a network configuration, using either the account console or the Account API.

Updating CIDRs

At a later time, you might need to update subnet CIDRs that overlap with the original subnets.

To update the CIDRs and other workspace objects:

  1. Terminate all running clusters (and other compute resources) that are running in the subnets that need to be updated.

  2. Using the AWS console, delete the subnets to update.

  3. Re-create the subnets with updated CIDR ranges.

  4. Update the route table association for the two new subnets. You can reuse the route tables that the existing subnets use in each availability zone.

    Important

    If you skip this step or misconfigure the route tables, clusters may fail to launch.

  5. Create a new network configuration object with the new subnets.

  6. Update the workspace to use this newly created network configuration object.
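
    A minimal sketch of this step through the Account API, assuming step 5 already returned a new network configuration ID (all IDs and credentials are placeholders, and your authentication method may differ):

      import requests

      ACCOUNT_ID = "<databricks-account-id>"
      WORKSPACE_ID = "<workspace-id>"

      # Point the workspace at the network configuration created in step 5.
      resp = requests.patch(
          f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}/workspaces/{WORKSPACE_ID}",
          json={"network_id": "<new-network-id>"},
          auth=("<account-admin-username>", "<password>"),
      )
      resp.raise_for_status()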

(Optional) Configure a firewall and outbound access

If you are using secure cluster connectivity (the default as of September 1, 2020), use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to:

  • If the firewall or proxy appliance is in the same VPC as the Databricks workspace VPC, route the traffic and configure it to allow the following connections.

  • If the firewall or proxy appliance is in a different VPC or an on-premises network, route 0.0.0.0/0 to that VPC or network first and configure the proxy appliance to allow the following connections.

Important

Databricks strongly recommends that you specify destinations as domain names in your egress infrastructure, rather than as IP addresses.

Allow the following outgoing connections. For each connection type, follow the link to get IP addresses or domains for your workspace region.

  • Databricks web application: Required. Also used for REST API calls to your workspace.

    Webapp addresses

  • Databricks secure cluster connectivity (SCC) relay: Required if your workspace uses secure cluster connectivity, which is the default for workspaces in accounts on the E2 version of the platform as of September 1, 2020.

    SCC relay addresses

  • AWS S3 global URL: Required by Databricks to access the root S3 bucket. Use s3.amazonaws.com:443, regardless of region.

  • AWS S3 regional URL: Optional. If you use S3 buckets that might be in other regions, you must also allow the S3 regional endpoint. Although AWS provides a domain and port for a regional endpoint (s3.<region-name>.amazonaws.com:443), Databricks recommends that you instead use a VPC endpoint so that this traffic goes through the private tunnel over the AWS network backbone. See (Recommended) Configure regional endpoints.

  • AWS STS global URL: Required. Use the following address and port, regardless of region: sts.amazonaws.com:443

  • AWS STS regional URL: Required due to expected switch to regional endpoint. Use a VPC endpoint. See (Recommended) Configure regional endpoints.

  • AWS Kinesis regional URL: Required. The Kinesis endpoint is used to capture logs needed to manage and monitor the software. For the URL for your region, see Kinesis addresses.

  • Table metastore RDS regional URL (by compute plane region): Required if your Databricks workspace uses the default Hive metastore.

    The Hive metastore is always in the same region as your compute plane, but it might be in a different region than the control plane.

    RDS addresses for legacy Hive metastore

    Note

    Instead of using the default Hive metastore, you can choose to implement your own table metastore instance, in which case you are responsible for its network routing.

  • Control plane infrastructure: Required. Used by Databricks for standby Databricks infrastructure to improve the stability of Databricks services.

    Control plane infrastructure addresses
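
The S3 and STS items above recommend VPC endpoints so that this traffic stays on the AWS network backbone. As a minimal boto3 sketch (the region, VPC, route table, subnet, and security group IDs are placeholders; your endpoint policies and DNS settings may need further configuration):

import boto3

ec2 = boto3.client("ec2", region_name="<region>")

# S3 uses a Gateway endpoint attached to the workspace subnets' route tables.
ec2.create_vpc_endpoint(
    VpcId="<vpc-id>",
    VpcEndpointType="Gateway",
    ServiceName="com.amazonaws.<region>.s3",
    RouteTableIds=["<workspace-subnet-route-table-id>"],
)

# STS uses an Interface endpoint placed in the workspace subnets.
ec2.create_vpc_endpoint(
    VpcId="<vpc-id>",
    VpcEndpointType="Interface",
    ServiceName="com.amazonaws.<region>.sts",
    SubnetIds=["<subnet-id-az1>", "<subnet-id-az2>"],
    SecurityGroupIds=["<security-group-id>"],
    PrivateDnsEnabled=True,
)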

Required compute plane addresses

The contents of this section have moved to IP addresses and domains.

Troubleshoot regional endpoints

If you followed the instructions above and the VPC endpoints do not work as intended, for example, if your data sources are inaccessible or if the traffic is bypassing the endpoints, you can use one of two approaches to add support for the regional endpoints for S3 and STS instead of using VPC endpoints.

  1. Add the environment variable AWS_REGION in the cluster configuration and set it to your AWS region. To enable it for all clusters, use cluster policies. You might have already configured this environment variable to use DBFS mounts.

  2. Add the required Apache Spark configuration. Use exactly one of the following approaches:

    • In each source notebook:

      %scala
      spark.conf.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
      spark.conf.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
      
      %python
      spark.conf.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
      spark.conf.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
      
    • Alternatively, in the Apache Spark config for the cluster:

      spark.hadoop.fs.s3a.endpoint https://s3.<region>.amazonaws.com
      spark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint https://sts.<region>.amazonaws.com
      
  3. If you limit egress from the classic compute plane using a firewall or internet appliance, add these regional endpoint addresses to your allow list.

To set these values for all clusters, configure the values as part of your cluster policy.
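
As a sketch of such a cluster policy created through the Cluster Policies API (the policy name, workspace URL, token, and region are placeholders; the keys mirror the Spark config shown above):

import json
import requests

# Force the regional endpoints on every cluster governed by this policy.
policy_definition = {
    "spark_conf.fs.s3a.endpoint": {
        "type": "fixed", "value": "https://s3.<region>.amazonaws.com"
    },
    "spark_conf.fs.s3a.stsAssumeRole.stsEndpoint": {
        "type": "fixed", "value": "https://sts.<region>.amazonaws.com"
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/policies/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"name": "regional-endpoints", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()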

(Optional) Access S3 using instance profiles

To access S3 mounts using instance profiles, set the following Spark configurations:

  • Either in each source notebook:

    %scala
    spark.conf.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
    spark.conf.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
    
    %python
    spark.conf.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
    spark.conf.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
    
  • Or in the Apache Spark config for the cluster:

    spark.hadoop.fs.s3a.endpoint https://s3.<region>.amazonaws.com
    spark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint https://sts.<region>.amazonaws.com
    

To set these values for all clusters, configure the values as part of your cluster policy.

Warning

For the S3 service, there are limitations to applying additional regional endpoint configurations at the notebook or cluster level. Notably, cross-region S3 access is blocked, even if the global S3 URL is allowed in your egress firewall or proxy. If your Databricks deployment might require cross-region S3 access, it is important that you not apply the Spark configuration at the notebook or cluster level.

(Optional) Restrict access to S3 buckets

Most reads from and writes to S3 are self-contained within the compute plane. However, some management operations originate from the control plane, which is managed by Databricks. To limit access to S3 buckets to a specified set of source IP addresses, create an S3 bucket policy. In the bucket policy, include the IP addresses in the aws:SourceIp list. If you use a VPC Endpoint, allow access to it by adding it to the policy’s aws:sourceVpce.

For more information about S3 bucket policies, see the bucket policy examples in the Amazon S3 documentation. Working example bucket policies are also included in this topic.

Requirements for bucket policies

Your bucket policy must meet these requirements to ensure that your clusters start correctly and that you can connect to them:

Note

When deploying a new workspace with S3 bucket policy restrictions, you must allow access to the control plane NAT-IP for the us-west region; otherwise, the deployment fails. After the workspace is deployed, you can remove the us-west entry and update the control plane NAT-IP to reflect your region.

Required IPs and storage buckets

For the IP addresses and domains that you need for configuring S3 bucket policies and VPC Endpoint policies to restrict access to your workspace’s S3 buckets, see Control plane and storage bucket addresses.

Example bucket policies

These examples use placeholder text to indicate where to specify recommended IP addresses and required storage buckets. Review the requirements to ensure that your clusters start correctly and that you can connect to them.

Restrict access to the Databricks control plane, compute plane, and trusted IPs:

This S3 bucket policy uses a Deny condition to selectively allow access from the control plane, NAT gateway, and corporate VPN IP addresses you specify. Replace the placeholder text with values for your environment. You can add any number of IP addresses to the policy. Create one policy per S3 bucket you want to protect.

Important

If you use VPC Endpoints, this policy is not complete. See Restrict access to the Databricks control plane, VPC endpoints, and trusted IPs.

{
  "Sid": "IPDeny",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::<S3-BUCKET>",
    "arn:aws:s3:::<S3-BUCKET>/*"
  ],
  "Condition": {
    "NotIpAddress": {
      "aws:SourceIp": [
        "<CONTROL-PLANE-NAT-IP>",
        "<DATA-PLANE-NAT-IP>",
        "<CORPORATE-VPN-IP>"
      ]
    }
  }
}
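
Note that the snippet above is a single policy statement; it must be wrapped in a full policy document before it is applied to the bucket. A minimal sketch with boto3 (the bucket name and IP addresses are placeholders):

import json
import boto3

statement = {
    "Sid": "IPDeny",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::<S3-BUCKET>", "arn:aws:s3:::<S3-BUCKET>/*"],
    "Condition": {
        "NotIpAddress": {
            "aws:SourceIp": [
                "<CONTROL-PLANE-NAT-IP>",
                "<DATA-PLANE-NAT-IP>",
                "<CORPORATE-VPN-IP>",
            ]
        }
    },
}

# Wrap the statement in a complete policy document and apply it to the bucket.
policy = {"Version": "2012-10-17", "Statement": [statement]}
boto3.client("s3").put_bucket_policy(Bucket="<S3-BUCKET>", Policy=json.dumps(policy))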

Restrict access to the Databricks control plane, VPC endpoints, and trusted IPs:

If you use a VPC Endpoint to access S3, you must add a second condition to the policy. This condition allows access from your VPC Endpoint by adding it to the aws:sourceVpce list.

This bucket policy selectively allows access from your VPC Endpoint and from the control plane and corporate VPN IP addresses you specify.

When using VPC Endpoints, you can use a VPC Endpoint policy instead of an S3 bucket policy. A VPCE policy must allow access to your root S3 bucket and also to the required artifact, log, and shared datasets bucket for your region. You can learn about VPC Endpoint policies in the AWS documentation.

Replace the placeholder text with values for your environment.

{
  "Sid": "IPDeny",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::<S3-BUCKET>",
    "arn:aws:s3:::<S3-BUCKET>/*"
  ],
  "Condition": {
    "NotIpAddressIfExists": {
      "aws:SourceIp": [
        "<CONTROL-PLANE-NAT-IP>",
        "<CORPORATE-VPN-IP>"
      ]
    },
    "StringNotEqualsIfExists": {
      "aws:sourceVpce": "<VPCE-ID>"
    }
  }
}