Customer-managed VPC

Important

This feature requires that your account is on the E2 version of the Databricks platform. All new Databricks accounts and most existing accounts are now E2. If you are unsure which account type you have, contact your Databricks representative.

Important

This article mentions the term data plane, which is the compute layer of the Databricks platform. In the context of this article, data plane refers to the Classic data plane in your AWS account. By contrast, the Serverless data plane that supports Serverless SQL endpoints (Public Preview) runs in the Databricks AWS account. To learn more, see Serverless compute.

Overview

By default, clusters are created in a single AWS VPC (Virtual Private Cloud) that Databricks creates and configures in your AWS account. You can optionally create your Databricks workspaces in your own VPC, a feature known as customer-managed VPC, which can allow you to exercise more control over the infrastructure and help you comply with specific cloud security and governance standards your organization may require.

Important

To configure your workspace to use AWS PrivateLink (Public Preview) for any type of connection, your workspace must use a customer-managed VPC.

A customer-managed VPC is a good solution if you have:

  • Information security policies that prevent PaaS providers from creating VPCs in your own AWS account.
  • An approval process to create a new VPC, in which the VPC is configured and secured in a well-documented way by internal information security or cloud engineering teams.

Benefits include:

  • Lower privilege level: You maintain more control of your own AWS account, and you don’t need to grant Databricks as many permissions via the cross-account IAM role as you do for a Databricks-managed VPC. For example, there is no need for permission to create VPCs. This limited set of permissions can make it easier to get approval to use Databricks in your platform stack.
  • Simplified network operations: Better network space utilization. You can optionally configure smaller subnets for a workspace than the default CIDR /16, and there is no need for the complex VPC peering configurations that might be necessary with other solutions.
  • Consolidation of VPCs: Multiple Databricks workspaces can share a single data plane VPC, which is often preferred for billing and instance management.
  • Limit outgoing connections: By default, the data plane does not limit outgoing connections from Databricks Runtime workers. For workspaces that are configured to use a customer-managed VPC, you can use an egress firewall or proxy appliance to limit outbound traffic to a list of allowed internal or external data sources.

To take advantage of a customer-managed VPC, you must specify that VPC when you first create the Databricks workspace. You cannot move an existing Databricks-managed workspace to your own VPC. You can, however, move an existing workspace from one of your own VPCs to another. See Update a running workspace.

Configure a workspace to use your own VPC

To deploy a workspace in your own VPC, you must:

  1. Create the VPC following the requirements enumerated in VPC requirements.

  2. Register your VPC network configuration with Databricks when you create the workspace using the Account API. See Step 4: Configure customer-managed VPC (optional, but required if you use PrivateLink).

    You must provide the VPC ID, subnet IDs, and security group ID when you register the VPC with Databricks.
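The registration call might look like the following Python sketch. It assumes the E2 Account API networks endpoint and account-level basic authentication; the account ID, credentials, and resource IDs are placeholders to replace with your own values.

import requests

ACCOUNT_ID = "<databricks-account-id>"
BASE_URL = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"

# Network configuration for the customer-managed VPC (placeholder IDs).
payload = {
    "network_name": "my-databricks-network",
    "vpc_id": "vpc-0123456789abcdef0",
    "subnet_ids": ["subnet-aaaa1111", "subnet-bbbb2222"],   # two different AZs
    "security_group_ids": ["sg-0123456789abcdef0"],         # 1-5 security groups
}

resp = requests.post(
    f"{BASE_URL}/networks",
    auth=("<account-admin-email>", "<password>"),            # account credentials
    json=payload,
)
resp.raise_for_status()

# Pass the returned network_id when you create the workspace.
print("network_id:", resp.json()["network_id"])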

VPC requirements

Your VPC must meet the requirements described in this section in order to host a Databricks workspace.

VPC region

Workspace data plane VPCs can be in AWS regions ap-northeast-1, ap-south-1, ap-southeast-1, ap-southeast-2, ca-central-1, eu-west-1, eu-west-2, eu-central-1, us-east-1, us-east-2, us-west-1, and us-west-2. However, you cannot use a VPC in us-west-1 if you want to use customer-managed keys to encrypt managed services or workspace storage.

VPC sizing

You can share one VPC with multiple workspaces in a single AWS account. However, you cannot reuse subnets or security groups between workspaces. Be sure to size your VPC and subnets accordingly. Databricks assigns two IP addresses per node, one for management traffic and one for Apache Spark applications. The total number of instances for each subnet is equal to half the number of IP addresses that are available. Learn more in Subnets.
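As a rough illustration of the sizing arithmetic, the following Python sketch estimates the maximum node count per subnet, assuming AWS's standard reservation of five addresses in every subnet:

def max_nodes_per_subnet(prefix_length: int) -> int:
    """Estimate how many Databricks nodes fit in a subnet of the given size."""
    total_ips = 2 ** (32 - prefix_length)
    usable_ips = total_ips - 5      # AWS reserves 5 addresses per subnet
    return usable_ips // 2          # Databricks assigns 2 IP addresses per node

for prefix in (26, 21, 17):
    print(f"/{prefix}: roughly {max_nodes_per_subnet(prefix)} nodes")
# /26: roughly 29 nodes, /21: roughly 1021 nodes, /17: roughly 16381 nodes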

VPC IP address ranges

Databricks doesn’t limit netmasks for the workspace VPC, but each workspace subnet must have a netmask between /17 and /26. This means that if your workspace has two subnets and both have a netmask of /26, then the netmask for your workspace VPC must be /25 or smaller (a smaller prefix number, and therefore a larger address range).

Important

If you have configured secondary CIDR blocks for your VPC, make sure that the subnets for the Databricks workspace are configured with the same VPC CIDR block.

DNS

The VPC must have DNS hostnames and DNS resolution enabled.
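If you manage the VPC programmatically, a minimal boto3 sketch for enabling both attributes (placeholder VPC ID and example region) is:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # example region
vpc_id = "vpc-0123456789abcdef0"                      # placeholder VPC ID

# Each VPC attribute must be set in a separate call.
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})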

Subnets

Databricks must have access to at least two subnets for each workspace, with each subnet in a different availability zone. You cannot specify more than one Databricks workspace subnet per availability zone in the Create network configuration API call. Your network setup can include more than one subnet per availability zone, but you can choose only one subnet per availability zone for the Databricks workspace.

Databricks assigns two IP addresses per node, one for management traffic and one for Spark applications. The total number of instances for each subnet is equal to half of the number of IP addresses that are available.

Each subnet must have a netmask between /17 and /26.

Important

The subnets that you specify for a customer-managed VPC must be reserved for one Databricks workspace only. You cannot share these subnets with any other resources, including other Databricks workspaces.

Additional subnet requirements

  • Subnets must be private.
  • Subnets must have outbound access to the public network using a NAT gateway and internet gateway, or other similar customer-managed appliance infrastructure.
  • The NAT gateway must be set up in its own subnet that routes quad-zero (0.0.0.0/0) traffic to an internet gateway or other customer-managed appliance infrastructure.

Important

A workspace using secure cluster connectivity (the default after September 1, 2020) must have outbound access from the VPC to the public network.

Subnet route table

The route table for workspace subnets must have quad-zero (0.0.0.0/0) traffic that targets the appropriate network device. If the workspace uses secure cluster connectivity (which is the default for new workspaces after September 1, 2020), quad-zero traffic must target a NAT Gateway or your own managed NAT device or proxy appliance.

Important

Databricks requires subnets to add 0.0.0.0/0 to your allow list. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See Firewall appliance infrastructure.

This is a base guideline only. Your configuration requirements may differ. For questions, contact your Databricks representative.
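For example, the quad-zero route described above could be added with a boto3 sketch like the following (placeholder IDs; substitute your NAT gateway or other egress appliance):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Route all quad-zero (0.0.0.0/0) traffic from the workspace subnets
# to a NAT gateway (or replace NatGatewayId with your own appliance target).
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0123456789abcdef0",
)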

Security groups

Databricks must have access to at least one AWS security group and no more than five security groups. You can reuse existing security groups rather than create new ones.

Security groups must have the following rules:

Egress (outbound):

  • Allow all TCP and UDP access to the workspace security group (for internal traffic)
  • Allow TCP access to 0.0.0.0/0 for these ports:
    • 443: for Databricks infrastructure, cloud data sources, and library repositories
    • 3306: for the metastore
    • 6666: only required if you use PrivateLink

Ingress (inbound): Required for all workspaces (these can be separate rules or combined into one):

  • Allow TCP on all ports when traffic source uses the same security group
  • Allow UDP on all ports when traffic source uses the same security group
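The rules above could be created with a boto3 sketch like the following (placeholder security group ID; note that a newly created security group already has a default allow-all egress rule, which you may want to revoke if you need tighter control):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
sg_id = "sg-0123456789abcdef0"   # placeholder workspace security group

# Ingress: allow all TCP and UDP traffic whose source is this same security group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[
        {"IpProtocol": proto, "FromPort": 0, "ToPort": 65535,
         "UserIdGroupPairs": [{"GroupId": sg_id}]}
        for proto in ("tcp", "udp")
    ],
)

# Egress: all TCP/UDP to the same security group for internal traffic, plus
# TCP to 0.0.0.0/0 on 443, 3306, and 6666 (6666 only if you use PrivateLink).
egress_rules = [
    {"IpProtocol": proto, "FromPort": 0, "ToPort": 65535,
     "UserIdGroupPairs": [{"GroupId": sg_id}]}
    for proto in ("tcp", "udp")
] + [
    {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
    for port in (443, 3306, 6666)
]
ec2.authorize_security_group_egress(GroupId=sg_id, IpPermissions=egress_rules)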

Subnet-level network ACLs

Subnet-level network ACLs must not deny ingress or egress to any traffic. Databricks validates the following rules when creating the workspace:

  • Ingress: ALLOW ALL from source 0.0.0.0/0
  • Egress:
    • Allow all traffic to the workspace VPC CIDR, for internal traffic
    • Allow TCP access to 0.0.0.0/0 for these ports:
      • 443: for Databricks infrastructure, cloud data sources, and library repositories
      • 3306: for the metastore
      • 6666: only required if you use PrivateLink

Important

If you configure additional ALLOW or DENY rules for outbound traffic, set the rules required by Databricks to the highest priority (the lowest rule numbers), so that they take precedence.

Note

Databricks requires subnet-level network ACLs to add 0.0.0.0/0 to your allow list. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See Firewall appliance infrastructure.
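If you manage custom network ACLs with boto3, a minimal sketch of allow-all entries that pass the validation above (placeholder ACL ID) looks like the following; the default network ACL already provides equivalent rules:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
acl_id = "acl-0123456789abcdef0"   # placeholder network ACL ID

# One ingress entry and one egress entry that allow all traffic,
# mirroring what the default network ACL already provides.
for egress in (False, True):
    ec2.create_network_acl_entry(
        NetworkAclId=acl_id,
        RuleNumber=100,        # low rule number = high priority
        Protocol="-1",         # all protocols
        RuleAction="allow",
        Egress=egress,
        CidrBlock="0.0.0.0/0",
    )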

Firewall appliance infrastructure

If you are using secure cluster connectivity (the default as of September 1, 2020), use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to:

  • If the firewall or proxy appliance is in the same VPC as the Databricks workspace VPC, route the traffic and configure the appliance to allow the following connections.
  • If the firewall or proxy appliance is in a different VPC or an on-premises network, route 0.0.0.0/0 to that VPC or network first, and configure the proxy appliance to allow the following connections.

Allow the following connections:

  • Databricks web application: Required. Also used for REST API calls to your workspace.
  • Databricks secure cluster connectivity (SCC) relay: Required if your workspace uses secure cluster connectivity, which is the default for workspaces in accounts on the E2 version of the platform as of September 1, 2020.
  • AWS S3 global URL: Required by Databricks to access the root S3 bucket.
  • AWS S3 regional URL: Optional. However, you likely use other S3 buckets, in which case you must also allow the S3 regional endpoint. Databricks recommends creating an S3 VPC endpoint instead so that this traffic goes through the private tunnel over the AWS network backbone.
  • AWS STS global URL: Required.
  • AWS STS regional URL: Required because Databricks expects to switch to the regional STS endpoint.
  • AWS Kinesis regional URL: The Kinesis endpoint is used to capture logs needed to manage and monitor the software. For most regions, use the regional URL. However, for VPCs in us-west-1, a Kinesis VPC endpoint does not take effect, and you must ensure that the Kinesis URL for us-west-2 (not us-west-1) is allowed. Databricks recommends that you create a Kinesis VPC endpoint instead so that this traffic goes through the private tunnel over the AWS network backbone.
  • Table metastore RDS regional URL (by data plane region): Required if your Databricks workspace uses the default Hive metastore, which is always in the same region as your data plane. This means that it might be in the same geography as the control plane but in a different region. Instead of using the default Hive metastore, you can choose to implement your own table metastore instance, in which case you are responsible for its network routing.

Required data plane addresses

Allow connections to the following addresses for your region:

Endpoint VPC region Address Port
Webapp ap-northeast-1 tokyo.cloud.databricks.com 443
  ap-south-1 mumbai.cloud.databricks.com 443
  ap-southeast-1 singapore.cloud.databricks.com 443
  ap-southeast-2 sydney.cloud.databricks.com 443
  ca-central-1 canada.cloud.databricks.com 443
  eu-central-1 frankfurt.cloud.databricks.com 443
  eu-west-1 ireland.cloud.databricks.com 443
  eu-west-2 london.cloud.databricks.com 443
  us-east-1 nvirginia.cloud.databricks.com 443
  us-east-2 ohio.cloud.databricks.com 443
  us-west-1 oregon.cloud.databricks.com 443
  us-west-2 oregon.cloud.databricks.com 443
SCC relay ap-northeast-1 tunnel.ap-northeast-1.cloud.databricks.com 443
  ap-south-1 tunnel.ap-south-1.cloud.databricks.com 443
  ap-southeast-1 tunnel.ap-southeast-1.cloud.databricks.com 443
  ap-southeast-2 tunnel.ap-southeast-2.cloud.databricks.com 443
  ca-central-1 tunnel.ca-central-1.cloud.databricks.com 443
  eu-central-1 tunnel.eu-central-1.cloud.databricks.com 443
  eu-west-1 tunnel.eu-west-1.cloud.databricks.com 443
  eu-west-2 tunnel.eu-west-2.cloud.databricks.com 443
  us-east-1 tunnel.us-east-1.cloud.databricks.com 443
  us-east-2 tunnel.us-east-2.cloud.databricks.com 443
  us-west-1 tunnel.cloud.databricks.com 443
  us-west-2 tunnel.cloud.databricks.com 443
S3 global for root bucket all s3.amazonaws.com 443
S3 regional for other buckets: Databricks recommends a VPC endpoint instead all s3.<region-name>.amazonaws.com 443
STS global all sts.amazonaws.com 443
Kinesis: Databricks recommends a VPC endpoint instead Most regions kinesis.<region-name>.amazonaws.com 443
  us-west-1 kinesis.us-west-2.amazonaws.com 443
RDS (if using built-in metastore) ap-northeast-1 mddx5a4bpbpm05.cfrfsun7mryq.ap-northeast-1.rds.amazonaws.com 3306
  ap-south-1 mdjanpojt83v6j.c5jml0fhgver.ap-south-1.rds.amazonaws.com 3306
  ap-southeast-1 md1n4trqmokgnhr.csnrqwqko4ho.ap-southeast-1.rds.amazonaws.com 3306
  ap-southeast-2 mdnrak3rme5y1c.c5f38tyb1fdu.ap-southeast-2.rds.amazonaws.com 3306
  ca-central-1 md1w81rjeh9i4n5.co1tih5pqdrl.ca-central-1.rds.amazonaws.com 3306
  eu-central-1 mdv2llxgl8lou0.ceptxxgorjrc.eu-central-1.rds.amazonaws.com 3306
  eu-west-1 md15cf9e1wmjgny.cxg30ia2wqgj.eu-west-1.rds.amazonaws.com 3306
  eu-west-2 mdio2468d9025m.c6fvhwk6cqca.eu-west-2.rds.amazonaws.com 3306
  us-east-1 mdb7sywh50xhpr.chkweekm4xjq.us-east-1.rds.amazonaws.com 3306
  us-east-2 md7wf1g369xf22.cluz8hwxjhb6.us-east-2.rds.amazonaws.com 3306
  us-west-1 mdzsbtnvk0rnce.c13weuwubexq.us-west-1.rds.amazonaws.com 3306
  us-west-2 mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com 3306
Databricks control plane infrastructure ap-northeast-1 35.72.28.0/28 443
  ap-south-1 65.0.37.64/28 443
  ap-southeast-1 13.214.1.96/28 443
  ap-southeast-2 3.26.4.0/28 443
  ca-central-1 3.96.84.208/28 443
  eu-west-1 3.250.244.112/28 443
  eu-west-2 18.134.65.240/28 443
  eu-central-1 18.159.44.32/28 443
  us-east-1 3.237.73.224/28 443
  us-east-2 3.128.237.208/28 443
  us-west-1 and us-west-2 44.234.192.32/28 443
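To verify that your egress path permits these connections, you can run a quick check from an EC2 instance in the workspace subnets or from a notebook once clusters can start. The following Python sketch uses the us-east-1 addresses from the table above; substitute the values for your own region:

import socket

endpoints = [
    ("nvirginia.cloud.databricks.com", 443),          # webapp
    ("tunnel.us-east-1.cloud.databricks.com", 443),   # SCC relay
    ("s3.amazonaws.com", 443),                         # S3 global
    ("sts.amazonaws.com", 443),                        # STS global
    ("kinesis.us-east-1.amazonaws.com", 443),          # Kinesis regional
    ("mdb7sywh50xhpr.chkweekm4xjq.us-east-1.rds.amazonaws.com", 3306),  # metastore
]

for host, port in endpoints:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"OK    {host}:{port}")
    except OSError as exc:
        print(f"FAIL  {host}:{port} ({exc})")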

Regional endpoints

If you use a customer-managed VPC (optional) and secure cluster connectivity (the default as of September 1, 2020), you may prefer to configure your VPC to use only regional VPC endpoints to AWS services for more direct connections and reduced cost compared to AWS global endpoints. There are four AWS services that a Databricks workspace with a customer-managed VPC must reach: STS, S3, Kinesis, and RDS.

The connection from your VPC to the RDS service is required only if you use the default Databricks metastore. There is no VPC endpoint for RDS, but instead of using the default Databricks metastore, you can configure your own external metastore, implemented with either a Hive metastore or AWS Glue.

For the other three services, you can create VPC gateway or interface endpoints so that the relevant in-region traffic from clusters transits over the secure AWS backbone rather than the public network:

  • S3: Create a VPC gateway endpoint that is directly accessible from your Databricks cluster subnets. This causes workspace traffic to all in-region S3 buckets to use the endpoint route. To access any cross-region buckets, open access to the S3 global URL s3.amazonaws.com in your egress appliance, or route 0.0.0.0/0 to an AWS internet gateway.

    To use DBFS FUSE with regional endpoints enabled:

    • For Databricks Runtime 6.x and above, set the environment variable AWS_REGION=<aws-region-code> in the cluster configuration. For example, if your workspace is deployed in the N. Virginia region, set AWS_REGION=us-east-1. To enforce it for all clusters, use cluster policies.
    • For Databricks Runtime 5.5 LTS, set both AWS_REGION=<aws-region-code> and DBFS_FUSE_VERSION=2.
  • STS: Create a VPC interface endpoint directly accessible from your Databricks cluster subnets. You can create this endpoint in your workspace subnets. Databricks recommends that you use the same security group that was created for your workspace VPC. This configuration causes workspace traffic to STS to use the endpoint route.

  • Kinesis: Create a VPC interface endpoint directly accessible from your Databricks cluster subnets. You can create this endpoint in your workspace subnets. Databricks recommends that you use the same security group that was created for your workspace VPC. This configuration causes workspace traffic to Kinesis to use the endpoint route. The exception is workspaces in the AWS region us-west-1, where the target Kinesis streams are cross-region in us-west-2, so the endpoint route does not apply.
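A boto3 sketch of creating all three endpoints (placeholder IDs, us-east-1 as the example region) might look like this:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
vpc_id = "vpc-0123456789abcdef0"
route_table_ids = ["rtb-0123456789abcdef0"]          # workspace subnet route tables
subnet_ids = ["subnet-aaaa1111", "subnet-bbbb2222"]  # workspace subnets
sg_ids = ["sg-0123456789abcdef0"]                    # workspace security group

# S3: gateway endpoint attached to the workspace route tables.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=vpc_id,
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=route_table_ids,
)

# STS and Kinesis: interface endpoints in the workspace subnets.
for service in ("sts", "kinesis-streams"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        SubnetIds=subnet_ids,
        SecurityGroupIds=sg_ids,
        PrivateDnsEnabled=True,
    )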

Troubleshoot regional endpoints

If the VPC endpoints do not work as intended, for example if your data sources are inaccessible or if the traffic is bypassing the endpoints, use one of the following approaches:

  1. Add the environment variable AWS_REGION in the cluster configuration and set it to your AWS region. To enable it for all clusters, use cluster policies. You may have already configured this environment variable to use DBFS FUSE.

  2. Add the required Apache Spark configuration:

    • Either in each source notebook:
    %scala
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
    sc.hadoopConfiguration.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
    
    %python
    sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
    sc._jsc.hadoopConfiguration().set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
    
    • Or in the Apache Spark config for the cluster:
    spark.hadoop.fs.s3a.endpoint https://s3.<region>.amazonaws.com
    spark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint https://sts.<region>.amazonaws.com
    

To set these values for all clusters, configure the values as part of your cluster policy.
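As an illustration, the following Python sketch creates a cluster policy that pins both the AWS_REGION environment variable and the Spark configuration values shown above to fixed values. The workspace URL, token handling, and policy name are placeholders, and the API path assumes the Cluster Policies API; adapt it to your environment:

import json
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Fix the region and regional endpoints for every cluster created with this policy.
definition = {
    "spark_env_vars.AWS_REGION": {
        "type": "fixed", "value": "us-east-1"},
    "spark_conf.spark.hadoop.fs.s3a.endpoint": {
        "type": "fixed", "value": "https://s3.us-east-1.amazonaws.com"},
    "spark_conf.spark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint": {
        "type": "fixed", "value": "https://sts.us-east-1.amazonaws.com"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "regional-endpoints", "definition": json.dumps(definition)},
)
resp.raise_for_status()
print("Created policy:", resp.json()["policy_id"])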

(Optional) Restrict outbound access

If you perform this optional step to restrict outbound access, be aware that your workspace will not have cross-region access to AWS S3.

To restrict outbound access:

  1. Explicitly allow connectivity for the web application, secure cluster connectivity relay, and metastore endpoints in your egress routing infrastructure, which may consist of a NAT gateway or a third-party appliance with internet connectivity. You can find the URLs in the Firewall appliance infrastructure section of this article.

    Important

    Databricks strongly recommends that you specify destinations as domain names in your egress infrastructure, rather than as IP addresses.

  2. Remove the route for destination 0.0.0.0/0 that targets the NAT gateway. You might not have this route, depending on how you created your VPC. If you do not have it, skip this step.
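If you manage the route table with boto3, a minimal sketch of this step (placeholder route table ID) is:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Remove the quad-zero route that targets the NAT gateway. Only do this
# after your allowed-list egress routing is in place.
ec2.delete_route(
    RouteTableId="rtb-0123456789abcdef0",
    DestinationCidrBlock="0.0.0.0/0",
)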

(Optional) Access S3 mounts using instance profiles

To access S3 mounts using instance profiles, set the following Spark configurations:

  • Either in each source notebook:

    %scala
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
    sc.hadoopConfiguration.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
    
    %python
    sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
    sc._jsc.hadoopConfiguration().set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
    
  • Or in the Apache Spark config for the cluster:

    spark.hadoop.fs.s3a.endpoint https://s3.<region>.amazonaws.com
    spark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint https://sts.<region>.amazonaws.com
    

To set these values for all clusters, configure the values as part of your cluster policy.

Warning

For the S3 service, there are limitations to applying additional regional endpoint configurations at the notebook or cluster level. Notably, cross-region S3 access is blocked, even if the global S3 URL is allowed in your egress firewall or proxy. If your Databricks deployment might require cross-region S3 access, it is important that you not apply the Spark configuration at the notebook or cluster level.

(Optional) Restrict access to your S3 buckets

Most reads from and writes to S3 are self-contained within the data plane. However, some management operations originate from the control plane, which is managed by Databricks. To limit access to S3 buckets to a specified set of source IP addresses, create an S3 bucket policy. In the bucket policy, include the IP addresses in the aws:SourceIp list. If you use a VPC endpoint, allow access from it by adding its ID to the policy’s aws:sourceVpce list.

For more information about S3 bucket policies, see Limiting access to specific IP addresses in the Amazon S3 documentation. Working example bucket policies are also included in this topic.

Requirements for bucket policies

Your bucket policy must meet these requirements to ensure that your clusters start correctly and that you can connect to them:

Required IPs and storage buckets

This table includes information you need when using S3 bucket policies and VPC Endpoint policies to restrict access to your workspace’s S3 buckets.

Region Control plane NAT IP Artifact storage bucket Log storage bucket Shared datasets bucket
ap-northeast-1 18.177.16.95/32 databricks-prod-artifacts-ap-northeast-1 databricks-prod-storage-tokyo databricks-datasets-tokyo
ap-south-1 13.232.248.161/32 databricks-prod-artifacts-ap-south-1 databricks-prod-storage-mumbai databricks-datasets-mumbai
ap-southeast-1 13.213.212.4/32 databricks-prod-artifacts-ap-southeast-1 databricks-prod-storage-singapore databricks-datasets-singapore
ap-southeast-2 13.237.96.217/32 databricks-prod-artifacts-ap-southeast-2 databricks-prod-storage-sydney databricks-datasets-sydney
ca-central-1 35.183.59.105/32 databricks-prod-artifacts-ca-central-1 databricks-prod-storage-montreal databricks-datasets-montreal
eu-central-1 18.159.32.64/32 databricks-prod-artifacts-eu-central-1 databricks-prod-storage-frankfurt databricks-datasets-frankfurt
eu-west-1 46.137.47.49/32 databricks-prod-artifacts-eu-west-1 databricks-prod-storage-ireland databricks-datasets-ireland
eu-west-2 3.10.112.150/32 databricks-prod-artifacts-eu-west-2 databricks-prod-storage-london databricks-datasets-london
us-east-1 54.156.226.103/32 databricks-prod-artifacts-us-east-1 databricks-prod-storage-virginia databricks-datasets-virginia
us-east-2 18.221.200.169/32 databricks-prod-artifacts-us-east-2 databricks-prod-storage-ohio databricks-datasets-ohio
us-west-1 52.27.216.188/32 databricks-prod-artifacts-us-west-2 databricks-prod-storage-oregon databricks-datasets-oregon
us-west-2 52.27.216.188/32 databricks-prod-artifacts-us-west-2 databricks-prod-storage-oregon databricks-datasets-oregon

Example bucket policies

These examples use placeholder text to indicate where to specify recommended IP addresses and required storage buckets. Review the requirements to ensure that your clusters start correctly and that you can connect to them.

Restrict access to the Databricks control plane, data plane, and trusted IPs:

This S3 bucket policy uses a Deny condition to selectively allow access from the control plane, NAT gateway, and corporate VPN IP addresses you specify. Replace the placeholder text with values for your environment. You can add any number of IP addresses to the policy. Create one policy per S3 bucket you want to protect.

Important

If you use VPC Endpoints, this policy is not complete. See Restrict access to the Databricks control plane, VPC endpoints, and trusted IPs.

{
  "Sid": "IPAllow",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::<S3-BUCKET>",
    "arn:aws:s3:::<S3-BUCKET>/*"
  ],
  "Condition": {
    "NotIpAddress": {
      "aws:SourceIp": [
        "<CONTROL-PLANE-NAT-IP>",
        "<DATA-PLANE-NAT-IP>",
        "<CORPORATE-VPN-IP>"
      ]
    }
  }
}

Restrict access to the Databricks control plane, VPC endpoints, and trusted IPs:

If you use a VPC Endpoint to access S3, you must add a second condition to the policy. This condition allows access from your VPC Endpoint by adding it to the aws:sourceVpce list.

This bucket policy selectively allows access from your VPC Endpoint, and from the control plane and corporate VPN IP addresses you specify.

When using VPC Endpoints, you can use a VPC Endpoint policy instead of an S3 bucket policy. A VPCE policy must allow access to your root S3 bucket and also to the required artifact, log, and shared datasets buckets for your region. You can learn about VPC Endpoint policies in the AWS documentation.

Replace the placeholder text with values for your environment.

{
  "Sid": "IPAllow",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::<S3-BUCKET>",
    "arn:aws:s3:::<S3-BUCKET>/*"
  ],
  "Condition": {
    "NotIpAddressIfExists": {
      "aws:SourceIp": [
        "<CONTROL-PLANE-NAT-IP>",
        "<CORPORATE-VPN-IP>"
      ]
    },
    "StringNotEqualsIfExists": {
      "aws:sourceVpce": "<VPCE-ID>"
    }
  }
}