Customer-managed VPC

Important

This feature requires that your account be on the E2 version of the Databricks platform. To deploy a workspace to a customer-managed VPC, you must create the workspace using the Account API or the Account Console for accounts on the E2 version of the platform.

Overview

By default, workspace clusters are created in a single AWS VPC (Virtual Private Cloud) that Databricks creates and configures in your AWS account. You can optionally create your Databricks workspaces in your own VPC, a feature known as customer-managed VPC. Doing so gives you more control over the infrastructure and can help you comply with specific cloud security and governance standards your organization may require.

A customer-managed VPC is a good solution if you have:

  • Information security policies that prevent PaaS providers from creating VPCs in your own AWS account.
  • An approval process to create a new VPC, in which the VPC is configured and secured in a well-documented way by internal information security or cloud engineering teams.

Benefits include:

  • Lower privilege level: You maintain more control of your own AWS account, and you don’t need to grant Databricks as many permissions through the cross-account IAM role as you do for a Databricks-managed VPC. For example, Databricks does not need permission to create VPCs. This limited set of permissions can make it easier to get approval to use Databricks in your platform stack.
  • Simplified network operations: You get better network space utilization because you can optionally configure smaller subnets for a workspace, compared to the default CIDR /16. There is also no need for the complex VPC peering configurations that might be necessary with other solutions.
  • Consolidation of VPCs: Multiple Databricks workspaces can share a single data plane VPC, which is often preferred for billing and instance management.
  • Limit outgoing connections: By default, the data plane does not limit outgoing connections from Databricks Runtime workers. For workspaces that are configured to use a customer-managed VPC, you can use an egress firewall or proxy appliance to limit outbound traffic to a list of allowed internal or external data sources.

To take advantage of customer-managed VPC, you must specify that VPC when you first create the Databricks workspace. You cannot move an existing Databricks-managed workspace to your own VPC. Nor can you move an existing workspace from one of your own VPCs to another.

Configure a workspace to use your own VPC

To deploy a workspace in your own VPC, you must:

  1. Create the VPC following the requirements enumerated in VPC requirements.

  2. Register your VPC network configuration with Databricks when you create the workspace using the Account API. See Step 3: Configure customer-managed VPC (optional).

    You will need to provide the VPC ID, subnet IDs, and security group ID during registration.
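As a rough sketch of what that registration might carry, the snippet below assembles a network configuration payload from the three identifiers. The helper name and the field names (`network_name`, `vpc_id`, `subnet_ids`, `security_group_ids`) are assumptions for illustration; check the Account API reference for the authoritative schema. The count checks mirror the subnet and security group limits described later in this article.

```python
import json


def build_network_config(name, vpc_id, subnet_ids, security_group_ids):
    """Assemble a hypothetical network configuration payload (field names assumed)."""
    if len(subnet_ids) < 2:
        raise ValueError("at least two subnets are required")
    if not 1 <= len(security_group_ids) <= 5:
        raise ValueError("between one and five security groups are required")
    return {
        "network_name": name,
        "vpc_id": vpc_id,
        "subnet_ids": list(subnet_ids),
        "security_group_ids": list(security_group_ids),
    }


# Placeholder identifiers; substitute the IDs of your own resources.
payload = build_network_config(
    "my-workspace-network",
    "vpc-0123456789abcdef0",
    ["subnet-aaa", "subnet-bbb"],
    ["sg-ccc"],
)
print(json.dumps(payload, indent=2))
```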

VPC requirements

Your VPC must meet the requirements described in this section in order to host a Databricks workspace.

VPC region

Workspace data plane VPCs can be in AWS regions ap-northeast-1, ap-south-1, ap-southeast-2, ca-central-1, eu-west-1, eu-central-1, us-east-1, us-east-2, us-west-1, and us-west-2. However, you cannot use a VPC in us-west-1 if you want to use customer-managed keys to encrypt notebooks.

VPC sizing

You can share one VPC with multiple workspaces in a single AWS account. However, you cannot reuse subnets or security groups between workspaces. Be sure to size your VPC and subnets accordingly. Databricks assigns two IP addresses per node, one for management traffic and one for Apache Spark applications. The total number of instances for each subnet is equal to half the number of IP addresses that are available. Learn more in Subnets.

VPC IP address ranges

VPC CIDRs 10.2.0.0/16 and 10.3.0.0/16 are used for the internal container network for clusters. Therefore you cannot use them to configure a workspace VPC.

Important

If you have configured secondary CIDR blocks for your VPC, make sure that the subnets for the Databricks workspace are configured with the same VPC CIDR block.
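The two address-range rules above can be checked before you register the VPC. The following is a minimal sketch using Python's standard `ipaddress` module. The function names are hypothetical, and treating any overlap with the reserved ranges as a conflict is a conservative assumption rather than a documented rule.

```python
import ipaddress

# Ranges reserved for the internal container network, as listed above.
RESERVED = [
    ipaddress.ip_network("10.2.0.0/16"),
    ipaddress.ip_network("10.3.0.0/16"),
]


def vpc_cidr_is_usable(vpc_cidr):
    """Return True if the CIDR does not overlap a reserved range (assumed check)."""
    net = ipaddress.ip_network(vpc_cidr)
    return not any(net.overlaps(reserved) for reserved in RESERVED)


def subnets_share_one_block(vpc_cidr_blocks, subnet_cidrs):
    """Return True if every workspace subnet lies inside the same VPC CIDR block."""
    blocks = [ipaddress.ip_network(b) for b in vpc_cidr_blocks]
    subnets = [ipaddress.ip_network(s) for s in subnet_cidrs]
    return any(all(s.subnet_of(b) for s in subnets) for b in blocks)


print(vpc_cidr_is_usable("10.0.0.0/16"))  # True
print(vpc_cidr_is_usable("10.2.0.0/16"))  # False
```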

DNS

The VPC must have DNS hostnames and DNS resolution enabled.
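One way to enable both settings is through the EC2 `ModifyVpcAttribute` operation. The sketch below assumes a boto3 EC2 client is passed in; the helper name is hypothetical. EC2 accepts only one attribute per call, so the helper makes two calls and enables DNS resolution first.

```python
def enable_vpc_dns(ec2_client, vpc_id):
    """Enable DNS resolution and DNS hostnames on a VPC.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    # EC2 accepts only one attribute per ModifyVpcAttribute call.
    ec2_client.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
    ec2_client.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})


# Usage (requires boto3 and AWS credentials; the VPC ID is a placeholder):
#   import boto3
#   enable_vpc_dns(boto3.client("ec2", region_name="us-east-1"), "vpc-0123456789abcdef0")
```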

Subnets

Databricks must have access to at least two subnets for each workspace, with each subnet in a different Availability Zone. You cannot specify more than one Databricks workspace subnet per Availability Zone in the Create network configuration API call. You can have more than one subnet per Availability Zone as part of your network setup, but you can choose only one subnet per Availability Zone for the Databricks workspace.
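To illustrate the rule, the following sketch checks a proposed subnet selection before you submit it. The function name and the input shape (a mapping of subnet ID to Availability Zone name) are assumptions made for this example.

```python
from collections import Counter


def validate_subnet_zones(subnet_to_az):
    """Return a list of problems with the chosen workspace subnets.

    subnet_to_az maps a subnet ID to its Availability Zone name.
    """
    errors = []
    if len(subnet_to_az) < 2:
        errors.append("need at least two subnets")
    counts = Counter(subnet_to_az.values())
    for az, count in sorted(counts.items()):
        if count > 1:
            errors.append(f"{count} subnets in {az}; only one per Availability Zone is allowed")
    return errors


print(validate_subnet_zones({"subnet-aaa": "us-east-1a", "subnet-bbb": "us-east-1b"}))  # []
```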

Databricks assigns two IP addresses per node, one for management traffic and one for Spark applications. The total number of instances for each subnet is equal to half of the number of IP addresses that are available.

Each subnet must have a netmask between /17 and /26.
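The sizing arithmetic can be sketched as follows. The example assumes that "available" addresses means the subnet size minus the five addresses AWS reserves in every subnet; the function name is hypothetical.

```python
import ipaddress

# AWS reserves five addresses in every subnet. Whether the sizing rule above
# already accounts for them is an assumption of this sketch.
AWS_RESERVED_PER_SUBNET = 5


def max_nodes(subnet_cidr):
    """Estimate how many cluster nodes fit in a subnet at two IP addresses per node."""
    net = ipaddress.ip_network(subnet_cidr)
    if not 17 <= net.prefixlen <= 26:
        raise ValueError("subnet netmask must be between /17 and /26")
    available = net.num_addresses - AWS_RESERVED_PER_SUBNET
    return available // 2


print(max_nodes("10.0.1.0/26"))  # 29
print(max_nodes("10.0.1.0/24"))  # 125
```

A /26 subnet has 64 addresses, 59 of which are usable, so it holds at most 29 nodes under these assumptions.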

Important

The subnets that you specify for a customer-managed VPC must be reserved for one Databricks workspace only. You cannot share these subnets with any other resources, including other Databricks workspaces.

Additional subnet requirements

  • Subnets must be private.
  • Subnets must have outbound access to the public network using a NAT gateway and internet gateway, or other similar customer-managed appliance infrastructure.
  • The NAT gateway must be set up in its own subnet that routes quad-zero (0.0.0.0/0) traffic to an internet gateway or other customer-managed appliance infrastructure.

Important

A workspace using secure cluster connectivity (the default after September 1, 2020) must have outbound access from the VPC to the public network.

Subnet route table

The route table for workspace subnets must have quad-zero (0.0.0.0/0) traffic that targets the appropriate network device. If the workspace uses secure cluster connectivity (which is the default for new workspaces after September 1, 2020), quad-zero traffic must target a NAT Gateway or your own managed NAT device or proxy appliance.

Note

Databricks requires that the workspace subnets allow traffic to 0.0.0.0/0. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See Firewall appliance infrastructure.

This is a base guideline only. Your configuration requirements may differ. For questions, contact your Databricks representative.

Security groups

Databricks must have access to at least one AWS security group and no more than five security groups. You can reuse existing security groups rather than create new ones.

Security groups must have the following rules:

Egress (outbound): All allowed

Ingress (inbound): Required for all workspaces (these can be separate rules or combined into one):

  • Allow TCP on all ports when traffic source uses the same security group
  • Allow UDP on all ports when traffic source uses the same security group
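The two ingress rules can be expressed as self-referencing permissions. The sketch below builds them in the `IpPermissions` shape that the EC2 `AuthorizeSecurityGroupIngress` operation accepts; the helper name is hypothetical.

```python
def self_referencing_ingress(security_group_id):
    """Build TCP and UDP all-port ingress rules whose source is the same security group."""
    return [
        {
            "IpProtocol": protocol,
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }
        for protocol in ("tcp", "udp")
    ]


# Usage (requires boto3 and AWS credentials; the group ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0",
#       IpPermissions=self_referencing_ingress("sg-0123456789abcdef0"),
#   )
```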

Subnet-level network ACLs

Subnet-level network ACLs must not deny ingress or egress to any traffic. Databricks validates for the following rules while creating the workspace:

  • ALLOW ALL from Source 0.0.0.0/0
  • ALLOW ALL to Destination 0.0.0.0/0

Note

Databricks requires subnet-level network ACLs to allow traffic to and from 0.0.0.0/0. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See Firewall appliance infrastructure.
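To pre-check a network ACL yourself, you could inspect its entries as returned by the EC2 `DescribeNetworkAcls` operation. The sketch below is not the validation that Databricks performs. It assumes that an ACL is acceptable when the lowest-numbered rule in each direction allows all protocols for 0.0.0.0/0, because network ACL rules are evaluated in ascending order and the first match wins.

```python
def allows_all(entries, egress):
    """Check that the first-evaluated rule in one direction is ALLOW ALL for 0.0.0.0/0.

    entries uses the shape of the "Entries" list from describe_network_acls.
    """
    rules = sorted(
        (e for e in entries if e["Egress"] == egress),
        key=lambda e: e["RuleNumber"],
    )
    if not rules:
        return False
    first = rules[0]
    return (
        first["RuleAction"] == "allow"
        and first["Protocol"] == "-1"  # "-1" means all protocols
        and first.get("CidrBlock") == "0.0.0.0/0"
    )


def acl_is_open(entries):
    """Return True if both ingress and egress pass the check above."""
    return allows_all(entries, egress=False) and allows_all(entries, egress=True)
```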

Firewall appliance infrastructure

If you are using secure cluster connectivity (the default as of September 1, 2020), use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to:

  • If the firewall or proxy appliance is in the same VPC as the Databricks workspace VPC, route the traffic and configure it to allow the following connections.
  • If the firewall or proxy appliance is in a different VPC or an on-premises network, route 0.0.0.0/0 to that VPC or network first and configure the proxy appliance to allow the following connections.

Allow the following connections:

  • Databricks web application: Required. Also used for REST API calls to your workspace.
  • Databricks secure cluster connectivity (SCC) relay: Required if your workspace uses secure cluster connectivity, which is the default for workspaces in accounts on the E2 version of the platform as of September 1, 2020.
  • AWS S3 global URL: Required by Databricks to access the root S3 bucket.
  • AWS S3 regional URL: Optional. However, you likely use other S3 buckets, in which case you must also allow the S3 regional endpoint. Databricks recommends creating an S3 VPC endpoint instead so that this traffic goes through the private tunnel over the AWS network backbone.
  • AWS STS global URL: Required.
  • AWS STS regional URL: Required because of an expected switch to the regional endpoint.
  • AWS Kinesis regional URL: The Kinesis endpoint is used to capture logs needed to manage and monitor the software. For most regions, use the regional URL. However, for VPCs in us-west-1, a Kinesis VPC endpoint currently has no effect, and you must ensure that the Kinesis URL is allowed for us-west-2 (not us-west-1). Databricks recommends that you create a Kinesis VPC endpoint instead so that this traffic goes through the private tunnel over the AWS network backbone.
  • Table metastore RDS regional URL (by data plane region): Required if your Databricks workspace uses the default Hive metastore, which is always in the same region as your data plane. This means that it might be in the same geography as the control plane but in a different region. Instead of using the default Hive metastore, you can choose to implement your own table metastore instance, in which case you are responsible for its network routing.

Allow the following connections by data plane region:

Endpoint | VPC region | Address | Port
Webapp | ap-northeast-1 | tokyo.cloud.databricks.com | 443
Webapp | ap-south-1 | mumbai.cloud.databricks.com | 443
Webapp | ap-southeast-2 | sydney.cloud.databricks.com | 443
Webapp | ca-central-1 | canada.cloud.databricks.com | 443
Webapp | eu-central-1 | frankfurt.cloud.databricks.com | 443
Webapp | eu-west-1 | ireland.cloud.databricks.com | 443
Webapp | us-east-1 | nvirginia.cloud.databricks.com | 443
Webapp | us-east-2 | ohio.cloud.databricks.com | 443
Webapp | us-west-1 | oregon.cloud.databricks.com | 443
Webapp | us-west-2 | oregon.cloud.databricks.com | 443
SCC relay | ap-northeast-1 | tunnel.ap-northeast-1.cloud.databricks.com | 443
SCC relay | ap-south-1 | tunnel.ap-south-1.cloud.databricks.com | 443
SCC relay | ap-southeast-2 | tunnel.ap-southeast-2.cloud.databricks.com | 443
SCC relay | ca-central-1 | tunnel.ca-central-1.cloud.databricks.com | 443
SCC relay | eu-central-1 | tunnel.eu-central-1.cloud.databricks.com | 443
SCC relay | eu-west-1 | tunnel.eu-west-1.cloud.databricks.com | 443
SCC relay | us-east-1 | tunnel.us-east-1.cloud.databricks.com | 443
SCC relay | us-east-2 | tunnel.us-east-2.cloud.databricks.com | 443
SCC relay | us-west-1 | tunnel.cloud.databricks.com | 443
SCC relay | us-west-2 | tunnel.cloud.databricks.com | 443
S3 global for root bucket | all | s3.amazonaws.com | 443
S3 regional for other buckets: Databricks recommends a VPC endpoint instead | all | s3.<region-name>.amazonaws.com | 443
STS global | all | sts.amazonaws.com | 443
Kinesis: Databricks recommends a VPC endpoint instead | Most regions | kinesis.<region-name>.amazonaws.com | 443
Kinesis: Databricks recommends a VPC endpoint instead | us-west-1 | kinesis.us-west-2.amazonaws.com | 443
RDS (if using built-in metastore) | ap-northeast-1 | mddx5a4bpbpm05.cfrfsun7mryq.ap-northeast-1.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | ap-south-1 | mdjanpojt83v6j.c5jml0fhgver.ap-south-1.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | ap-southeast-2 | mdnrak3rme5y1c.c5f38tyb1fdu.ap-southeast-2.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | ca-central-1 | md1w81rjeh9i4n5.co1tih5pqdrl.ca-central-1.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | eu-central-1 | mdv2llxgl8lou0.ceptxxgorjrc.eu-central-1.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | eu-west-1 | md15cf9e1wmjgny.cxg30ia2wqgj.eu-west-1.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | us-east-1 | mdb7sywh50xhpr.chkweekm4xjq.us-east-1.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | us-east-2 | md7wf1g369xf22.cluz8hwxjhb6.us-east-2.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | us-west-1 | mdzsbtnvk0rnce.c13weuwubexq.us-west-1.rds.amazonaws.com | 3306
RDS (if using built-in metastore) | us-west-2 | mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com | 3306
Databricks control plane infrastructure | ap-northeast-1 | 35.72.28.0/28 | 443
Databricks control plane infrastructure | ap-south-1 | 65.0.37.64/28 | 443
Databricks control plane infrastructure | ap-southeast-2 | 3.26.4.0/28 | 443
Databricks control plane infrastructure | ca-central-1 | 3.96.84.208/28 | 443
Databricks control plane infrastructure | eu-west-1 | 3.250.244.112/28 | 443
Databricks control plane infrastructure | eu-central-1 | 18.159.44.32/28 | 443
Databricks control plane infrastructure | us-east-1 | 3.237.73.224/28 | 443
Databricks control plane infrastructure | us-east-2 | 3.128.237.208/28 | 443
Databricks control plane infrastructure | us-west-1 and us-west-2 | 44.234.192.32/28 | 443
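If you generate firewall rules from code, a lookup like the following sketch may help. It covers only the web application, SCC relay, S3, STS, and Kinesis rows of the table; the RDS hostnames and control plane CIDR ranges are omitted for brevity and must be added from the table. The function name is hypothetical, and the regional STS hostname format is an assumption based on the instance profile configuration later in this article, because the table does not list it.

```python
# Web application hostnames by data plane region, copied from the table above.
WEBAPP = {
    "ap-northeast-1": "tokyo.cloud.databricks.com",
    "ap-south-1": "mumbai.cloud.databricks.com",
    "ap-southeast-2": "sydney.cloud.databricks.com",
    "ca-central-1": "canada.cloud.databricks.com",
    "eu-central-1": "frankfurt.cloud.databricks.com",
    "eu-west-1": "ireland.cloud.databricks.com",
    "us-east-1": "nvirginia.cloud.databricks.com",
    "us-east-2": "ohio.cloud.databricks.com",
    "us-west-1": "oregon.cloud.databricks.com",
    "us-west-2": "oregon.cloud.databricks.com",
}

# The two US West regions share one relay hostname without a region label.
SCC_RELAY_OVERRIDES = {
    "us-west-1": "tunnel.cloud.databricks.com",
    "us-west-2": "tunnel.cloud.databricks.com",
}


def egress_allow_list(region):
    """Return (host, port) pairs to allow for a data plane region (partial list)."""
    if region not in WEBAPP:
        raise KeyError(f"unsupported region: {region}")
    relay = SCC_RELAY_OVERRIDES.get(region, f"tunnel.{region}.cloud.databricks.com")
    # Kinesis streams for us-west-1 workspaces live in us-west-2.
    kinesis_region = "us-west-2" if region == "us-west-1" else region
    return [
        (WEBAPP[region], 443),
        (relay, 443),
        ("s3.amazonaws.com", 443),
        (f"s3.{region}.amazonaws.com", 443),
        ("sts.amazonaws.com", 443),
        (f"sts.{region}.amazonaws.com", 443),  # hostname format assumed
        (f"kinesis.{kinesis_region}.amazonaws.com", 443),
    ]


for host, port in egress_allow_list("eu-west-1"):
    print(f"{host}:{port}")
```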

Regional endpoints

If you use a customer-managed VPC (optional) and secure cluster connectivity (the default as of September 1, 2020), you may prefer to configure your VPC to use only regional VPC endpoints to AWS services for more direct connections and reduced cost compared to AWS global endpoints. There are four AWS services that a Databricks workspace with a customer-managed VPC must reach: STS, S3, Kinesis, and RDS.

The connection from your VPC to the RDS service is required only if you use the default Databricks metastore. There is no VPC endpoint for RDS. However, instead of using the default Databricks metastore, you can configure your own external metastore, implemented with Hive metastore or AWS Glue.

For the other three services, you can create VPC gateway or interface endpoints such that the relevant in-region traffic from clusters could transit over the secure AWS backbone rather than the public network:

  • S3: Create a VPC gateway endpoint that is directly accessible from your Databricks workspace cluster subnets. This causes workspace traffic to all in-region S3 buckets to use the endpoint route. To access any cross-region buckets, open up access to S3 global URL s3.amazonaws.com in your egress appliance, or route 0.0.0.0/0 to an AWS internet gateway.

    To use DBFS FUSE with regional endpoints enabled:

    • For Databricks Runtime 6.x and above, you must set up an environment variable in the cluster configuration to set AWS_REGION=<aws-region-code>. For example, if your workspace is deployed in the N. Virginia region, set AWS_REGION=us-east-1. To enforce it for all clusters, use cluster policies.
    • For Databricks Runtime 5.5 LTS, set both AWS_REGION=<aws-region-code> and DBFS_FUSE_VERSION=2.
  • STS: Create a VPC interface endpoint directly accessible from your Databricks workspace cluster subnets. You can create this endpoint in your workspace subnets. Databricks recommends that you use the same security group that was created for your workspace VPC. This configuration causes workspace traffic to STS to use the endpoint route.

  • Kinesis: Create a VPC interface endpoint directly accessible from your Databricks workspace cluster subnets. You can create this endpoint in your workspace subnets. Databricks recommends that you use the same security group that was created for your workspace VPC. This configuration causes workspace traffic to Kinesis to use the endpoint route. The one exception is workspaces in the AWS region us-west-1: their target Kinesis streams are in us-west-2, so that traffic is cross-region and does not use the endpoint.
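The three endpoints described above could be created with the EC2 `CreateVpcEndpoint` operation. The sketch below only builds the request parameters so that you can review them first. The helper name is hypothetical, and enabling private DNS on the interface endpoints is an assumption made so that the default service hostnames resolve to the endpoints. The Kinesis endpoint is skipped for us-west-1 because of the cross-region exception noted above.

```python
def endpoint_requests(region, vpc_id, route_table_ids, subnet_ids, security_group_id):
    """Build create_vpc_endpoint keyword arguments for S3, STS, and Kinesis."""
    requests = [
        {
            "VpcEndpointType": "Gateway",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.s3",
            "RouteTableIds": list(route_table_ids),
        }
    ]
    interface_services = ["sts"]
    if region != "us-west-1":
        # us-west-1 workspaces send Kinesis traffic to us-west-2, so an
        # in-region Kinesis endpoint would not be used.
        interface_services.append("kinesis-streams")
    for service in interface_services:
        requests.append(
            {
                "VpcEndpointType": "Interface",
                "VpcId": vpc_id,
                "ServiceName": f"com.amazonaws.{region}.{service}",
                "SubnetIds": list(subnet_ids),
                "SecurityGroupIds": [security_group_id],
                "PrivateDnsEnabled": True,  # assumption; see the lead-in
            }
        )
    return requests


# Usage (requires boto3 and AWS credentials; all IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   for kwargs in endpoint_requests("us-east-1", "vpc-0123", ["rtb-0123"],
#                                   ["subnet-aaa", "subnet-bbb"], "sg-0123"):
#       ec2.create_vpc_endpoint(**kwargs)
```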

Troubleshoot regional endpoints

If the VPC endpoints do not work as intended, for example if your data sources are inaccessible or if the traffic is bypassing the endpoints, use one of the following approaches:

  1. Add the environment variable AWS_REGION in the cluster configuration and set it to your AWS region. To enable it for all clusters, use cluster policies. You may have already configured this environment variable to use DBFS FUSE.

  2. Add the following Apache Spark configuration flags in each source notebook or in the Apache Spark config in cluster configuration:

    • spark.fs.s3a.endpoint https://s3.<region>.amazonaws.com
    • spark.fs.s3a.stsAssumeRole.stsEndpoint https://sts.<region>.amazonaws.com

    To set these flags for all clusters, make sure to configure the flags as part of your cluster policies.
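A cluster policy that pins these settings might look like the following sketch, which generates the policy definition as JSON. The attribute paths (`spark_env_vars.<name>`, `spark_conf.<key>`) and the `fixed` type follow the cluster policy definition format as understood here; treat them as assumptions and confirm them against the cluster policies reference. The policy combines both approaches for illustration, although either one may be enough. The STS value uses the sts.<region> hostname, and the helper name is hypothetical.

```python
import json


def regional_endpoint_policy(region):
    """Build a cluster policy definition that fixes regional endpoint settings."""
    return {
        "spark_env_vars.AWS_REGION": {"type": "fixed", "value": region},
        "spark_conf.spark.fs.s3a.endpoint": {
            "type": "fixed",
            "value": f"https://s3.{region}.amazonaws.com",
        },
        "spark_conf.spark.fs.s3a.stsAssumeRole.stsEndpoint": {
            "type": "fixed",
            "value": f"https://sts.{region}.amazonaws.com",
        },
    }


print(json.dumps(regional_endpoint_policy("us-east-1"), indent=2))
```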

(Optional) Restrict outbound access

If you perform this optional step to restrict outbound access, be aware that your workspace does not have cross-region access to AWS S3.

To restrict outbound access:

  1. Explicitly allow connectivity for the web application, secure cluster connectivity relay, and metastore endpoints in your egress routing infrastructure, which might consist of a NAT Gateway or a third-party appliance with internet connectivity. You can find the URLs in this article in the section Firewall appliance infrastructure.

    Important

    Databricks strongly recommends that you specify destinations as domain names in your egress infrastructure, rather than as IP addresses.

  2. Remove the destination 0.0.0.0/0 with the NAT Gateway target. You might not have this destination, depending on how you created your VPC. If you do not have it, skip this step.

(Optional) Access S3 mounts using instance profiles

To access S3 mounts using instance profiles, set the following Spark configurations. You can set these at either the notebook or cluster level:

sc.hadoopConfiguration.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")

Tip

To enable this Spark configuration consistently across your workspace, create a cluster policy.

Warning

For the S3 service, there are limitations to applying additional regional endpoint configurations at the notebook or cluster level. Notably, cross-region S3 access is blocked, even if the global S3 URL is allowed in your egress firewall or proxy. If your Databricks deployment might require cross-region S3 access, do not apply the Spark configuration at the notebook or cluster level.