Customer-managed VPC

Important

This feature is not available on all Databricks deployment types and subscriptions. To deploy a workspace to a customer-managed VPC, you must create the workspace using the Multi-workspace API. Contact your Databricks representative to request access.

Overview

By default, workspace clusters are created in a single AWS VPC (Virtual Private Cloud) that Databricks creates and configures in your AWS account. You can optionally create your Databricks workspaces in your own VPC instead, a feature known as customer-managed VPC. A customer-managed VPC gives you more control over the infrastructure and can help you comply with specific cloud security and governance standards your organization may require.

A customer-managed VPC is a good solution if you have:

  • Information security policies that prevent PaaS providers from creating VPCs in your own AWS account.
  • An approval process to create a new VPC, in which the VPC is configured and secured in a well-documented way by internal information security or cloud engineering teams.

Benefits include:

  • Lower privilege level: You maintain more control of your own AWS account, and you don’t need to grant Databricks as many permissions via the cross-account IAM role as you do for a Databricks-managed VPC. For example, there is no need for permission to create VPCs. This limited set of permissions can make it easier to get approval to use Databricks in your platform stack.
  • Simplified network operations: Better network space utilization. You can optionally configure smaller subnets for a workspace than the default /16 CIDR, and there is no need for the complex VPC peering configurations that might be necessary with other solutions.
  • Consolidation of VPCs: Multiple Databricks workspaces can share a single data plane VPC, which is often preferred for billing and instance management.
  • Limit outgoing connections: By default, the data plane does not limit outgoing connections from Databricks Runtime workers. For workspaces that are configured to use a customer-managed VPC, you can use an egress firewall or proxy appliance to limit outbound traffic to a list of allowed internal or external data sources.

To take advantage of a customer-managed VPC, you must specify that VPC when you first create the Databricks workspace. You cannot move an existing Databricks-managed workspace to your own VPC. Nor can you move an existing workspace from one of your own VPCs to another.

Configure a workspace to use your own VPC

To deploy a workspace in your own VPC, you must:

  1. Create the VPC following the requirements enumerated in VPC requirements.

  2. Register your VPC network configuration with Databricks when you create the workspace using the Multi-workspace API. See Step 3: Configure customer-managed VPC (optional).

    You will need to provide the VPC ID, subnet IDs, and security group ID during registration.
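
    For reference, a registration call generally takes a form like the following sketch. The endpoint path, authentication style, and field names here are assumptions based on the account-level (Multi-workspace) API, and all IDs and credentials are placeholders; confirm the details against the Multi-workspace API documentation.

    curl -X POST \
      -u "<account-owner-email>:<password>" \
      -H "Content-Type: application/json" \
      "https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/networks" \
      -d '{
        "network_name": "my-databricks-network",
        "vpc_id": "<vpc-id>",
        "subnet_ids": ["<subnet-id-1>", "<subnet-id-2>"],
        "security_group_ids": ["<security-group-id>"]
      }'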

VPC requirements

Your VPC must meet the requirements described in this section in order to host a Databricks workspace.

Important

If your workspace will use secure cluster connectivity, sometimes called No Public IP (NPIP), the requirements are slightly different. These differences are specified throughout this requirements section. Before you configure your VPC, determine if your workspace will use this feature. Secure cluster connectivity is available only for some deployment types, subscriptions, and tiers. For questions, contact your Databricks representative.

VPC sizing

You can share one VPC with multiple workspaces in a single AWS account. However, you cannot reuse subnets or security groups between workspaces. Be sure to size your VPC and subnets accordingly. Databricks assigns two IP addresses per node, one for management traffic and one for Apache Spark applications. The total number of instances for each subnet is equal to half of the number of IP addresses that are available. Learn more in Subnets.
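
For example, at two IP addresses per node and with AWS reserving five addresses in each subnet, a /25 subnet (128 addresses) supports roughly 60 nodes. To check how much room remains in an existing subnet, you can query its available address count with the AWS CLI; the subnet ID is a placeholder:

aws ec2 describe-subnets \
  --subnet-ids <subnet-id> \
  --query "Subnets[].AvailableIpAddressCount"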

Important

If you have configured secondary CIDR blocks for your VPC, make sure that the subnets for the Databricks workspace are configured with the same VPC CIDR block.

DNS

The VPC must have DNS hostnames and DNS resolution enabled.
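
If either attribute is disabled on an existing VPC, you can enable it with the AWS CLI (the VPC ID is a placeholder, and each attribute must be modified in a separate call):

aws ec2 modify-vpc-attribute --vpc-id <vpc-id> --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id <vpc-id> --enable-dns-hostnames "{\"Value\":true}"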

Subnets

Databricks must have access to at least two subnets for each workspace, with each subnet in a different Availability Zone. You cannot specify more than one Databricks workspace subnet per Availability Zone in the Create network configuration API call. You can have more than one subnet per Availability Zone as part of your network setup, but you can choose only one subnet per Availability Zone for the Databricks workspace.

Databricks assigns two IP addresses per node, one for management traffic and one for Spark applications. The total number of instances for each subnet is equal to half of the number of IP addresses that are available.

Each subnet must have a netmask between /17 and /25.

Important

The subnets that you specify for a customer-managed VPC must be reserved for one Databricks workspace only. You cannot share these subnets with any other resources, including other Databricks workspaces.

Additional requirements for secure cluster connectivity

If you plan to use secure cluster connectivity (sometimes called No Public IP or NPIP) for the Databricks workspace:

  • Subnets must be private.
  • Subnets must have outbound access to the public network using a NAT gateway and internet gateway, or other similar customer-managed appliance infrastructure.
  • The NAT gateway must be set up in its own subnet that routes quad-zero (0.0.0.0/0) traffic to an internet gateway or other customer-managed appliance infrastructure.

Important

A workspace using secure cluster connectivity must have outbound access from the VPC to the public network.

Subnet route table

The route table for workspace subnets must have a quad-zero (0.0.0.0/0) route that targets the appropriate network device:

  • For secure cluster connectivity: quad-zero traffic must target a NAT Gateway.
  • For workspaces without secure cluster connectivity: quad-zero traffic must target the Internet Gateway directly.
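
For example, the following AWS CLI sketch adds the quad-zero route for a workspace that uses secure cluster connectivity; the IDs are placeholders, and a workspace without secure cluster connectivity would pass --gateway-id <internet-gateway-id> instead of --nat-gateway-id:

aws ec2 create-route \
  --route-table-id <workspace-route-table-id> \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id <nat-gateway-id>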

Note

Databricks requires that your subnets allow outbound traffic to 0.0.0.0/0 (all destinations). To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See Firewall appliance infrastructure (for secure cluster connectivity only).

This is a base guideline only. Your configuration requirements may differ. For questions, contact your Databricks representative.

Security groups

Databricks must have access to at least one AWS security group and no more than five security groups. You can reuse existing security groups rather than create new ones.

Security groups must have the following rules:

Egress (outbound): All allowed

Ingress (inbound): Required for all workspaces (these can be separate rules or combined into one):

  • Allow TCP on all ports when traffic source uses the same security group
  • Allow UDP on all ports when traffic source uses the same security group

Also required if you are not using secure cluster connectivity:

  • Allow TCP on ports 22 and 5557 from the following Databricks control plane IP addresses:
    • Workspace data plane (VPC) regions us-west-1 and us-west-2:
      • 52.27.216.188/32: Control plane NAT instance
      • 52.35.116.98/32: Legacy NAT instance
      • 52.88.31.94/32: Legacy NAT instance
    • Workspace data plane (VPC) region us-east-1:
      • 54.156.226.103/32: Control plane NAT instance
    • Workspace data plane (VPC) region us-east-2:
      • 18.221.200.169/32: Control plane NAT instance
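
As a sketch of the self-referencing ingress rules listed above (the security group ID is a placeholder; a newly created security group already allows all egress by default, which satisfies the egress requirement):

# Allow TCP and UDP on all ports when the traffic source is the same security group
aws ec2 authorize-security-group-ingress \
  --group-id <security-group-id> \
  --protocol tcp --port 0-65535 \
  --source-group <security-group-id>

aws ec2 authorize-security-group-ingress \
  --group-id <security-group-id> \
  --protocol udp --port 0-65535 \
  --source-group <security-group-id>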

Subnet-level network ACLs

Subnet-level network ACLs must not deny ingress or egress to any traffic. Databricks validates that the following rules are in place when it creates the workspace:

  • ALLOW ALL from Source 0.0.0.0/0
  • ALLOW ALL to Destination 0.0.0.0/0
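
To review the network ACL that is associated with a workspace subnet before you create the workspace, you can list its entries with the AWS CLI; the subnet ID is a placeholder:

aws ec2 describe-network-acls \
  --filters Name=association.subnet-id,Values=<subnet-id> \
  --query "NetworkAcls[].Entries"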

Note

Databricks requires that subnet-level network ACLs allow traffic to and from 0.0.0.0/0 (all addresses). To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See Firewall appliance infrastructure (for secure cluster connectivity only).

Firewall appliance infrastructure (for secure cluster connectivity only)

If you are using secure cluster connectivity, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to:

  • If the firewall or proxy appliance is in the same VPC as the Databricks workspace VPC, route the traffic and configure it to allow the following connections.
  • If the firewall or proxy appliance is in a different VPC or an on-premises network, route 0.0.0.0/0 to that VPC or network first and configure the firewall or proxy appliance to allow the following connections.

Allow the following connections:

  • Databricks web application: Required. Also used for REST API calls to your workspace.
  • Databricks secure cluster connectivity (SCC) relay: Required if your workspace uses secure cluster connectivity.
  • AWS S3 global URL: Required by Databricks to access the root S3 bucket.
  • AWS S3 regional URL: Optional. However, you likely use other S3 buckets, in which case you must also allow the S3 regional endpoint. Databricks recommends creating an S3 VPC endpoint instead so that this traffic goes through the private tunnel over the AWS network backbone.
  • AWS STS global URL: Required.
  • AWS STS regional URL: Required because Databricks is expected to switch to the regional endpoint.
  • AWS Kinesis regional URL: The Kinesis endpoint is used to capture logs needed to manage and monitor the software. For most regions, use the regional URL. However, for VPCs in us-west-1, the VPC endpoint does not yet take effect, so ensure that the Kinesis URL for us-west-2 (not us-west-1) is allowed. Databricks recommends that you create a Kinesis VPC endpoint instead so that this traffic goes through the private tunnel over the AWS network backbone.
  • Table metastore RDS regional URL (by data plane region): Required if your Databricks workspace uses the default Hive metastore, which is always in the same region as your data plane. This means it might be in the same geography as, but a different region from, the control plane. Instead of using the default Hive metastore, you can implement your own table metastore instance, in which case you are responsible for its network routing.

Allow the following connections by data plane region:

In the following lists, the region is your workspace data plane (VPC) region.

Webapp (port 443):
  • us-east-1: nvirginia.cloud.databricks.com
  • us-east-2: ohio.cloud.databricks.com
  • us-west-1: oregon.cloud.databricks.com
  • us-west-2: oregon.cloud.databricks.com

SCC relay (port 443):
  • us-east-1: tunnel.us-east-1.cloud.databricks.com
  • us-east-2: tunnel.us-east-2.databricks.com
  • us-west-1: tunnel.cloud.databricks.com
  • us-west-2: tunnel.cloud.databricks.com

S3 global for root bucket (port 443):
  • all regions: s3.amazonaws.com

S3 regional for other buckets (port 443; Databricks recommends a VPC endpoint instead):
  • us-east-1: s3.us-east-1.amazonaws.com
  • us-east-2: s3.us-east-2.amazonaws.com
  • us-west-1: s3.us-west-1.amazonaws.com
  • us-west-2: s3.us-west-2.amazonaws.com

STS global (port 443):
  • all regions: sts.amazonaws.com

Kinesis (port 443; Databricks recommends a VPC endpoint instead):
  • us-east-1: kinesis.us-east-1.amazonaws.com
  • us-east-2: kinesis.us-east-2.amazonaws.com
  • us-west-1: kinesis.us-west-2.amazonaws.com
  • us-west-2: kinesis.us-west-2.amazonaws.com

RDS (metastore, port 3306):
  • us-east-1: mdb7sywh50xhpr.chkweekm4xjq.us-east-1.rds.amazonaws.com
  • us-east-2: md7wf1g369xf22.cluz8hwxjhb6.us-east-2.rds.amazonaws.com
  • us-west-1: mdzsbtnvk0rnce.c13weuwubexq.us-west-1.rds.amazonaws.com
  • us-west-2: mdpartyyphlhsp.caj77bnxuhme.us-west-2.rds.amazonaws.com

Databricks control plane infrastructure (port 443):
  • us-east-1: 3.237.73.224/28
  • us-east-2: 3.128.237.208/28
  • us-west-1 and us-west-2: 44.234.192.32/28

Regional endpoints (secure cluster connectivity only)

If you use a customer-managed VPC and secure cluster connectivity, you may prefer to configure your VPC to use only regional VPC endpoints to AWS services for more direct connections and reduced cost compared to AWS global endpoints. There are four AWS services that a Databricks workspace with a customer-managed VPC must reach: STS, S3, Kinesis, and RDS.

The connection from your VPC to the RDS service is required only if you use the default Databricks metastore. There is no VPC endpoint for RDS, but instead of using the default Databricks metastore you can configure your own external metastore, implemented with either an external Hive metastore or AWS Glue.

For the other three services, this section describes the network behavior using a VPC endpoint, how the behavior can change with Spark configuration at the notebook or cluster level, and how to configure these settings.

Warning

Read this article, including all warnings, carefully before configuring the VPC or applying Spark configurations. There are important limitations on cross-region S3 access if you apply the regional endpoint Spark configuration at the notebook or cluster level, or if you add the optional outbound traffic restrictions. See S3 regional VPC endpoint overview.

STS regional VPC endpoint overview

For data plane access to AWS STS:

  • By default, even if you configure an STS VPC endpoint, traffic bypasses it and egresses using the global STS URL. Extra configuration is required to ensure that all traffic goes through the VPC endpoint.

  • If you add a relevant Spark configuration at the notebook or cluster level, traffic egresses over the STS VPC endpoint.

    Note

    Adding a Spark configuration at a notebook or cluster level is not enforced automatically across your workspace. To enable a Spark configuration consistently across your workspace, create a cluster policy.

Databricks recommends that you configure your egress appliances to allow access to the STS global URL, because it might still be used temporarily. For example, after initial workspace creation but before you set up notebook or cluster Spark configurations and cluster policies, Databricks must use the global URL.

S3 regional VPC endpoint overview

For data plane access to S3 for artifact and log storage in the control plane, configuring a VPC endpoint has the following behavior:

  • For regions us-west-2, us-east-1, and us-east-2, this traffic uses the S3 VPC endpoint.
  • For region us-west-1, the S3 VPC endpoint is bypassed because the relevant endpoint is cross-region.

For data plane access to S3 for DBFS root storage and your own S3 data storage in your VPC region:

  • By default, even if you configure an S3 VPC endpoint, traffic bypasses it and egresses using the global S3 URL. Extra configuration is required to ensure that all traffic goes through the VPC endpoint.

  • If you add the relevant Spark configuration at the notebook or cluster level (setting the S3 endpoint property fs.s3a.endpoint to https://s3.<region>.amazonaws.com, as shown in the Spark configuration example later in this article), traffic egresses using the S3 VPC endpoint.

    Note

    Adding a Spark configuration at a notebook or cluster level is not enforced automatically across your workspace. To enable a Spark configuration consistently across your workspace, create a cluster policy.

    Warning

    For the S3 service, there are limitations to applying additional regional endpoint configurations at the notebook or cluster level. Notably, cross-region S3 access is blocked, even if the global S3 URL is allowed in your egress firewall or proxy.

Databricks recommends that you configure your egress appliances to allow access to the S3 global URL, because it might still be used temporarily. For example, after initial workspace creation but before you set up notebook or cluster Spark configurations and cluster policies, Databricks must use the global URL.

Kinesis regional VPC endpoint overview

For data plane access to Kinesis for logging and usage data endpoints in the control plane:

  • For regions us-west-2, us-east-1, and us-east-2, this traffic uses the Kinesis VPC endpoint.
  • For region us-west-1, the Kinesis VPC endpoint is bypassed because the relevant endpoint is cross-region.

For data plane access to Kinesis for a customer data stream in your VPC region:

  • This traffic uses the Kinesis VPC endpoint.
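
If you create the Kinesis VPC endpoint that Databricks recommends, the AWS CLI call is similar to the STS example later in this article. The following is a sketch only; the service name assumes the Kinesis Data Streams interface endpoint, and the IDs are placeholders:

aws ec2 create-vpc-endpoint \
  --vpc-id <vpc-id> \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.<region>.kinesis-streams \
  --subnet-ids <subnet-ids-list> \
  --security-group-ids <security-group-id-list>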

Set up regional endpoints

This section describes how to set up regional endpoints, with optional steps for restricting outbound access and adding Spark configurations at the notebook or cluster level.

Note

Databricks generally recommends that you perform most of these steps during initial VPC setup. However, the final step that discusses adding Spark configurations at the notebook or cluster level must be performed after workspace creation, so remember to complete those steps later. Alternatively, you can choose to perform all of these steps after workspace creation and initial testing.

Some steps specify actions that must be repeated for each subnet associated with your workspace.

Set up the S3 VPC gateway endpoint

You can use either the AWS console or AWS CLI tool.

The CLI command has the following form:

aws ec2 create-vpc-endpoint \
  --vpc-id <vpc-id> \
  --service-name com.amazonaws.<region>.s3 \
  --route-table-ids <workspace-route-table-id>

In the AWS console, your route table looks like the following:

S3 regional endpoint

Set up the STS VPC interface endpoint

You can use either the AWS console or AWS CLI tool.

Important

You must repeat this step for each subnet associated with your workspace.

The CLI command has the following form:

aws ec2 create-vpc-endpoint \
  --vpc-id <aws-vpc-id> \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.<aws-region>.sts \
  --subnet-ids <aws-subnet-ids-list> \
  --security-group-ids <aws-security-group-id-list>

For the AWS console, follow these steps:

  1. Navigate to https://console.aws.amazon.com/vpc.

  2. In the left navigation, click Endpoints, then click the Create Endpoint button.

  3. In the Service Name field, search for sts and select the one in your region.

  4. In the VPC field, choose the VPC associated with the Databricks deployment.

  5. In that same field, select your workspace’s subnets. For a production deployment, the subnets for the endpoint network interface must be different from your workspace’s subnets.

    Important

    AWS creates an endpoint network interface in each subnet. In this step, select only the subnets associated with your Databricks workspace.

    STS regional endpoint part 1
  6. Select the Enable DNS Name checkbox.

  7. Ensure that your VPC has DNS hostnames and DNS resolution enabled.

    In a separate browser tab, navigate to https://console.aws.amazon.com/vpc/. Click Your VPCs, then select the VPC associated with the Databricks deployment. In the Description tab, confirm that these two features are enabled. If either is disabled, click Actions and select either Edit DNS Resolution or Edit DNS Hostnames.

    See the AWS documentation for details.

  8. In the Security Group field, select the security group associated with your Databricks workspace.

  9. In the Policy field, select Full Access.

    STS regional endpoint part 2
  10. Because STS interface endpoints do not appear in the route table, confirm that the STS interface endpoint appears in the Endpoints section of the AWS VPC console. Check the Subnets, Security Groups, and Policy tabs for correct configuration:

    STS regional endpoint part 3
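
    You can also confirm the endpoint from the AWS CLI; the region is a placeholder:

    aws ec2 describe-vpc-endpoints \
      --filters Name=service-name,Values=com.amazonaws.<region>.sts \
      --query "VpcEndpoints[].{Id:VpcEndpointId,State:State,Subnets:SubnetIds}"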

(Optional) Restrict outbound access

Warning

If you perform this optional step to restrict outbound access, there are limitations. Be sure that you understand them before continuing:

  • DBFS FUSE v2 does not work; for example, running %sh ls /dbfs/ in a notebook fails. However, the dbutils tools still work.
  • There is no cross-region access to any AWS services. Typically, this is a limitation only if you need access to S3 buckets in other AWS regions.

To restrict outbound access:

  1. Explicitly allow connectivity for the web application, secure cluster connectivity relay, and metastore endpoints in your egress routing infrastructure, which might consist of a NAT gateway or a third-party appliance with internet connectivity. You can find the URLs in this article in the section Firewall appliance infrastructure (for secure cluster connectivity only).

    Important

    Databricks strongly recommends that you specify destinations as domain names in your egress infrastructure, rather than as IP addresses.

  2. Remove the destination 0.0.0.0/0 with the NAT Gateway target. You might not have this destination, depending on how you created your VPC. If you do not have it, skip this step.
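
    If the route is present, a sketch of removing it with the AWS CLI follows; the route table ID is a placeholder:

    aws ec2 delete-route \
      --route-table-id <workspace-route-table-id> \
      --destination-cidr-block 0.0.0.0/0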

(Optional) Access S3 mounts using instance profiles

To access S3 mounts using instance profiles, set the following Spark configurations. You can set these at either the notebook or cluster level:

sc.hadoopConfiguration.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")

Note

Adding a Spark configuration at a notebook or cluster level is not enforced automatically across your workspace. To enable a Spark configuration consistently across your workspace, create a cluster policy.

Warning

For the S3 service, there are limitations to applying additional regional endpoint configurations at the notebook or cluster level. Notably, cross-region S3 access is blocked, even if the global S3 URL is allowed in your egress firewall or proxy. If your Databricks deployment might require cross-region S3 access, do not apply the Spark configuration at the notebook or cluster level.