Databricks on AWS


Updated May 13, 2022


Enable AWS PrivateLink

Preview

This feature is in Public Preview.

This article explains how to use AWS PrivateLink to enable private connectivity between users and their Databricks workspaces, and between clusters on the data plane and core services on the control plane within the Databricks workspace infrastructure.

Important

This article mentions the term data plane, which is the compute layer of the Databricks platform. In the context of this article, data plane refers to the Classic data plane in your AWS account. By contrast, the Serverless data plane that supports Serverless SQL endpoints (Public Preview) runs in the Databricks AWS account. To learn more, see Serverless compute.

Overview

AWS PrivateLink provides private connectivity from AWS VPCs and on-premises networks to AWS services without exposing the traffic to the public network. Databricks workspaces on the E2 version of the platform support PrivateLink connections for two connection types:

  • Front-end (user to workspace): A front-end PrivateLink connection allows users to connect to the Databricks web application, REST API, and Databricks Connect API over a VPC interface endpoint.

  • Back-end (data plane to control plane): Databricks Runtime clusters in a customer-managed VPC (the data plane) connect to a Databricks workspace’s core services (the control plane) in the Databricks cloud account. Clusters connect to the control plane for two destinations: REST APIs (such as the Secrets API) and the secure cluster connectivity relay. This PrivateLink connection type involves two different VPC interface endpoints because of the two different destination services.

You can implement both front-end and back-end PrivateLink connections or just one of them. This article discusses how to configure either one or both PrivateLink connection types. If you implement PrivateLink for both the front-end and back-end connections, you can optionally mandate private connectivity for the workspace, which means Databricks rejects any connections over the public network. If you implement only one of these connection types, you cannot enforce this requirement.

To enable PrivateLink connections, you must create Databricks configuration objects and add new fields to existing configuration objects. Use the Account API 2.0 to register and update objects.

You can set up PrivateLink for a new or existing workspace on the E2 version of the platform that uses secure cluster connectivity.

Note

In this release, you cannot use the Account API to add PrivateLink to an existing workspace by updating its configuration. Contact your Databricks representative to perform this step for you.

The following list describes important terminology.

  • AWS PrivateLink: An AWS technology that provides private connectivity from AWS VPCs and on-premises networks to AWS services without exposing the traffic to the public network.

  • AWS VPC endpoint service: A PrivateLink-powered service. Each Databricks control plane (typically one per region) publishes two AWS VPC endpoint services for PrivateLink. The workspace VPC endpoint service applies to both the Databricks front-end PrivateLink connection and the Databricks back-end PrivateLink connection for REST APIs. Databricks publishes a separate VPC endpoint service for its secure cluster connectivity relay.

  • AWS VPC endpoint: An AWS VPC interface endpoint enables private connections between your VPC and VPC endpoint services powered by AWS PrivateLink. You must create AWS VPC interface endpoints and then register them with Databricks using the Account API 2.0. Registering a VPC endpoint creates a Databricks-specific object that references the VPC endpoint. See Step 3: Register your VPC endpoint IDs with the Account API.

  • Databricks network configuration: A Databricks object that describes the important information about a customer-managed VPC. If you implement any PrivateLink connection (front-end or back-end), your workspace must use a customer-managed VPC. For back-end PrivateLink support, your network configuration needs an extra property that identifies the VPC endpoints for the back-end connection. See Step 6: Create a new network configuration using Account API.

  • Databricks private access settings object: A Databricks object that describes a workspace’s PrivateLink connectivity. You create a private access settings object using the Account API 2.0 and reference its ID in the workspace configuration when you create a workspace. You can update this ID later, but to use PrivateLink you must attach a private access settings object to the workspace during workspace creation.

  • Databricks workspace configuration object: A Databricks object that describes a workspace. You create a workspace using the Account API 2.0. To use PrivateLink, this object must include the ID of a Databricks private access settings object.

The following diagram shows the network flow in a typical implementation.

PrivateLink network architecture

Requirements

Databricks account:

  • Your Databricks account is on the E2 version of the platform.

  • Your Databricks account is on the Enterprise pricing tier.

  • You have your Databricks account ID. Get your account ID from the account console.

Databricks workspace:

  • Your workspace must be in an AWS region that supports the E2 version of the platform. However, the us-west-1 region does not support PrivateLink even for workspaces on the E2 version of the platform.

  • Your Databricks workspace must use Customer-managed VPC to add any PrivateLink connection (even a front-end-only connection). Note that you cannot update an existing workspace with a Databricks-managed VPC and change it to use a customer-managed VPC.

  • If you implement the back-end PrivateLink connection, your Databricks workspace must use Secure cluster connectivity, which is the default for new workspaces on the E2 version of the platform. However, to add back-end PrivateLink to an older existing workspace that does not yet use secure cluster connectivity, contact your Databricks representative for guidance.

Network architecture:

  • If you implement the front-end PrivateLink connection to access the workspace from your on-premises network, you must add private connectivity from that network to your AWS network using Direct Connect or VPN before configuring PrivateLink.

AWS account permissions:

  • If you are the user who sets up PrivateLink, you must have all necessary AWS permissions to provision a Databricks workspace and to provision new VPC endpoints for your workspace.

Step 1: Contact your Databricks representative to enable PrivateLink on your account

PrivateLink support is not enabled by default for Databricks accounts. You must contact your Databricks representative to request enablement of your Databricks account and relevant AWS accounts to configure VPC endpoints to Databricks endpoint services.

Provide your Databricks representative with the AWS account ID and region where you plan to create one or more VPC endpoints for the Databricks workspaces. If you have multiple AWS accounts that would host Databricks VPC endpoints, request PrivateLink access for all of them. Be sure that you have all AWS account IDs before you contact your Databricks representative.

You must request enablement of your account for PrivateLink before you continue with the remaining configuration steps.

Step 2: Create VPC endpoints using AWS console

You must create VPC endpoints to Databricks endpoint services in VPCs that belong to your AWS accounts that were enabled for PrivateLink in the previous step. The VPC endpoints connect to your workspace’s front-end and back-end interfaces.

VPC endpoint requirements

Front-end connection networking requirements

Front-end connections are the connections from the user to the workspace front-end interfaces for the web application, REST APIs, and Databricks Connect.

For VPC and subnets, use an AWS VPC and subnets that are reachable from the user environment. Use the transit (bastion) VPC that terminates your AWS Direct Connect or VPN gateway connection or one that is routable from such a transit (bastion) VPC.

Create a new security group for the endpoint. The security group must allow HTTPS (port 443) access bidirectionally (from and to) for both the source network and the endpoint subnet itself.

Back-end connection networking requirements

Back-end connections are the connections from the data plane (cluster nodes) to the workspace back-end interfaces. It is important to understand that for a back-end connection, there are two separate VPC endpoints because the data plane needs to connect to both of the following services in the control plane:

  • Databricks REST APIs

  • Databricks secure cluster connectivity relay

For the back-end REST API VPC endpoint:

  • Host the VPC endpoint in your workspace’s main customer-managed VPC.

  • Create the VPC endpoint in a separate subnet that is not used by your workspace. Share this subnet between your VPC endpoint for REST APIs and your VPC endpoint for the secure cluster connectivity relay. Attach a separate route table to this subnet, different from the route table attached to your workspace subnets. The route table for a VPC endpoint’s subnet needs only a single default route for the local VPC.

  • If you enable PrivateLink for front-end connections, it is possible to reuse that front-end VPC endpoint as long as it is network accessible from the workspace subnets. Otherwise, create a new VPC endpoint from the workspace’s VPC to the same VPC endpoint service.

  • Create a separate security group for the endpoint that allows HTTPS/443 and TCP/6666 with bidirectional access (from and to) both the workspace subnets and the endpoint subnet itself. This configuration allows access for both REST APIs (port 443) and secure cluster connectivity (6666) and makes it easy to share the security group for both purposes.

For the back-end secure cluster connectivity relay VPC endpoint:

  • Host the VPC endpoint in your workspace’s main customer-managed VPC.

  • Create the VPC endpoint in a separate subnet that is not used by your workspace. Share this subnet between your VPC endpoint for REST APIs (if not reusing the front-end VPC endpoint) and your VPC endpoint for the secure cluster connectivity relay.

  • Create a separate security group for the endpoint that allows HTTPS/443 and TCP/6666 with bidirectional access from and to both the workspace subnets and the endpoint subnet itself. This configuration allows access for both REST APIs (port 443) and secure cluster connectivity (6666) and makes it easy to share the security group for both purposes.
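As a sketch of the security-group rules described above (the security group ID and CIDRs are placeholders, not values from this article, and each AWS CLI command is printed with echo rather than executed):

```shell
# Sketch only: the security group ID and CIDRs are placeholders, and each
# AWS CLI command is printed with 'echo' instead of being executed.
# Remove the 'echo' prefixes to apply the rules against your account.
SG_ID="sg-0123456789abcdef0"            # endpoint security group (placeholder)
CIDRS="10.0.0.0/22 10.0.4.0/26"         # workspace subnets and endpoint subnet (placeholders)

PLAN=$(
  for CIDR in $CIDRS; do
    for PORT in 443 6666; do   # 443 = REST APIs, 6666 = secure cluster connectivity
      echo aws ec2 authorize-security-group-ingress \
        --group-id "$SG_ID" --protocol tcp --port "$PORT" --cidr "$CIDR"
      echo aws ec2 authorize-security-group-egress \
        --group-id "$SG_ID" --protocol tcp --port "$PORT" --cidr "$CIDR"
    done
  done
)
echo "$PLAN"
```

Because the same group covers ports 443 and 6666 in both directions, you can attach it to both back-end VPC endpoints.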

Create the AWS VPC endpoints for your AWS region

To create the AWS VPC endpoints with the AWS Console, see the AWS article that describes how to create VPC endpoints in the AWS console. For tools that can help automate creating and managing VPC endpoints, see the AWS articles CloudFormation: Creating VPC Endpoint and AWS CLI: create-vpc-endpoint.
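For example, an AWS CLI sketch of creating a workspace VPC endpoint (the VPC, subnet, and security group IDs are placeholders; the service name is the us-west-2 workspace endpoint service listed in this section; echo prints the command instead of running it):

```shell
# Sketch only: all resource IDs are placeholders. Remove 'echo' to execute.
REGION="us-west-2"
# Workspace VPC endpoint service for us-west-2 (from the regional listing):
SERVICE_NAME="com.amazonaws.vpce.us-west-2.vpce-svc-0129f463fcfbc46c5"

CMD=$(echo aws ec2 create-vpc-endpoint \
  --region "$REGION" \
  --vpc-endpoint-type Interface \
  --service-name "$SERVICE_NAME" \
  --vpc-id "vpc-0123456789abcdef0" \
  --subnet-ids "subnet-0123456789abcdef0" \
  --security-group-ids "sg-0123456789abcdef0")
echo "$CMD"
```

Repeat with the SCC relay service name for the second back-end endpoint.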

Important

Be careful to copy the correct endpoint service domains for your AWS region from the regional endpoint services listed in this section. Copying the wrong endpoints is a common configuration error in this step.

Select the Enable DNS Hostnames and DNS Resolution options at the VPC level for both types of VPC endpoints. You can select Enable DNS Name on individual VPC endpoints after you’ve registered your VPC endpoints with Databricks, as described in Step 3: Register your VPC endpoint IDs with the Account API. Until then, the VPC endpoints remain in the pendingAcceptance state.

You can share the VPC endpoints across multiple workspaces. Whether you do depends on your organization’s AWS architecture best practices and your overall throughput requirements across all workloads. If you decide to share those across workspaces, you must create the back-end VPC endpoints in a separate subnet that is routable from the subnets of all the workspaces.

A front-end endpoint originates in a VPC with user access, which is typically a transit (bastion) VPC connected to an on-premises network. This is generally a separate VPC from the workspace’s data plane VPC. Although the Databricks VPC endpoint service is the same shared service for the front-end connection and the back-end REST API connection, in typical implementations, connections originate from two separate VPCs and thus need separate AWS VPC endpoints that originate in each VPC.

Use the following list to determine your region’s VPC endpoint service domains. You can use any availability zone in your region.

It is important to understand that the VPC endpoint service identified below as the workspace VPC endpoint service is used for both the front-end connection (user-to-workspace for web application and REST APIs) and the back-end connection to the REST APIs. If you are implementing both front-end and back-end connections, you use this same workspace VPC endpoint service for both use cases.

  • us-east-1
    • Workspace VPC endpoint service: com.amazonaws.vpce.us-east-1.vpce-svc-09143d1e626de2f04
    • Back-end SCC relay service: com.amazonaws.vpce.us-east-1.vpce-svc-00018a8c3ff62ffdf

  • us-east-2
    • Workspace VPC endpoint service: com.amazonaws.vpce.us-east-2.vpce-svc-041dc2b4d7796b8d3
    • Back-end SCC relay service: com.amazonaws.vpce.us-east-2.vpce-svc-090a8fab0d73e39a6

  • us-west-1
    • PrivateLink connectivity is unsupported for this region.

  • us-west-2
    • Workspace VPC endpoint service: com.amazonaws.vpce.us-west-2.vpce-svc-0129f463fcfbc46c5
    • Back-end SCC relay service: com.amazonaws.vpce.us-west-2.vpce-svc-0158114c0c730c3bb

  • eu-west-1
    • Workspace VPC endpoint service: com.amazonaws.vpce.eu-west-1.vpce-svc-0da6ebf1461278016
    • Back-end SCC relay service: com.amazonaws.vpce.eu-west-1.vpce-svc-09b4eb2bc775f4e8c

  • eu-west-2
    • Workspace VPC endpoint service: com.amazonaws.vpce.eu-west-2.vpce-svc-01148c7cdc1d1326c
    • Back-end SCC relay service: com.amazonaws.vpce.eu-west-2.vpce-svc-05279412bf5353a45

  • eu-central-1
    • Workspace VPC endpoint service: com.amazonaws.vpce.eu-central-1.vpce-svc-081f78503812597f7
    • Back-end SCC relay service: com.amazonaws.vpce.eu-central-1.vpce-svc-08e5dfca9572c85c4

  • ap-southeast-1
    • Workspace VPC endpoint service: com.amazonaws.vpce.ap-southeast-1.vpce-svc-02535b257fc253ff4
    • Back-end SCC relay service: com.amazonaws.vpce.ap-southeast-1.vpce-svc-0557367c6fc1a0c5c

  • ap-southeast-2
    • Workspace VPC endpoint service: com.amazonaws.vpce.ap-southeast-2.vpce-svc-0b87155ddd6954974
    • Back-end SCC relay service: com.amazonaws.vpce.ap-southeast-2.vpce-svc-0b4a72e8f825495f6

  • ap-northeast-1
    • Workspace VPC endpoint service: com.amazonaws.vpce.ap-northeast-1.vpce-svc-02691fd610d24fd64
    • Back-end SCC relay service: com.amazonaws.vpce.ap-northeast-1.vpce-svc-02aa633bda3edbec0

  • ap-south-1
    • Workspace VPC endpoint service: com.amazonaws.vpce.ap-south-1.vpce-svc-0dbfe5d9ee18d6411
    • Back-end SCC relay service: com.amazonaws.vpce.ap-south-1.vpce-svc-03fd4d9b61414f3de

  • ca-central-1
    • Workspace VPC endpoint service: com.amazonaws.vpce.ca-central-1.vpce-svc-0205f197ec0e28d65
    • Back-end SCC relay service: com.amazonaws.vpce.ca-central-1.vpce-svc-0c4e25bdbcbfbb684

Step 3: Register your VPC endpoint IDs with the Account API

Using the Account API 2.0, register the VPC endpoint IDs and the region from the VPC endpoints that you created in the previous step. This step creates a Databricks-specific VPC endpoint object for every AWS VPC endpoint that you created in the previous step.

  • For the front-end connection, you register the AWS VPC endpoint that you created to the workspace in the previous step.

  • For the back-end connection, register the AWS VPC endpoint that you created to the secure cluster connectivity relay. Also register the AWS VPC endpoint that you created in the workspace data plane VPC for the back-end connection to REST APIs. If you reuse the same VPC endpoint for both the front-end connection and the back-end connection to the REST APIs (which requires that your user access be from a network that is reachable from your workspace VPC), you won’t need to register it twice.

Important

In this release, when you register a VPC endpoint to the Databricks workspace VPC endpoint service for either a front-end connection or back-end REST API connection for any workspace, Databricks enables front-end (web application and REST API) access from that VPC endpoint to all PrivateLink-enabled workspaces in your Databricks account in that AWS region.

To register a VPC endpoint in Databricks, make a POST request to the /accounts/<account-id>/vpc-endpoints REST API endpoint and pass it the following fields in the request body:

  • vpc_endpoint_name: User-visible name for the VPC endpoint configuration within Databricks.

  • region: AWS region name.

  • aws_vpc_endpoint_id: Your VPC endpoint’s ID within AWS. It starts with the prefix vpce-.

For example:

curl -X POST -n \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/vpc-endpoints' \
  -d '{
  "vpc_endpoint_name": "Databricks front-end endpoint",
  "region": "us-west-2",
  "aws_vpc_endpoint_id": "<vpce-id>"
}'

If successful, your new VPC endpoint configuration progresses to the available state in AWS after going through some intermediate states.

The response JSON includes a vpc_endpoint_id field. If you are adding a back-end PrivateLink connection, save this value. This ID is specific to this configuration within Databricks. You need this ID when you create the network configuration in a later step (Step 6: Create a new network configuration using Account API).

After you register all VPC endpoints, you must enable a DNS name for them, as described in Step 4: Enable private DNS names on AWS VPC endpoints using the AWS console.

Related Account API operations that may be useful:

  • Check the state of a VPC endpoint — The state field in the response JSON indicates the state within AWS.

  • Get all VPC endpoints in your account

  • Delete VPC endpoint registration
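As an illustrative sketch of checking the state (the GET path is an assumption extrapolated from the registration endpoint above; the account and endpoint IDs are placeholders, and the curl call is commented out because it requires .netrc credentials):

```shell
# Placeholders: substitute your account ID and the Databricks-specific
# vpc_endpoint_id returned when you registered the endpoint.
ACCOUNT_ID="<account-id>"
VPC_ENDPOINT_ID="<databricks-vpc-endpoint-id>"
URL="https://accounts.cloud.databricks.com/api/2.0/accounts/${ACCOUNT_ID}/vpc-endpoints/${VPC_ENDPOINT_ID}"

# Check the state (the 'state' field reflects the state within AWS):
# curl -s -n "$URL" | python3 -c 'import json,sys; print(json.load(sys.stdin)["state"])'
echo "$URL"
```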

Step 4: Enable private DNS names on AWS VPC endpoints using the AWS console

Using the AWS console, enable private DNS names on your VPC endpoints.

After you’ve registered your VPC endpoints with the Account API and the state in AWS has progressed to Available, select Enable DNS Name on each VPC endpoint.

  1. In AWS console, click VPC > Endpoints.

  2. Select your VPC endpoint.

  3. Click the Actions button at the top.

  4. Select Modify Private DNS Names. If this setting is grey (unavailable), the endpoint may not yet be in the available state.

  5. Select Enable Private DNS Name and click Modify Private DNS Names.

  6. Wait until AWS shows the VPC endpoint state as available.

To automate enablement of Private DNS Name on VPC endpoints, see this AWS article:

  • AWS CLI: modify-vpc-endpoint
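A minimal sketch with the AWS CLI (the endpoint ID is a placeholder; the --private-dns-enabled flag corresponds to the console’s Enable Private DNS Name option; echo prints the command rather than executing it):

```shell
# Sketch only: replace the placeholder endpoint ID with each of your
# registered VPC endpoints, then remove 'echo' to apply the change.
CMD=$(echo aws ec2 modify-vpc-endpoint \
  --vpc-endpoint-id "vpce-0123456789abcdef0" \
  --private-dns-enabled)
echo "$CMD"
```

Run it once per VPC endpoint after the endpoint reaches the available state.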

Step 5: Create a private access settings configuration using the Databricks Account API

Use the Databricks Account API 2.0 to create or attach a private access settings configuration object.

The private access settings object supports the following scenarios:

  • Implement only a front-end VPC endpoint

  • Implement only a back-end VPC endpoint

  • Implement both front-end and back-end VPC endpoints

For a workspace to support any of these PrivateLink connectivity scenarios, the workspace must be created with an attached private access settings object. This can be a new private access settings object intended only for this workspace. Alternatively, you can reuse an existing private access settings object and share it among multiple workspaces in the same AWS region.

This object serves two purposes:

  1. Expresses your intent to use AWS PrivateLink with your workspace. If you intend to connect to your workspace using front-end or back-end PrivateLink, you must attach one of these objects to your workspace during workspace creation.

  2. Controls your settings for the front-end use case of AWS PrivateLink. If you wish to use back-end PrivateLink only, you can choose to set the object’s public_access_enabled field to true.

In the private access settings object definition, the public_access_enabled field configures public access to the front-end connection (the web application and REST APIs) for your workspace:

  • If set to false (the default), the front-end connection can be accessed only using PrivateLink connectivity and not from the public internet. Because access from the public network is disallowed in this case, the IP access lists feature is not supported for the workspace.

  • If set to true, the front-end connection can be accessed either from PrivateLink connectivity or from the public internet. You can optionally configure an IP access list for the workspace to restrict the source networks that could access the web application and REST APIs from the public internet.

To create a private access settings object, make a POST request to the /accounts/<account-id>/private-access-settings REST API endpoint. The request body must include the following properties:

  • private_access_settings_name: Human-readable name for the private access settings object.

  • region: AWS region name.

  • public_access_enabled: Specifies whether to enable public access for the front-end connection. If true, public access is possible for the front-end connection in addition to PrivateLink connections. See the description of this field earlier in this step to determine the required value for your implementation.

  • private_access_level: (Public Preview) Specify which VPC endpoints can connect to this workspace:

    • ANY (the default): Perform no filtering of the VPC endpoints that can connect to your workspace using AWS PrivateLink.

    • ACCOUNT: Limit connections to those VPC endpoints that are registered in your Databricks account.

    • ENDPOINT: Limit connections to an explicit set of VPC endpoints. See the related allowed_vpc_endpoint_ids property.

  • allowed_vpc_endpoint_ids: Use only if private_access_level is set to ENDPOINT. This property specifies the set of VPC endpoints that can connect to this workspace. Specify as a JSON array of VPC endpoint IDs. Use the Databricks IDs that were returned during endpoint registration, not the AWS IDs.

curl -X POST -n \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/private-access-settings' \
  -d '{
  "private_access_settings_name": "Default PAS for us-west-2",
  "region": "us-west-2",
  "public_access_enabled": true
}'
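For comparison, here is a sketch of a more restrictive payload that sets private_access_level to ENDPOINT (the allowed endpoint IDs are placeholders for the Databricks-specific IDs returned in Step 3; the payload is only validated locally here rather than POSTed):

```shell
# Placeholders: the allowed IDs must be the Databricks-specific
# vpc_endpoint_id values returned during registration, not AWS vpce- IDs.
BODY='{
  "private_access_settings_name": "Restricted PAS for us-west-2",
  "region": "us-west-2",
  "public_access_enabled": false,
  "private_access_level": "ENDPOINT",
  "allowed_vpc_endpoint_ids": ["<databricks-vpce-id-1>", "<databricks-vpce-id-2>"]
}'

# Validate the JSON locally before sending it to
# /accounts/<account-id>/private-access-settings with curl -X POST -n:
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload OK"
```

With public_access_enabled set to false, only the listed VPC endpoints can reach the workspace front end.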

The response JSON includes a private_access_settings_id field. This ID is specific to this configuration within Databricks. It is important that you save that result field because you will need it when you create the workspace.

Related APIs:

  • Get a private access settings object by its ID

  • Get all private access settings objects

  • Update a private access settings object (Public Preview)

  • Delete private access settings object

Step 6: Create a new network configuration using Account API

Note

If you implement only the front-end connection, skip this step. Although you must create a network configuration because a customer-managed VPC is required, there are no PrivateLink changes to this object if you implement only a front-end PrivateLink connection.

For any PrivateLink support, you must use a Customer-managed VPC. This feature requires you to create a network configuration object that encapsulates the ID for the VPC, the subnets, and the security groups.

For back-end PrivateLink support, your network configuration must have an extra field that is specific to PrivateLink. The network configuration vpc_endpoints field references your Databricks-specific VPC endpoint IDs that were returned when you registered your VPC endpoints. See Step 3: Register your VPC endpoint IDs with the Account API.

Add both of the following fields to that object:

  • rest_api: Set this to a JSON array that includes exactly one element: the Databricks-specific ID for the back-end REST API VPC endpoint that you registered in Step 3: Register your VPC endpoint IDs with the Account API.

    Important

    Be careful to use the Databricks-specific ID that was created when you registered the regional endpoint based on the table in Create the AWS VPC endpoints for your AWS region. It is a common configuration error to set the wrong ID on this field.

  • dataplane_relay: Set this to a JSON array that includes exactly one element: the Databricks-specific ID for the back-end SCC VPC endpoint that you registered in Step 3: Register your VPC endpoint IDs with the Account API.

    Important

    Be careful to use the Databricks-specific ID that was created when you registered the regional endpoint based on the table in Create the AWS VPC endpoints for your AWS region. It is a common configuration error to set the wrong ID on this field.

You get these Databricks-specific VPC endpoint IDs from the JSON responses of the requests you made in Step 3: Register your VPC endpoint IDs with the Account API, within the vpc_endpoint_id response field.

The following example creates a new network configuration that references the VPC endpoint IDs. Replace <databricks-vpce-id-for-scc> with your Databricks-specific VPC endpoint ID for the secure cluster connectivity relay. Replace <databricks-vpce-id-for-rest-apis> with your Databricks-specific VPC endpoint ID for the REST APIs.

curl -X POST -n \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/networks' \
  -d '{
    "network_name": "Provide name for the Network configuration",
    "vpc_id": "<aws-vpc-id>",
    "subnet_ids": [
        "<aws-subnet-1-id>",
        "<aws-subnet-2-id>"
    ],
    "security_group_ids": [
        "<aws-sg-id>"
    ],
    "vpc_endpoints": {
        "dataplane_relay": [
            "<databricks-vpce-id-for-scc>"
        ],
        "rest_api": [
            "<databricks-vpce-id-for-rest-apis>"
        ]
    }
}'

Step 7: Create or update a workspace with PrivateLink configurations using the Account API

Use the Databricks Account API 2.0 to create a new workspace or update an existing one. A workspace configuration object includes IDs for other configuration objects that you’ve created. For the workspace configuration object, there are two changes for PrivateLink:

  • Set the private_access_settings_id to the ID of your private access settings object that you created. See Step 5: Create a private access settings configuration using the Databricks Account API.

  • Pass the ID for your new network configuration that you created in Step 6: Create a new network configuration using Account API.

Create a workspace

To create a new workspace with PrivateLink, call the Create a new workspace API (POST /accounts/{account_id}/workspaces).

For complete instructions, see Create a new workspace using the Account API.

For example:

curl -X POST -n \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/workspaces' \
  -d '{
  "workspace_name": "my-company-example",
  "deployment_name": "my-company-example",
  "aws_region": "us-west-2",
  "credentials_id": "<aws-credentials-id>",
  "storage_configuration_id": "<databricks-storage-config-id>",
  "network_id": "<databricks-network-id>",
  "managed_services_customer_managed_key_id": "<aws-kms-managed-services-key-id>",
  "storage_customer_managed_key_id": "<aws-kms-notebook-workspace-storage-id>",
  "private_access_settings_id": "<private-access-settings-id>"
}'

For a list of tools that can help you to automate workspace creation, see Use automation templates to create a new workspace using the Account API.

Update a workspace

You cannot update the PrivateLink objects in your workspace configuration using the Account API in this release. Contact your Databricks representative to perform this step for you.

Step 8: Configure the internal DNS (required only for front-end connections)

Note

If you implement only the back-end connection, skip this step.

For the front-end connection, if users will access the Databricks workspace from an on-premises network that is within the scope of your internal or custom DNS, you must perform the following configuration after the workspace is created or updated to ensure that your workspace URL maps to the private IP of your front-end VPC endpoint.

Configure your internal DNS so that it maps the workspace URL to the private IP of the front-end VPC endpoint.

By default, the public DNS mapping for a workspace with front-end VPC endpoint in AWS region us-east-1 looks like the following:

  • test-workspace.cloud.databricks.com maps to nvirginia.privatelink.cloud.databricks.com

  • nvirginia.privatelink.cloud.databricks.com maps to nvirginia.cloud.databricks.com

  • nvirginia.cloud.databricks.com maps to the AWS public IPs

If you access the workspace from a virtual machine in the same VPC as your front-end VPC endpoint, the DNS mapping would look like the following:

  • test-workspace.cloud.databricks.com maps to nvirginia.privatelink.cloud.databricks.com

  • nvirginia.privatelink.cloud.databricks.com maps to the VPC endpoint private IP

Use the nslookup Unix command line tool to test the DNS resolution.
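As a sketch of that check, the script below wraps nslookup and classifies the answer, assuming a hypothetical workspace URL (test-workspace.cloud.databricks.com) and the standard RFC 1918 private ranges; substitute your own workspace URL.

```shell
# Hypothetical workspace URL -- substitute your own.
WORKSPACE_URL="test-workspace.cloud.databricks.com"

# Take the last "Address" line that nslookup prints (the answer record,
# not the DNS server line). Empty if resolution fails.
IP=$(nslookup "$WORKSPACE_URL" 2>/dev/null | awk '/^Address/ {ip=$2} END {print ip}')

case "$IP" in
  10.*|172.1[6-9].*|172.2[0-9].*|172.3[0-1].*|192.168.*)
    echo "private IP $IP: workspace URL resolves to the VPC endpoint" ;;
  "")
    echo "resolution failed: check connectivity and DNS configuration" ;;
  *)
    echo "public IP $IP: internal DNS mapping is not in effect" ;;
esac
```

Run it from a host inside the VPC (or on the on-premises network) to confirm which branch you hit before and after your DNS changes.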

For the workspace URL to map to the VPC endpoint private IP from the on-premises network, you must do one of the following:

  • Configure conditional forwarding for the workspace URL so that it is resolved by the Amazon-provided DNS.

  • Create an A-record for the workspace URL in your on-premises or internal DNS that maps to the VPC endpoint private IP.

  • Follow the same approach that you use to enable access to other PrivateLink-enabled services.

You can choose to map the workspace URL directly to the front-end VPC endpoint private IP by creating an A-record in your internal DNS, such that the DNS mapping looks like this:

  • test-workspace.cloud.databricks.com maps to the VPC endpoint private IP
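For the direct-mapping approach, a minimal sketch of such an A-record in BIND zone-file syntax, assuming a hypothetical private IP of 10.0.1.25 for the front-end VPC endpoint:

```
; Hypothetical internal-zone entry -- 10.0.1.25 stands in for your
; front-end VPC endpoint's private IP.
test-workspace.cloud.databricks.com.  300  IN  A  10.0.1.25
```

If you use Route 53 private hosted zones or another DNS product, create the equivalent A-record through that product's own interface.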

After you make changes to your internal DNS configuration, you should be able to access the Databricks workspace web application and REST API.

If you have questions about how this applies to your network architecture, contact your Databricks representative.

Step 9: Add VPC endpoints for other AWS services (recommended but optional)

If you are using Secure cluster connectivity, which is required to implement a PrivateLink back-end connection, you can optionally add other VPC endpoints to your data plane VPC so that your clusters can reach the AWS services that Databricks uses (S3, STS, and Kinesis) and any other resources that your workspace will access.

  • S3 VPC gateway endpoint: Attach this only to the route table that’s attached to your workspace subnets. If you’re using the recommended separate subnet with its own route table for back-end VPC endpoints, then the S3 VPC endpoint doesn’t need to be attached to that particular route table.

  • STS VPC interface endpoint: Create this in all the workspace subnets and attach it to the workspace security group. Do not create this in the subnet for back-end VPC endpoints.

  • Kinesis VPC interface endpoint: As with the STS VPC interface endpoint, create the Kinesis VPC interface endpoint in all the workspace subnets and attach it to the workspace security group.

Even if you don’t have a fully outbound-locked-down deployment, you can add some or all of the above endpoints so that the relevant traffic traverses the AWS network backbone rather than the public network.
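The three endpoints above can be sketched as AWS CLI calls. All IDs below are hypothetical placeholders; the commands are printed rather than executed so that you can review them first (remove the leading echo to run them for real):

```shell
REGION="us-east-1"           # hypothetical values throughout --
VPC="vpc-0abc123"            # substitute your own IDs
ROUTE_TABLE="rtb-0def456"    # route table attached to the workspace subnets
SUBNETS="subnet-0aaa111 subnet-0bbb222"   # workspace subnets only
SG="sg-0ccc333"              # workspace security group

# S3 gateway endpoint: attach to the workspace subnets' route table only.
echo aws ec2 create-vpc-endpoint --vpc-id "$VPC" \
  --service-name "com.amazonaws.${REGION}.s3" \
  --vpc-endpoint-type Gateway --route-table-ids "$ROUTE_TABLE"

# STS and Kinesis interface endpoints: create in the workspace subnets and
# attach the workspace security group (not the back-end endpoint subnet).
for svc in sts kinesis-streams; do
  echo aws ec2 create-vpc-endpoint --vpc-id "$VPC" \
    --service-name "com.amazonaws.${REGION}.${svc}" \
    --vpc-endpoint-type Interface \
    --subnet-ids $SUBNETS --security-group-ids "$SG"
done
```

Running the real commands requires AWS credentials with the appropriate EC2 permissions.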

If you lock down the workspace VPC so that no other outbound connections are allowed, the workspace cannot reach the Databricks-provided metastore (an RDS-based Hive metastore), because AWS does not yet support PrivateLink for JDBC traffic to RDS. One option is to allow the regional Databricks-provided metastore FQDN or IP through an egress firewall, a public route table to an internet gateway, or a network ACL on the public subnet that hosts a NAT gateway; in that case, traffic to the Databricks-provided metastore travels over the public network. If you do not want to access the Databricks-provided metastore over the public network:

  • You could deploy an external metastore in your own VPC. See External Apache Hive metastore.

  • You could use AWS Glue for your metastore. Glue supports PrivateLink. See Use AWS Glue Data Catalog as the metastore for Databricks Runtime.
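If you choose the Glue option, the switch is a cluster Spark configuration; a minimal sketch is below (see the linked article for the IAM role and catalog-ID details your setup also needs):

```
spark.databricks.hive.metastore.glueCatalog.enabled true
```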

You may also need access to public library repositories such as PyPI (for Python) or CRAN (for R). To reach them, either reconsider a fully locked-down outbound deployment or allow the required repositories through an egress firewall in your architecture. The overall architecture of your deployment depends on your requirements. If you have questions, contact your Databricks representative.

To create the AWS VPC endpoints using the AWS Console, see the AWS article for creating VPC endpoints in the AWS console.

For tools that can help automate VPC endpoint creation and management, see:

  • The article Databricks Terraform provider

  • The Terraform resource databricks_mws_vpc_endpoint

  • The AWS article CloudFormation: Creating VPC Endpoint

  • The AWS article AWS CLI: create-vpc-endpoint


© Databricks 2022. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
