This guide illustrates how to configure your Databricks deployment in AWS so that specific traffic between EC2 instances and another IP address is proxied through a NAT gateway. You might want to do this to point your BI tools to a static IP address. For example, this can proxy all traffic to a Redshift cluster through the NAT gateway, so from the point of view of the Redshift cluster, all instances have a stable public IP address.
For this to work with Redshift, the Redshift cluster must be publicly accessible with a stable IPv4 address. This can be achieved by launching the Redshift cluster with an elastic IP.
This article walks through allocating a new public subnet in your Databricks VPC, adding a NAT gateway inside the subnet, and updating the default route table to make sure specific traffic goes through the NAT gateway.
Contact Databricks to retrieve the ID of your Databricks VPC. You will use this Databricks VPC ID several times in the following steps.
Log in to your AWS VPC console and select Subnets from the left panel.
Type the Databricks VPC ID in the search box. You can see all subnets that Databricks created in the VPC.
By default, Databricks creates one subnet per availability zone (AZ). You must choose an unused CIDR block from the same B-class range (that is, a.b.0.0/16). In this example, there are 3 used CIDR blocks: 10.29.160.0/19, 10.29.192.0/19, and 10.29.224.0/19. The addresses from 10.29.0.0 through 10.29.159.255 are all free. You can choose a new C-class subnet within that range, for example 10.29.0.0/24.
- Databricks allocates CIDR blocks starting from the top of the B-class range. When choosing the CIDR block for the gateway subnet, it is better to use a lower range to avoid future conflict in case AWS adds more availability zones to the region.
- The size of the public subnet can be arbitrarily small — it just needs to accommodate a single host for the NAT gateway.
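As a quick sanity check, you can confirm that a candidate gateway subnet overlaps none of the existing Databricks subnets using Python's standard `ipaddress` module. The CIDR blocks below are the illustrative values from this example; substitute the blocks actually in use in your VPC:

```python
import ipaddress

# CIDR blocks already used by Databricks in the VPC (illustrative values)
used = [ipaddress.ip_network(c) for c in
        ["10.29.160.0/19", "10.29.192.0/19", "10.29.224.0/19"]]

def is_free(candidate: str) -> bool:
    """Return True if the candidate block overlaps no existing subnet."""
    net = ipaddress.ip_network(candidate)
    return not any(net.overlaps(u) for u in used)

print(is_free("10.29.0.0/24"))    # low range, unused -> True
print(is_free("10.29.160.0/24"))  # falls inside a used /19 -> False
```

Picking a low-range block, as the tip above suggests, keeps the gateway subnet clear of any /19 blocks Databricks may allocate from the top later.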
Click Create Subnet and enter the required information:
- Name tag: any descriptive name.
- VPC: the Databricks VPC ID.
- Availability Zone: any zone from the list.
- CIDR block: an unused CIDR block (for example, 10.29.0.0/24).
Click Yes, Create.
Select NAT gateways from AWS VPC console’s left panel, and click Create NAT gateway.
From the Subnet field, choose the subnet you created in Step 1; from Elastic IP Allocation ID, choose any available elastic IP; then click Create a NAT gateway. If no elastic IPs are available, create a new elastic IP for the NAT gateway to use.
On the success page, click Edit Route Tables.
Click Create route table and enter any name in the Name tag field.
Select the Databricks VPC-ID from the VPC field.
Click Yes, Create.
Select the newly created route table from the list, go to the Routes tab, click Edit, and then click Add another route. Enter 0.0.0.0/0 in Destination, choose the Databricks VPC's internet gateway (starts with igw-xxxxxx) as the Target, and click Save.
Go to the Subnet Associations tab, click Edit, check the gateway subnet you created in Step 1, and click Save. This grants the gateway subnet direct internet access, making it a "public subnet".
This step routes all traffic between instances in the VPC and an external system (for example, a Redshift cluster) through the NAT gateway.
In the AWS VPC console’s left panel, select Route Tables and type the Databricks VPC-ID in the search box. The list shows all route tables used by Databricks VPC. One of them is the gateway route table you created in Step 3, the other (marked as Main) is the default route table for all other subnets in the VPC.
Select the route table marked as Main from the list, go to the Routes tab, and click Edit.
Find the elastic IP address of the external system (for example, the Redshift cluster). This example uses 126.96.36.199. Add a route with 126.96.36.199/32 as the Destination and the NAT gateway you created in Step 2 (starts with nat-xxxxxxxxxx) as the Target, and click Save.
Now all traffic from instances in the VPC to the IP address you specified above goes through the NAT gateway. From the point of view of the external system, all nodes share the NAT gateway's elastic IP address. All other traffic uses the VPC's original routing rules.
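Only traffic bound for the external system is redirected because VPC routing uses longest-prefix matching: the /32 route is more specific than the default route. The sketch below mimics that selection in pure Python, using the example addresses from above; the VPC CIDR and route targets are illustrative placeholders:

```python
import ipaddress

# Simplified main route table after Step 4; targets are placeholders.
routes = {
    "10.29.0.0/16": "local",               # intra-VPC traffic (example VPC CIDR)
    "126.96.36.199/32": "nat-xxxxxxxxxx",  # external system, via the NAT gateway
    "0.0.0.0/0": "existing-default",       # everything else keeps its old route
}

def next_hop(dst: str) -> str:
    """Return the target of the route with the longest matching prefix."""
    addr = ipaddress.ip_address(dst)
    best_net, best_tgt = None, None
    for cidr, tgt in routes.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_tgt = net, tgt
    return best_tgt

print(next_hop("126.96.36.199"))  # the /32 route beats the default route
print(next_hop("8.8.8.8"))        # unaffected traffic keeps the default route
```

This is why adding the single /32 route is enough: no other destination's routing changes.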
All external traffic that goes through the NAT gateway is subject to additional charges (see Amazon VPC pricing). To make sure that instances can access Databricks assets stored in S3 buckets quickly and free of NAT charges, you can add an S3 endpoint to the Databricks VPC.
VPC endpoints do not support cross-region access. Accessing S3 buckets that reside in another region will still go through the NAT gateway.
In the AWS VPC console’s left panel, select Endpoints and click Create Endpoint.
In the VPC field, select the Databricks VPC-ID, select the only available S3 service from the Service box, and click Next Step.
Check the main route table to associate the S3 endpoint with all private subnets and click Create Endpoint.
Go to the Subnets page and the Route Table tab, type the Databricks VPC-ID in the search box, and verify that all settings are correct.
- The gateway subnet should have a default route through the VPC's internet gateway (igw-xxxxxx).
- All other subnets should have a default route through the NAT gateway (nat-xxxxxxxxxx).
- If the S3 endpoint is configured correctly, an entry with a vpce-xxxx target should appear in the list.
- The VPC peering connection to the Databricks control plane VPC (pcx-xxxxx) should remain untouched.
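If you prefer to script these checks, a minimal sketch might look like the following. The route listings here are hypothetical stand-ins; in practice you would read them from the AWS console or from `aws ec2 describe-route-tables` output, and the control-plane CIDR `10.1.0.0/16` is only an example:

```python
# Hypothetical route listings standing in for real describe-route-tables output.
gateway_routes = {"0.0.0.0/0": "igw-xxxxxx"}
main_routes = {
    "126.96.36.199/32": "nat-xxxxxxxxxx",  # external system via the NAT gateway
    "pl-xxxxxx": "vpce-xxxx",              # S3 prefix list via the VPC endpoint
    "10.1.0.0/16": "pcx-xxxxx",            # peering to the Databricks control plane
}

def routes_to(routes: dict, destination: str, target_prefix: str) -> bool:
    """True if the table sends `destination` to a target of the expected kind."""
    return routes.get(destination, "").startswith(target_prefix)

# The checks listed above, as assertions:
assert routes_to(gateway_routes, "0.0.0.0/0", "igw-")      # public subnet
assert routes_to(main_routes, "126.96.36.199/32", "nat-")  # NAT for the external IP
assert routes_to(main_routes, "pl-xxxxxx", "vpce-")        # S3 endpoint entry
assert routes_to(main_routes, "10.1.0.0/16", "pcx-")       # peering untouched
```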
After you confirm that the NAT gateway proxy is set up properly, log in to your Databricks workspace and create a new cluster. Wait for the cluster to become ready. Contact Databricks support if the cluster cannot start.
With a NAT gateway proxy enabled, all outbound internet traffic from the Apache Spark cluster goes through the gateway. You can further increase security by removing the public-facing IP address on each worker instance.
You can only disable the public IP address on newly created AWS instances.
- Contact Databricks support and inform them that you would like to disable public IP addresses.
- After Databricks confirms that the configuration change is complete, terminate all clusters.
- Wait one hour or manually terminate the existing instances in the AWS console. If you do not perform this step, Databricks might reuse existing instances that still have the previously assigned public IP addresses for any new clusters.
- Log in to your Databricks deployment and create a new cluster. Wait for the cluster to become ready.
If all configurations are correct, the cluster should come up as usual. You can verify that the cluster has no public IP address by going to the AWS EC2 dashboard and checking the configuration of the worker instances.