Networking recommendations for Lakehouse Federation

This article provides guidance for setting up a viable network path between your Databricks clusters or SQL warehouses and the external database system that you are connecting to using Lakehouse Federation.

Keep the following important points in mind:

  • All network traffic flows directly between Databricks clusters (or SQL warehouses) and the external database system. Neither Unity Catalog nor the Databricks control plane is on the network path.
  • Databricks compute (that is, clusters and SQL warehouses) always deploys in the cloud, but the external database system can be on-premises or hosted on any cloud provider, as long as there's a viable network path between your Databricks compute and the external database.
  • If you have inbound or outbound network restrictions on either Databricks compute or the external database system, refer to the following sections for general guidance to help you create a viable network path.

For more information on networking in Databricks workspaces, see Networking.

Database system and Databricks compute both accessible from internet

The connection should work without any configuration.
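
If you want to confirm basic TCP reachability before creating the connection, a minimal sketch such as the following can be run from a notebook on the Databricks compute. The hostname and port are hypothetical placeholders for your external database.

    import socket

    DB_HOST = "mydb.example.com"  # placeholder: your external database hostname
    DB_PORT = 5432                # placeholder: your database port (PostgreSQL default shown)

    # Open and immediately close a TCP connection to verify the network path.
    with socket.create_connection((DB_HOST, DB_PORT), timeout=10):
        print(f"TCP connection to {DB_HOST}:{DB_PORT} succeeded")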

Database system has network access restrictions

If the external database system has inbound or outbound network access restrictions and the Databricks cluster or SQL warehouse is accessible from the internet, configure one of the following network solutions to connect from classic compute resources:

  • Stable egress IP on Databricks compute.

    From the classic compute plane, set up a stable IP address with a load balancer, NAT gateway, internet gateway, or equivalent, and connect it to the subnet where Databricks compute is deployed. This allows the compute resource to share a stable public IP address that can be allowlisted on the external database side. A minimal provisioning sketch on AWS follows this list.

    From the serverless compute plane, stable egress IPs are supported. See Step 1: Create a network connectivity configuration and copy the stable IPs.

    The external database system should allowlist the Databricks compute stable IP for both ingress and egress traffic.

  • PrivateLink (only when the external database is on the same cloud as Databricks compute)

    From the classic compute plane, configure a PrivateLink connection between the network where the database is deployed and the network where Databricks compute is deployed.
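
For the stable egress IP option above, the following is a minimal sketch of provisioning a NAT gateway with an Elastic IP on AWS using boto3. All resource IDs and the region are placeholders; Azure and GCP offer equivalent NAT constructs through their own SDKs.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

    # Allocate an Elastic IP to serve as the stable public egress address.
    eip = ec2.allocate_address(Domain="vpc")

    # Create a NAT gateway in a public subnet of the workspace VPC.
    nat = ec2.create_nat_gateway(
        SubnetId="subnet-0123456789abcdef0",  # placeholder: a public subnet
        AllocationId=eip["AllocationId"],
    )
    nat_id = nat["NatGateway"]["NatGatewayId"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

    # Route outbound traffic from the compute subnets through the NAT gateway
    # so that all egress shares the Elastic IP allocated above.
    ec2.create_route(
        RouteTableId="rtb-0123456789abcdef0",  # placeholder: compute subnet route table
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )

    print("Allowlist this IP on the database side:", eip["PublicIp"])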

Databricks compute has network access restrictions

If the external database system is accessible from the internet and the Databricks compute has inbound or outbound network access restrictions (which is only possible if you are on a customer-managed network), apply one of the following configurations:

  • Allowlist the hostname of the external database in the firewall rules of the subnet where Databricks compute is deployed.

    If you choose to allowlist the external database IP address rather than hostname, make sure that the external database has a stable IP address.

  • PrivateLink (only when the external database is on the same cloud as Databricks compute)

    Configure a PrivateLink connection between the network where the database is deployed and the network where Databricks compute is deployed, as shown in the sketch after this list.
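
For either PrivateLink option, the consumer side on AWS amounts to creating an interface VPC endpoint in the workspace VPC that points at the endpoint service the database provider exposes. The following is a minimal boto3 sketch; all IDs and the service name are placeholders, and the security group must allow traffic on the database port.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

    # Create an interface endpoint in the workspace VPC that connects to the
    # database provider's VPC endpoint service.
    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",  # placeholder: workspace VPC
        ServiceName="com.amazonaws.vpce.us-west-2.vpce-svc-0123456789abcdef0",  # placeholder
        SubnetIds=["subnet-0123456789abcdef0"],     # placeholder: compute subnets
        SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder: must allow the database port
    )
    print(response["VpcEndpoint"]["VpcEndpointId"])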

Databricks compute has a custom DNS server

If the external database system is accessible from the internet and the Databricks compute has a custom DNS server (which is only possible if you are on a customer-managed network), add the database system's hostname to your custom DNS server so that it can be resolved.
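
How you add the record depends on your DNS server. As one hedged example, if your custom DNS is backed by an Amazon Route 53 private hosted zone, a record can be upserted with boto3 as sketched below; the zone ID, hostname, and IP are placeholders. For a self-managed server such as BIND, add the equivalent A or CNAME record instead.

    import boto3

    route53 = boto3.client("route53")

    # Upsert an A record so the database hostname resolves inside the VPC.
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789ABCDEFGHIJ",  # placeholder: private hosted zone
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "mydb.example.com",  # placeholder: database hostname
                    "Type": "A",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],  # placeholder IP
                },
            }]
        },
    )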

AWS Glue networking considerations

If you use serverless compute with Glue federation, there's no setup required. If you use classic compute with Glue federation, Databricks recommends using private networking with a federated catalog for enhanced security and performance.

When you deploy Databricks with a federated catalog, networking between the Databricks data plane and the AWS Glue Data Catalog in your AWS VPC is essential. This typically involves establishing private connectivity between the Databricks workspace's VPC and the AWS VPC using AWS PrivateLink or an interface VPC endpoint.

An interface VPC endpoint in the AWS VPC acts as an entry point for traffic to the AWS Glue Data Catalog. It's associated with a security group that controls catalog access. The Databricks workspace is then configured to use this endpoint. Security groups and network ACLs must allow traffic on the required ports (typically 443). DNS resolution for the AWS Glue Data Catalog using the endpoint might need private DNS zones or a DNS forwarder. Ensuring high availability and monitoring network traffic are crucial for a resilient setup.
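
The following is a minimal boto3 sketch of creating such an interface VPC endpoint for the Glue Data Catalog. The VPC, subnet, and security group IDs are placeholders; setting PrivateDnsEnabled lets glue.<region>.amazonaws.com resolve to the endpoint's private IPs inside the VPC.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

    # Create an interface endpoint for the AWS Glue service in the workspace VPC.
    endpoint = ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",              # placeholder: workspace VPC
        ServiceName="com.amazonaws.us-west-2.glue",
        SubnetIds=["subnet-0123456789abcdef0"],     # placeholder: compute subnets
        SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder: must allow TCP 443
        PrivateDnsEnabled=True,  # resolve the public Glue hostname to private IPs
    )
    print(endpoint["VpcEndpoint"]["VpcEndpointId"])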

Federating to AWS Glue Catalog using NAT

Traffic to the AWS Glue Data Catalog can traverse the public internet, but private connectivity is recommended for security purposes. If you use serverless compute, networking to the Glue catalog is automatically routed to the public Glue endpoint glue.us-west-2.amazonaws.com. If the service credential has the correct IAM permissions, this works with no setup required.

Because NAT introduces extra cost and exposes traffic to the public internet, treat it as a fallback rather than a best practice. If the Databricks compute and the AWS Glue Data Catalog service endpoint are both on AWS in the same region, traffic stays on the AWS backbone rather than traversing the open internet. However, the endpoint still resolves to a public IP instead of a private IP.
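
To confirm how Glue traffic is being routed and that the credential's IAM permissions work, a minimal check such as the following can be run from a notebook; the region is a placeholder. With NAT the hostname resolves to a public IP, while with an interface endpoint and private DNS it resolves to private IPs.

    import socket

    import boto3

    REGION = "us-west-2"  # placeholder: your Glue catalog's region
    host = f"glue.{REGION}.amazonaws.com"

    # Show whether the Glue endpoint resolves to a public or private IP.
    print(host, "resolves to", socket.gethostbyname(host))

    # List a few catalog databases to confirm IAM permissions.
    glue = boto3.client("glue", region_name=REGION)
    for db in glue.get_databases(MaxResults=5)["DatabaseList"]:
        print(db["Name"])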

Snowflake networking considerations

Snowflake supports PrivateLink egress as a preview feature. Snowflake does not provide static IPs, which rules out IP allowlisting.