Connect to on-premises resources using an SSH reverse tunnel

Connect your on-premises resources to Databricks without opening inbound firewall access. An on-premises tunnel host opens outbound SSH connections to proxy virtual machines (VMs) in AWS, allowing Databricks classic and serverless compute to reach your on-premises resources.

How it works

An SSH reverse tunnel lets an on-premises tunnel host open outbound SSH connections to cloud proxy VMs in AWS. Databricks connects to the proxy VMs through a load balancer, and traffic flows back through the tunnel to the on-premises resource. The on-premises network requires only outbound SSH (port 22) to AWS; no inbound ports need to be opened.

In a reverse tunnel, the on-premises host initiates the connection outbound (on-premises to cloud) to avoid relaxing firewall restrictions, and the return traffic flows back (cloud to on-premises) over the established path.

Classic compute reaches the proxy VMs through VPC peering. Serverless compute reaches them through a private endpoint connection using AWS PrivateLink.
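
A single reverse forward captures the whole pattern. The following is a minimal sketch, with hypothetical host names:

Bash
# Run from the on-premises tunnel host. The SSH connection is outbound
# (on-premises → proxy VM), but -R makes the proxy VM's sshd listen on
# localhost:13306 and forward any traffic arriving there back through
# this session to the on-premises MySQL server.
ssh -N -R 13306:mysql.corp.internal:3306 tunnel@<proxy-vm-private-ip>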

note

This is a self-managed solution. You provision and maintain the proxy VMs and the on-premises tunnel host.

Required and optional components

note

This setup requires a dedicated network circuit between your cloud environment and your on-premises network. This circuit allows the on-premises tunnel host to initiate outbound SSH to the proxy VMs' private IP addresses. Common options include AWS Direct Connect or a VPN tunnel.

Required (minimum working setup):

  • An on-premises tunnel host running autossh to establish the outbound SSH connection.
  • One proxy VM in the cloud running socat to accept the tunnel and expose the resource port on its network interface.
  • A network path from Databricks to the proxy VM:
    • Classic compute: peering between the Databricks VPC and the proxy hub VPC.
    • Serverless compute: a private endpoint connection using AWS PrivateLink and a Network Connectivity Configuration (NCC) with a private endpoint rule.
    • Both compute types: configure both paths.

A single proxy VM without a load balancer is sufficient for development and testing.
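
For such a dev/test setup, the two required processes can be as simple as the following sketch (host names, ports, and the tunnel user are hypothetical):

Bash
# On the on-premises tunnel host: autossh keeps the reverse tunnel alive.
# -M 0 disables autossh's monitor channel in favor of SSH keepalives.
autossh -M 0 -N \
  -o "ServerAliveInterval 10" -o "ServerAliveCountMax 3" \
  -o "ExitOnForwardFailure yes" \
  -R 13306:mysql.corp.internal:3306 \
  tunnel@<proxy-vm-private-ip>

# On the proxy VM: socat bridges the resource port on the NIC (3306) to
# the tunnel listener (localhost:13306) so Databricks traffic reaches it.
socat TCP-LISTEN:3306,fork,reuseaddr TCP:127.0.0.1:13306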

Optional (adds high availability and production robustness):

  • Additional proxy VMs for redundancy.
  • A load balancer fronting the proxy VMs. Provides a stable endpoint and automatic failover when a tunnel fails.
  • An HTTP health check service on each proxy VM. This allows the load balancer to detect tunnel-level failures, not only VM or port availability.

Databricks recommends this configuration, but adapt it to your environment and security requirements.

Tunnel proxy hub

The proxy hub consists of the following components.

  • Proxy VMs (at least two for high availability). Each proxy VM runs three services:
    • sshd: The SSH daemon accepts incoming reverse tunnels from the on-premises tunnel host and places the tunnel listener on localhost (for example, localhost:13306 for MySQL).
    • socat: Bridges the VM's network interface to the tunnel listener (for example, NIC:3306 → localhost:13306), so traffic from Databricks can reach the tunnel endpoint.
    • HTTP health check service: Returns HTTP 200 when the tunnel is active and HTTP 503 when it is not. A plain TCP probe only detects whether socat is listening; an application-level HTTP probe detects a dead tunnel even when socat is still running.
  • Load balancer:
    • Frontend: Private IP in the proxy subnet.
    • Backend pool: All proxy VMs.
    • Load balancing rule: TCP on the resource port (for example, port 3306 for MySQL).
    • Health probe: HTTP GET against the health check endpoint on each proxy VM. A recommended starting point is a 5-second interval with 2 consecutive failures to mark a VM unhealthy — tune to your recovery tolerance.
  • VPC Endpoint Service (required for serverless compute): Attached to the load balancer so that serverless compute can connect through the NCC private endpoint rule.
  • On-premises tunnel host: Runs one autossh process per proxy VM. A single autossh connection supports multiple -R port forwards (one SSH connection, multiple tunnels) for multi-resource setups. Use systemd services with Restart=always. Interactive tunnels terminate at logoff and are not suitable for production.
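
As a concrete sketch of the tunnel-host side, a systemd unit along the following lines keeps one autossh process per proxy VM running; the paths, host names, and tunnel user are hypothetical:

Bash
# Install a systemd unit (one per proxy VM) so the tunnel survives
# reboots and logoffs; Restart=always restarts autossh if it exits.
sudo tee /etc/systemd/system/tunnel-proxy1.service >/dev/null <<'EOF'
[Unit]
Description=Reverse SSH tunnel to proxy VM 1
After=network-online.target
Wants=network-online.target

[Service]
User=tunnel
# Retry even if the first connection attempt fails immediately.
Environment=AUTOSSH_GATETIME=0
ExecStart=/usr/bin/autossh -M 0 -N \
    -o "ServerAliveInterval 10" -o "ServerAliveCountMax 3" \
    -o "ExitOnForwardFailure yes" \
    -R 13306:mysql.corp.internal:3306 \
    tunnel@<proxy-vm-1-private-ip>
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now tunnel-proxy1.service

The health check service can be a small script that probes the tunnel listener and answers with the matching HTTP status. One way to sketch it (the script path, probe port 8080, and tunnel port 13306 are assumptions):

Bash
#!/usr/bin/env bash
# Hypothetical /usr/local/bin/tunnel-health.sh. Returns HTTP 200 when the
# tunnel listener on localhost:13306 accepts a TCP connection, 503 when it
# does not. Serve it with:
#   socat TCP-LISTEN:8080,fork,reuseaddr EXEC:/usr/local/bin/tunnel-health.sh
if timeout 1 bash -c 'exec 3<>/dev/tcp/127.0.0.1/13306' 2>/dev/null; then
  body="OK";   status="200 OK"
else
  body="DOWN"; status="503 Service Unavailable"
fi
printf 'HTTP/1.1 %s\r\nContent-Length: %s\r\nConnection: close\r\n\r\n%s' \
  "$status" "${#body}" "$body"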

Configure Databricks

Create a connection to your on-premises resource using the load balancer frontend IP.

note

The following examples use MySQL. For other databases, substitute the connection type, port, and JDBC driver Maven coordinate.

Before connecting, confirm that the network path is active: for classic compute, peering between the proxy hub VPC and your Databricks workspace VPC; for serverless compute, the NCC private endpoint connection. Then use the load balancer frontend private IP in your connection configuration:

SQL
CREATE CONNECTION mysql_onprem TYPE mysql
OPTIONS (
  host '<lb-frontend-ip>',
  port '3306',
  user '<db-user>',
  password '<db-password>'
);

CREATE FOREIGN CATALOG onprem_catalog
USING CONNECTION mysql_onprem
OPTIONS (database '<db-name>');

In Shared access mode, queries fail because classloader isolation prevents executors from accessing Maven-based JDBC drivers; use Single User access mode so that the driver is available across the entire cluster. Before creating the connection, add the JDBC driver to the Unity Catalog allowlist:

SQL
ALTER METASTORE ADD ALLOWLIST maven ('mysql:mysql-connector-java:8.0.33');
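
Once the connection and foreign catalog exist, a quick end-to-end check is to query a table through the catalog (the schema and table names are placeholders):

SQL
-- Confirms that traffic flows Databricks → load balancer → tunnel → on-premises database.
SELECT * FROM onprem_catalog.<db-name>.<table-name> LIMIT 10;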

High availability and pipeline resilience

With two or more proxy VMs in the backend pool, a tunnel failure on a single instance does not interrupt service. If a tunnel fails, the health check service returns HTTP 503. The load balancer then stops routing new connections to that VM within approximately 10 seconds.

note

Use the load balancer frontend IP in your connection string, not the IP of an individual proxy VM. If a tunnel drops, the load balancer automatically reroutes traffic without manual intervention or pipeline downtime.

| Failure scenario | Without application health check | With application health check |
| --- | --- | --- |
| Proxy VM stops responding | Load balancer detects → failover | Load balancer detects → failover |
| Port forwarder stops | Load balancer detects → failover | Load balancer detects → failover |
| SSH tunnel fails, port forwarder running | Load balancer can't detect → intermittent failures | Load balancer detects (HTTP 503) → failover |

For Lakeflow Connect CDC pipelines, set binary log retention to at least 7 days. If the CDC gateway is offline when the binary logs expire, the pipeline must perform a full re-snapshot of all source tables.
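
For example, on a MySQL 8.0 source you can set binary log retention to 7 days (604800 seconds):

SQL
-- Retain binary logs for 7 days so a temporarily offline CDC gateway can
-- resume without a full re-snapshot.
SET GLOBAL binlog_expire_logs_seconds = 604800;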

Known limitations

This solution has the following limitations.

  • Each on-premises resource requires a distinct port mapping on each proxy VM. For multiple resources of the same type that share a default port, expose them on different ports on the proxy VM's network interface (for example, 3306, 3307, and 3308) or use separate proxy VMs.
  • You must provision and maintain the proxy VMs and the on-premises tunnel host.
  • Depending on your subnet and NAT configuration, VMs in the load balancer backend pool might lack outbound internet connectivity. Install required packages before adding VMs to the pool.
  • The Test Connection button in the Databricks UI does not work for private endpoint connections.
  • In Shared access mode, Maven JDBC libraries are only available on the driver node. Use Single User access mode for JDBC workloads.

Next steps