Connect to on-premises resources using an SSH reverse tunnel
Connect your on-premises resources to Databricks without opening inbound firewall access. An on-premises tunnel host opens an outbound SSH connection to proxy virtual machines (VMs) in AWS, allowing Databricks classic and serverless compute to reach your on-premises resources.
How it works
An SSH reverse tunnel lets an on-premises tunnel host open outbound SSH connections to cloud proxy VMs in AWS. Databricks connects to the proxy VMs through a load balancer, and traffic flows back through the tunnel to the on-premises resource. The on-premises network requires only outbound SSH (port 22) to AWS, so no incoming ports are required.
In a reverse tunnel, the on-premises host initiates the connection outbound (on-premises to cloud) to avoid relaxing firewall restrictions, and the return traffic flows back (cloud to on-premises) over the established path.
Classic compute reaches the proxy VMs through peering. Serverless compute reaches them through a private endpoint connection using your cloud provider's private connectivity service.
This is a self-managed solution. You provision and maintain the proxy VMs and the on-premises tunnel host.
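As a concrete sketch of the outbound connection, the on-premises tunnel host might establish the reverse tunnel with a command like the following. All values (proxy IP, database host, tunnel user, and the listener port 13306) are hypothetical placeholders; the command is echoed for review rather than run.

```shell
# Hypothetical values -- substitute your proxy VM IP and on-premises host.
PROXY_VM="10.0.1.10"           # proxy VM private IP in AWS
ONPREM_DB="db01.corp.local"    # on-premises MySQL host

# -R places a listener on the proxy VM at localhost:13306; connections to
# it are forwarded back through the tunnel to the on-premises database on
# port 3306. Echoed here for review; drop the echo to run it.
CMD="autossh -M 0 -N -o 'ServerAliveInterval 30' -o 'ServerAliveCountMax 3' -R 13306:${ONPREM_DB}:3306 tunnel@${PROXY_VM}"
echo "$CMD"
```

The keepalive options let the SSH client detect a dead path and exit, so `autossh` (or systemd) can re-establish the tunnel.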
Required and optional components
This setup requires a dedicated network circuit between your cloud environment and your on-premises network. This circuit allows the on-premises tunnel host to initiate outbound SSH to the proxy VMs' private IP addresses. Common options include AWS Direct Connect or a VPN tunnel.
Required (minimum working setup):
- An on-premises tunnel host running `autossh` to establish the outbound SSH connection.
- One proxy VM in the cloud running `socat` to accept the tunnel and expose the resource port on its network interface.
- A network path from Databricks to the proxy VM:
  - Classic compute: peering between the Databricks VPC and the proxy hub VPC.
  - Serverless compute: a private endpoint connection using your cloud provider's private connectivity service and a Network Connectivity Configuration (NCC) with a private endpoint rule.
  - Both compute types: configure both paths.
A single proxy VM without a load balancer is sufficient for development and testing.
Optional (adds high availability and production robustness):
- Additional proxy VMs for redundancy.
- A load balancer fronting the proxy VMs. Provides a stable endpoint and automatic failover when a tunnel fails.
- An HTTP health check service on each proxy VM. This allows the load balancer to detect tunnel-level failures, not only VM or port availability.
Databricks recommends this configuration, but adapt it to your environment and security requirements.
Tunnel proxy hub
The proxy hub consists of the following components.
- Proxy VMs (at least two for high availability). Each proxy VM runs three services:
  - sshd: The SSH daemon accepts incoming reverse tunnels from the on-premises tunnel host and places the tunnel listener on `localhost` (for example, `localhost:13306` for MySQL).
  - socat: Bridges the VM's network interface to the tunnel listener (for example, `NIC:3306 → localhost:13306`), so traffic from Databricks can reach the tunnel endpoint.
  - HTTP health check service: Returns HTTP 200 when the tunnel is active and HTTP 503 when it is not. A plain TCP probe only detects whether socat is listening; an application-level HTTP probe detects a dead tunnel even when socat is still running.
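The socat bridge on each proxy VM can be sketched as follows. The NIC address is a hypothetical placeholder; the command is echoed for review rather than run.

```shell
# Hypothetical NIC address -- substitute the proxy VM's private IP.
NIC_IP="10.0.1.10"

# Listen on the VM's network interface on the resource port (3306), fork a
# handler per client, and forward each connection to the SSH tunnel
# listener on localhost:13306. Echoed for review; drop the echo to run it.
SOCAT_CMD="socat TCP-LISTEN:3306,bind=${NIC_IP},fork,reuseaddr TCP:127.0.0.1:13306"
echo "$SOCAT_CMD"
```

`fork` lets socat serve concurrent connections, and `reuseaddr` lets the service restart without waiting out the TIME_WAIT state.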
- Load balancer:
- Frontend: Private IP in the proxy subnet.
- Backend pool: All proxy VMs.
- Load balancing rule: TCP on the resource port (for example, port 3306 for MySQL).
- Health probe: HTTP GET against the health check endpoint on each proxy VM. A recommended starting point is a 5-second interval with 2 consecutive failures to mark a VM unhealthy — tune to your recovery tolerance.
- Private connectivity service (required for serverless compute): Attached to the load balancer frontend with a dedicated NAT subnet. AWS uses a VPC Endpoint Service.
- On-premises tunnel host: Runs one `autossh` process per proxy VM. A single `autossh` connection supports multiple `-R` port forwards (one SSH connection, multiple tunnels) for multi-resource setups. Use systemd services with `Restart=always`; interactive tunnels terminate at logoff and are not suitable for production.
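The systemd approach might look like the following unit file. This is a sketch: the unit name, user, paths, host names, and addresses are all hypothetical and must be adapted to your environment.

```ini
# /etc/systemd/system/tunnel-proxy1.service -- hypothetical example
[Unit]
Description=Reverse SSH tunnel to proxy VM 1
Wants=network-online.target
After=network-online.target

[Service]
User=tunnel
# -M 0 disables autossh's monitor port; SSH keepalives detect dead tunnels.
# ExitOnForwardFailure makes a failed -R forward fatal so systemd restarts.
ExecStart=/usr/bin/autossh -M 0 -N \
    -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
    -o "ExitOnForwardFailure yes" \
    -R 13306:db01.corp.local:3306 tunnel@10.0.1.10
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now tunnel-proxy1.service`, and run one such unit per proxy VM.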
Configure Databricks
Create a connection to your on-premises resource using the load balancer frontend IP. Select the tab for your compute type.
The following examples use MySQL. For other databases, substitute the connection type, port, and JDBC driver Maven coordinate.
- Classic compute
- Serverless compute
- Lakeflow Connect CDC
Before connecting, confirm that peering between the proxy hub VPC and your Databricks workspace VPC is active. Then use the load balancer frontend private IP in your connection configuration:
CREATE CONNECTION mysql_onprem TYPE mysql
OPTIONS (
host '<lb-frontend-ip>',
port '3306',
user '<db-user>',
password '<db-password>'
);
CREATE FOREIGN CATALOG onprem_catalog
USING CONNECTION mysql_onprem
OPTIONS (database '<db-name>');
Queries fail in Shared access mode because classloader isolation prevents executors from accessing Maven-based JDBC drivers. Use Single User access mode so the driver is available across the entire cluster. Before creating the connection, add the JDBC driver to the Unity Catalog allowlist:
ALTER METASTORE ADD ALLOWLIST maven ('mysql:mysql-connector-java:8.0.33');
- As an account administrator, go to the account console.
- In the sidebar, click Security.
- Click Network connectivity configurations and create an NCC for your workspace region.
- In the NCC, add a private endpoint rule and enter the service resource ID.
- Attach the NCC to your workspace and wait 10–15 minutes for propagation.
- Approve the private endpoint connection on the private connectivity service:

  aws ec2 accept-vpc-endpoint-connections \
    --service-id <endpoint-service-id> \
    --vpc-endpoint-ids <endpoint-id>

- Create the connection using the private endpoint domain:

  CREATE CONNECTION mysql_onprem_serverless TYPE mysql
  OPTIONS (
    host '<pe-domain>',
    port '3306',
    user '<db-user>',
    password '<db-password>'
  );
The Test Connection button in the Databricks UI does not work for private endpoint connections. Skip it and create the connection directly.
The Lakeflow Connect gateway runs on classic compute in your workspace VPC and reaches the proxy VMs through peering. Use the load balancer frontend private IP as the connection host, not the IP of an individual proxy VM. See High availability and pipeline resilience.
Before creating a CDC pipeline, complete the following steps for your database engine:
-
Enable change logging: Configure your database to log row-level changes (for example, binary logging in MySQL, logical replication in PostgreSQL, or supplemental logging in Oracle).
-
Grant replication permissions: Provide the pipeline user with the permissions required to read change logs and perform snapshots. Refer to the connector documentation for your specific database.
-
Set log retention: Configure log retention to at least seven days. If the CDC gateway is offline when logs expire, the pipeline must perform a full re-snapshot of all source tables.
For engine-specific configuration, see the Lakeflow Connect connector documentation.
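For MySQL, the three prerequisites above might look like the following statements. This is a sketch, not engine documentation: `<pipeline-user>` is a placeholder, and the exact privilege set depends on your connector version, so verify against the connector documentation.

```sql
-- MySQL example (sketch). Binary logging (log_bin) is on by default in
-- MySQL 8.0; row-based format is required for CDC.
SET PERSIST binlog_format = 'ROW';

-- Retain binary logs for 7 days (604800 seconds).
SET PERSIST binlog_expire_logs_seconds = 604800;

-- Grant the pipeline user snapshot and replication-read privileges.
-- '<pipeline-user>' is a placeholder for your pipeline's database user.
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE
  ON *.* TO '<pipeline-user>'@'%';
```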
High availability and pipeline resilience
With two or more proxy VMs in the backend pool, a tunnel failure on a single instance does not interrupt service. If a tunnel fails, the health check service returns HTTP 503. The load balancer then stops routing new connections to that VM within approximately 10 seconds.
Use the load balancer frontend IP in your connection string, not the IP of an individual proxy VM. If a tunnel drops, the load balancer automatically reroutes traffic without manual intervention or pipeline downtime.
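The tunnel-level check behind the HTTP health endpoint can be sketched as a small script. This is a sketch under stated assumptions: the tunnel listener port (13306) is hypothetical, and the HTTP wrapper (for example, a systemd socket unit or xinetd mapping the exit code to 200/503) is not shown.

```shell
# Returns success when something answers on the tunnel listener port,
# failure otherwise. An HTTP wrapper can map this to HTTP 200 / HTTP 503
# for the load balancer probe.
check_tunnel() {
  # Bash's /dev/tcp pseudo-device attempts a real TCP connection; the
  # subshell closes the descriptor on exit and its status is returned.
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Hypothetical tunnel listener port for a MySQL tunnel.
if check_tunnel 13306; then
  echo "200 tunnel up"
else
  echo "503 tunnel down"
fi
```

Because the check connects through the tunnel listener itself, it fails when the SSH session is gone even if socat is still running, which is exactly the case a plain TCP probe misses.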
| Failure scenario | Without application health check | With application health check |
|---|---|---|
| Proxy VM stops responding | Load balancer detects → failover | Load balancer detects → failover |
| Port forwarder stops | Load balancer detects → failover | Load balancer detects → failover |
| SSH tunnel fails, port forwarder running | Load balancer can't detect → intermittent failures | Load balancer detects (HTTP 503) → failover |
For Lakeflow Connect CDC pipelines, set binary log retention to at least 7 days. If the CDC gateway is offline when the binary logs expire, the pipeline must perform a full re-snapshot of all source tables.
Known limitations
This solution has the following limitations.
- Each on-premises resource requires a distinct port mapping on each proxy VM. For multiple resources of the same type on the same default port, use different ports on the proxy VM's network interface (for example, 3306, 3307, or 3308) or use separate proxy VMs.
- You must provision and maintain the proxy VMs and the on-premises tunnel host.
- A load balancer blocks default outbound internet connectivity for VMs in the backend pool. Install required packages before adding VMs to the pool.
- The Test Connection button in the Databricks UI does not work for private endpoint connections.
- In Shared access mode, Maven JDBC libraries are only available on the driver node. Use Single User access mode for JDBC workloads.
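The distinct-port-mapping requirement can be sketched for two MySQL instances behind one proxy VM over a single SSH connection. Host names, ports, and addresses are hypothetical; the command is echoed for review rather than run.

```shell
# Two on-premises MySQL instances, one proxy VM, one SSH connection:
#   db01.corp.local:3306 -> tunnel listener 13306 -> socat on NIC:3306
#   db02.corp.local:3306 -> tunnel listener 13307 -> socat on NIC:3307
FORWARDS="-R 13306:db01.corp.local:3306 -R 13307:db02.corp.local:3306"
echo "autossh -M 0 -N ${FORWARDS} tunnel@10.0.1.10"
```

Each resource then needs its own socat bridge and its own load balancing rule on the corresponding NIC port.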