Ingest files from SFTP servers
The SFTP connector is in Public Preview.
Learn how to ingest files from SFTP servers using Lakeflow Connect. The SFTP connector extends Auto Loader functionality to provide secure, incremental ingestion from SFTP servers with Unity Catalog governance.
Key features
The SFTP connector offers the following:
- Private key and password-based authentication.
- Incremental file ingestion and processing with exactly-once guarantees.
- Automatic schema inference, evolution, and data rescue.
- Unity Catalog governance for secure ingestion and credentials.
- Wide file format support: JSON, CSV, XML, PARQUET, AVRO, TEXT, BINARYFILE, and ORC.
- Built-in support for pattern and wildcard matching to easily target data subsets.
- Availability on all compute types, including Lakeflow Spark Declarative Pipelines, Databricks SQL, and serverless and classic compute with Databricks Runtime 17.3 and above.
Before you begin
To create the connection and the ingestion pipeline, you must have the following:
- A workspace with Unity Catalog enabled.
- CREATE CONNECTION privileges.
- Compute that uses Databricks Runtime version 17.3 or above.
Configure SFTP
First, confirm that the source SFTP server is accessible to your Databricks cluster environment:
- Make sure that the remote server is reachable from the VPC that is configured for your workspace.
- Make sure that your SSH rules allow for the IP range of the Databricks VPC (if you're using classic compute) or the stable IPs (if you're using serverless compute).
- From the classic compute plane, set up a stable IP address with a load balancer, NAT gateway, internet gateway, or equivalent, and connect it to the subnet where Databricks compute is deployed. This allows the compute resource to share a stable public IP address that can be allowlisted on the SFTP server side. For instructions to configure network settings, see VPC peering.
- From the serverless compute plane, see Step 1: Create a network connectivity configuration and copy the stable IPs to obtain stable egress IPs.
Create the connection
Create a Unity Catalog connection to store your SFTP credentials. You must have CREATE CONNECTION privileges.
The connector supports the following authentication methods:
- PEM private key
- Password-based authentication
Databricks recommends using PEM private key authentication. Databricks also recommends using credentials with the least privilege on the source SFTP server (for example, a non-root user limited to read-only access).
When you create the pipeline, the connector attempts to automatically find a connection that you can use and that matches the host. If there are multiple matching connections, the connector chooses the first connection that successfully connects to the host. However, Databricks recommends that you explicitly specify the user. This ensures that the connector doesn't choose a connection for another user with host access.
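One way to make the user explicit is to include it in the SFTP path that your read uses, as the examples later in this article do. The following is a minimal sketch only; the host, port, user, and paths are placeholders:

# Minimal sketch: the user embedded in the sftp:// URI makes it unambiguous which
# user's connection the connector should match (all values are placeholders).
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  .load("sftp://ingest_user@sftp.example.com:22/home/ingest_user/source"))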
PEM private key (recommended)
- Catalog Explorer
- SQL
- In the Databricks workspace, click Catalog > External Data > Connections.
- Click Create connection.
- On the Connection basics page of the Set up connection wizard, enter a unique Connection name.
- For Connection type, select SFTP.
- For Auth type, select PEM Private Key.
- Click Next.
- On the Authentication page, for Host, enter the host name of the foreign server.
- For User, enter the user identity used to access the foreign instance.
- Click Next.
- On the Connection details page, enter the private key in PEM format. Also provide the key passphrase, if applicable.
- If you want to skip the host key fingerprint check, deselect Enforce host key fingerprint.
  When this is selected, the connection only proceeds if the server's public key matches the expected SHA-256 fingerprint. When it's deselected, the connection proceeds regardless of the match. Check with your network administrator before you disable this.
- If Enforce host key fingerprint is selected, enter the SFTP server fingerprint.
  You can retrieve the fingerprint from your server administrator or by using CLI commands. You can also click Test and create connection > Test. The resulting error message provides the fingerprint. For example:
  ECDSA key fingerprint is SHA256:XXX/YYY
- Click Test and create connection.
- If the connection is successful, click Create.
-- Create a connection using a username and SSH private key.
CREATE CONNECTION my_sftp_connection
TYPE sftp
OPTIONS (
  host 'my.sftpserver.com',
  -- The following credentials can also be provided inline, but Databricks recommends
  -- accessing them through a secret scope.
  user secret('my_secret_scope', 'my_sftp_username'),
  pem_private_key secret('my_secret_scope', 'my_sftp_private_key'),
  -- Port for the host.
  port '22',
  -- Passphrase for the private key (optional).
  pem_key_passphrase secret('my_secret_scope', 'my_sftp_private_key_passphrase'),
  -- SFTP server fingerprint. You can retrieve this from your server administrator or by using CLI commands.
  key_fingerprint 'SHA256:ASampleFingerprintValueZy...'
);
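The example above reads the credentials from a secret scope rather than inlining them. If the scope doesn't exist yet, the following is a minimal sketch of creating it and storing the referenced secrets with the Databricks SDK for Python; the scope name, key names, and file path are placeholders and must match the secret() references in your connection:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Placeholder scope and key names; align them with the secret() calls in the connection.
w.secrets.create_scope(scope="my_secret_scope")
w.secrets.put_secret(scope="my_secret_scope", key="my_sftp_username", string_value="ingest_user")
with open("/path/to/key.pem") as f:  # placeholder path to the PEM private key
    w.secrets.put_secret(scope="my_secret_scope", key="my_sftp_private_key", string_value=f.read())
w.secrets.put_secret(scope="my_secret_scope", key="my_sftp_private_key_passphrase", string_value="<passphrase>")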
Password-based authentication
- Catalog Explorer
- SQL
- In the Databricks workspace, click Catalog > External Data > Connections.
- Click Create connection.
- On the Connection basics page of the Set up connection wizard, enter a unique Connection name.
- For Connection type, select SFTP.
- For Auth type, select Username and password.
- Click Next.
- On the Authentication page, for Host, enter the host name of the foreign server.
- For User, enter the user identity used to access the foreign instance.
- For Password, enter the password of the foreign instance.
- Click Next.
- If you want to skip the host key fingerprint check, deselect Enforce host key fingerprint.
  When this is selected, the connection only proceeds if the server's public key matches the expected SHA-256 fingerprint. When it's deselected, the connection proceeds regardless of the match. Check with your network administrator before you disable this.
- If Enforce host key fingerprint is selected, enter the SFTP server fingerprint.
  You can retrieve the fingerprint from your server administrator or by using CLI commands. You can also click Test and create connection > Test. The resulting error message provides the fingerprint. For example:
  ECDSA key fingerprint is SHA256:XXX/YYY
- Click Test and create connection.
- If the connection is successful, click Create.
-- Create a connection using a username and password.
CREATE CONNECTION my_sftp_connection
TYPE sftp
OPTIONS (
  host 'my.sftpserver.com',
  user secret('my_secret_scope', 'my_sftp_username'),
  password secret('my_secret_scope', 'my_sftp_password'),
  -- Port for the host.
  port '22',
  -- SFTP server fingerprint. You can retrieve this from your server administrator or by using CLI commands.
  key_fingerprint 'SHA256:ASampleFingerprintValueZy...'
);
Read files from the SFTP server
The following example shows how to read files from an SFTP server using Auto Loader's streaming capabilities. For details about Auto Loader usage, see Common data loading patterns.
# Run the Auto Loader job to ingest all existing data in the SFTP server.
query = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")  # This is a cloud storage path.
  .option("cloudFiles.format", "csv")  # Or another format supported by Auto Loader.
  # $absolute_path_to_files should be the full server-side file system path
  # relative to the root /, for example /home/$username/subdir/source/…
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "<path to store checkpoint information>")  # This is a cloud storage path.
  .trigger(availableNow=True)
  .table("<table name>"))

query.awaitTermination()
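Schema inference, evolution, and data rescue work the same way as in any Auto Loader stream. The following is a hedged sketch of a read that pins a couple of column types, lets new columns evolve the schema, and captures non-conforming data in a rescue column; the option values, paths, and table name are placeholders:

query = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  # Evolve the table schema when new columns appear in the source files.
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  # Pin types for known columns; everything else is inferred.
  .option("cloudFiles.schemaHints", "id BIGINT, event_time TIMESTAMP")
  # Rows that don't match the schema are captured here instead of failing the stream.
  .option("rescuedDataColumn", "_rescued_data")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files")
  .writeStream
  .option("checkpointLocation", "<path to store checkpoint information>")
  .trigger(availableNow=True)
  .table("<table name>"))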
The following examples show how to read files from an SFTP server using Auto Loader in Lakeflow Spark Declarative Pipelines:
- Python
- SQL
import dlt

@dlt.table
def sftp_bronze_table():
  return (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")  # Or another format supported by Auto Loader.
    # $absolute_path_to_files should be the full server-side file system path
    # relative to the root /, for example /home/$username/subdir/source/…
    .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files"))
CREATE OR REFRESH STREAMING TABLE sftp_bronze_table
AS SELECT * FROM STREAM read_files(
"sftp://$your_user@$your_host:$your_port/$absolute_path_to_files",
format => "csv"
)
Configure Auto Loader options. All options are supported except:
- cloudFiles.useNotifications
- cloudFiles.useManagedFileEvents
- cloudFiles.cleanSource
- Cloud-specific options
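For example, the following is a hedged sketch of a read that sets a couple of the supported options; the values are illustrative only:

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  # Cap how many files are processed in each micro-batch.
  .option("cloudFiles.maxFilesPerTrigger", "1000")
  # Skip files that already exist on the server when the stream first starts.
  .option("cloudFiles.includeExistingFiles", "false")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files"))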
Limitations
- SFTP is not supported across other ingestion surfaces, including COPY INTO, spark.read, and dbutils.fs.ls.
- Writing back to an SFTP server is not supported.
- Auto Loader cleanSource (deleting or archiving files at the source after ingestion) is not supported.
- The FTP protocol is not supported.
FAQ
Find answers to frequently asked questions about the SFTP connector.
How do I use wildcards or file name patterns to select files to ingest?
The SFTP connector builds on the standard Auto Loader framework to read from SFTP servers. This means that all Auto Loader options are supported. For file name patterns and wildcards, use the pathGlobFilter or fileNamePattern options. See Auto Loader options.
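For example, the following is a hedged sketch that limits ingestion to file names matching a glob pattern; the pattern and paths are placeholders:

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  # Only ingest files whose names match this glob, for example orders_2024-01.csv.
  .option("pathGlobFilter", "orders_*.csv")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files"))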
Can the SFTP connector ingest encrypted files? (Is PGP supported?)
The connector doesn't decrypt in flight, but you can ingest the encrypted files as binary files and decrypt them after ingestion.
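For example, the following is a hedged sketch that lands the encrypted files as raw bytes in a bronze table and leaves decryption (for example, with a PGP library of your choice) to a downstream step; the paths and table name are placeholders:

query = (spark.readStream.format("cloudFiles")
  # Each file is ingested as a single row containing its path, metadata, and raw content bytes.
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files")
  .writeStream
  .option("checkpointLocation", "<path to store checkpoint information>")
  .trigger(availableNow=True)
  .table("<encrypted bronze table name>"))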
How do I handle incompatible private key formats?
Only PEM format is supported. You can generate a private key in PEM format by doing one of the following:
- (Option 1) Create a new RSA key in the standard PEM format:
  % ssh-keygen -t rsa -m pem
- (Option 2) Convert the existing OpenSSH-format key to PEM format:
  % ssh-keygen -p -m pem -f /path/to/key # This updates the key file.