Ingest files from SFTP servers
The SFTP connector is in Public Preview.
Learn how to ingest files from SFTP servers using Lakeflow Connect. The SFTP connector extends Auto Loader functionality to provide secure, incremental ingestion from SFTP servers with Unity Catalog governance.
Key features
The SFTP connector offers the following:
- Private key and password-based authentication.
- Incremental file ingestion and processing with exactly-once guarantees.
- Automatic schema inference, evolution, and data rescue.
- Unity Catalog governance for secure ingestion and credentials.
- Wide file format support: JSON, CSV, XML, PARQUET, AVRO, TEXT, BINARYFILE, and ORC.
- Built-in support for pattern and wildcard matching to easily target data subsets.
- Availability on all compute types, including Lakeflow Spark Declarative Pipelines, Databricks SQL, and serverless and classic compute with Databricks Runtime 17.3 and above.
Before you begin
To create the connection and the ingestion pipeline, you must have the following:
- A workspace with Unity Catalog enabled.
- CREATE CONNECTION privileges.
- Compute that uses Databricks Runtime version 17.3 or above.
Configure SFTP
First, confirm that the source SFTP server is accessible to your Databricks cluster environment:
- Make sure that the remote server is reachable from the VPC that is configured for your workspace.
- Make sure that your SSH rules allow for the IP range of the Databricks VPC (if you're using classic compute) or the stable IPs (if you're using serverless compute).
- From the classic compute plane, set up a stable IP address with a load balancer, NAT gateway, internet gateway, or equivalent, and connect it to the subnet where Databricks compute is deployed. This allows the compute resource to share a stable public IP address that can be allowlisted on the SFTP server side. For instructions to configure network settings, see VPC peering.
- From the serverless compute plane, see Step 1: Create a network connectivity configuration and copy the stable IPs to obtain stable egress IPs.
Create the connection
Create a Unity Catalog connection to store your SFTP credentials. You must have CREATE CONNECTION privileges.
The connector supports the following authentication methods:
- PEM private key
- Password-based authentication
Databricks recommends using PEM private key authentication. Databricks also recommends using credentials with the least privilege on the source SFTP server (for example, a non-root user limited to read-only access).
When you create the pipeline, the connector attempts to automatically find a connection that you can use and that matches the host. If there are multiple matching connections, the connector chooses the first connection that successfully connects to the host. However, Databricks recommends that you explicitly specify the user. This ensures that the connector doesn't choose a connection for another user with host access.
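One way to make the user explicit is to include it in the SFTP path that your read uses, as the examples later in this article do. The following is a minimal sketch only; the host, port, user, and paths are placeholders:

# Minimal sketch: the user embedded in the sftp:// URI makes it unambiguous which
# user's connection the connector should match (all values are placeholders).
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  .load("sftp://ingest_user@sftp.example.com:22/home/ingest_user/source"))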
PEM private key (recommended)
- Catalog Explorer
- SQL
- In the Databricks workspace, click Catalog > External Data > Connections.
- Click Create connection.
- On the Connection basics page of the Set up connection wizard, enter a unique Connection name.
- For Connection type, select SFTP.
- For Auth type, select PEM Private Key.
- Click Next.
- On the Authentication page, for Host, enter the host name of the foreign server.
- For User, enter the user identity used to access the foreign instance.
- Click Next.
- On the Connection details page, enter the private key in PEM format. Also provide the key passphrase, if applicable.
- If you want to skip the host key fingerprint check, deselect Enforce host key fingerprint.
  When this is selected, the connection only proceeds if the server's public key matches the expected SHA-256 fingerprint. When it's deselected, the connection proceeds regardless of the match. Check with your network administrator before you disable this.
- If Enforce host key fingerprint is selected, enter the SFTP server fingerprint.
  You can retrieve the fingerprint from your server administrator or by using CLI commands. You can also click Test and create connection > Test. The resulting error message provides the fingerprint. For example:
  ECDSA key fingerprint is SHA256:XXX/YYY
- Click Test and create connection.
- If the connection is successful, click Create.
-- Create a connection using a username and SSH private key.
CREATE CONNECTION my_sftp_connection
TYPE sftp
OPTIONS (
  host 'my.sftpserver.com',
  -- The following credentials can also be provided inline, but Databricks recommends
  -- accessing them through a secret scope.
  user secret('my_secret_scope', 'my_sftp_username'),
  pem_private_key secret('my_secret_scope', 'my_sftp_private_key'),
  -- Port for the host.
  port '22',
  -- Passphrase for the private key (optional).
  pem_key_passphrase secret('my_secret_scope', 'my_sftp_private_key_passphrase'),
  -- SFTP server fingerprint. You can retrieve this from your server administrator or by using CLI commands.
  key_fingerprint 'SHA256:ASampleFingerprintValueZy...'
);
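The example above reads the credentials from a secret scope rather than inlining them. If the scope doesn't exist yet, the following is a minimal sketch of creating it and storing the referenced secrets with the Databricks SDK for Python; the scope name, key names, and file path are placeholders and must match the secret() references in your connection:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Placeholder scope and key names; align them with the secret() calls in the connection.
w.secrets.create_scope(scope="my_secret_scope")
w.secrets.put_secret(scope="my_secret_scope", key="my_sftp_username", string_value="ingest_user")
with open("/path/to/key.pem") as f:  # placeholder path to the PEM private key
    w.secrets.put_secret(scope="my_secret_scope", key="my_sftp_private_key", string_value=f.read())
w.secrets.put_secret(scope="my_secret_scope", key="my_sftp_private_key_passphrase", string_value="<passphrase>")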
Password-based authentication
- Catalog Explorer
- SQL
- In the Databricks workspace, click Catalog > External Data > Connections.
- Click Create connection.
- On the Connection basics page of the Set up connection wizard, enter a unique Connection name.
- For Connection type, select SFTP.
- For Auth type, select Username and password.
- Click Next.
- On the Authentication page, for Host, enter the host name of the foreign server.
- For User, enter the user identity used to access the foreign instance.
- For Password, enter the password of the foreign instance.
- Click Next.
- If you want to skip the host key fingerprint check, deselect Enforce host key fingerprint.
  When this is selected, the connection only proceeds if the server's public key matches the expected SHA-256 fingerprint. When it's deselected, the connection proceeds regardless of the match. Check with your network administrator before you disable this.
- If Enforce host key fingerprint is selected, enter the SFTP server fingerprint.
  You can retrieve the fingerprint from your server administrator or by using CLI commands. You can also click Test and create connection > Test. The resulting error message provides the fingerprint. For example:
  ECDSA key fingerprint is SHA256:XXX/YYY
- Click Test and create connection.
- If the connection is successful, click Create.
-- Create a connection using a username and password.
CREATE CONNECTION my_sftp_connection
TYPE sftp
OPTIONS (
  host 'my.sftpserver.com',
  user secret('my_secret_scope', 'my_sftp_username'),
  password secret('my_secret_scope', 'my_sftp_password'),
  -- Port for the host.
  port '22',
  -- SFTP server fingerprint. You can retrieve this from your server administrator or by using CLI commands.
  key_fingerprint 'SHA256:ASampleFingerprintValueZy...'
);
Read files from the SFTP server
The following example shows how to read files from an SFTP server using Auto Loader's streaming capabilities. For details about Auto Loader usage, see Common data loading patterns.
# Run the Auto Loader job to ingest all existing data in the SFTP server.
query = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")  # This is a cloud storage path.
  .option("cloudFiles.format", "csv")  # Or another format supported by Auto Loader.
  # $absolute_path_to_files should be the full server-side file system path
  # relative to the root /, for example /home/$username/subdir/source/…
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "<path to store checkpoint information>")  # This is a cloud storage path.
  .trigger(availableNow=True)
  .table("<table name>"))

query.awaitTermination()
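Schema inference, evolution, and data rescue work the same way as in any Auto Loader stream. The following is a hedged sketch of a read that pins a couple of column types, lets new columns evolve the schema, and captures non-conforming data in a rescue column; the option values, paths, and table name are placeholders:

query = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  # Evolve the table schema when new columns appear in the source files.
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  # Pin types for known columns; everything else is inferred.
  .option("cloudFiles.schemaHints", "id BIGINT, event_time TIMESTAMP")
  # Rows that don't match the schema are captured here instead of failing the stream.
  .option("rescuedDataColumn", "_rescued_data")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files")
  .writeStream
  .option("checkpointLocation", "<path to store checkpoint information>")
  .trigger(availableNow=True)
  .table("<table name>"))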
The following examples show how to read files from an SFTP server using Auto Loader in Lakeflow Spark Declarative Pipelines:
- Python
- SQL
import dlt

@dlt.table
def sftp_bronze_table():
  return (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")  # Or another format supported by Auto Loader.
    # $absolute_path_to_files should be the full server-side file system path
    # relative to the root /, for example /home/$username/subdir/source/…
    .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files"))
CREATE OR REFRESH STREAMING TABLE sftp_bronze_table
AS SELECT * FROM STREAM read_files(
"sftp://$your_user@$your_host:$your_port/$absolute_path_to_files",
format => "csv"
)
Configure Auto Loader options. All options are supported except:
- cloudFiles.useNotifications
- cloudFiles.useManagedFileEvents
- cloudFiles.cleanSource
- Cloud-specific options
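For example, the following is a hedged sketch of a read that sets a couple of the supported options; the values are illustrative only:

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  # Cap how many files are processed in each micro-batch.
  .option("cloudFiles.maxFilesPerTrigger", "1000")
  # Skip files that already exist on the server when the stream first starts.
  .option("cloudFiles.includeExistingFiles", "false")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files"))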
Limitations
- SFTP is not supported across other ingestion surfaces, including COPY INTO, spark.read, and dbutils.fs.ls.
- Writing back to an SFTP server is not supported.
- Auto Loader cleanSource (deleting or archiving files at the source after ingestion) is not supported.
- The FTP protocol is not supported.
FAQ
Find answers to frequently asked questions about the SFTP connector.
How do I use wildcards or file name patterns to select files to ingest?
The SFTP connector builds on the standard Auto Loader framework to read from SFTP servers. This means that all Auto Loader options are supported. For file name patterns and wildcards, use the pathGlobFilter or fileNamePattern options. See Auto Loader options.
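For example, the following is a hedged sketch that limits ingestion to file names matching a glob pattern; the pattern and paths are placeholders:

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  # Only ingest files whose names match this glob, for example orders_2024-01.csv.
  .option("pathGlobFilter", "orders_*.csv")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files"))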
Can the SFTP connector ingest encrypted files? (Is PGP supported?)
The connector doesn't decrypt in flight, but you can ingest the encrypted files as binary files and decrypt them after ingestion.
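For example, the following is a hedged sketch that lands the encrypted files as raw bytes in a bronze table and leaves decryption (for example, with a PGP library of your choice) to a downstream step; the paths and table name are placeholders:

query = (spark.readStream.format("cloudFiles")
  # Each file is ingested as a single row containing its path, metadata, and raw content bytes.
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.schemaLocation", "<path to store schema information>")
  .load("sftp://$your_user@$your_host:$your_port/$absolute_path_to_files")
  .writeStream
  .option("checkpointLocation", "<path to store checkpoint information>")
  .trigger(availableNow=True)
  .table("<encrypted bronze table name>"))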
How do I handle incompatible private key formats?
Only PEM format is supported. You can generate a private key in PEM format by doing one of the following:
- (Option 1) Create a new RSA key in the standard PEM format:
  % ssh-keygen -t rsa -m pem
- (Option 2) Convert the existing OpenSSH-format key to PEM format:
  % ssh-keygen -p -m pem -f /path/to/key # This updates the key file.