Configure Auto Loader streams in directory listing mode

This page describes how to configure Auto Loader streams to use directory listing mode to incrementally discover and ingest cloud data.

Auto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any permission configurations other than access to your data on cloud storage.

For best performance with directory listing mode, use Databricks Runtime 9.1 or above. This article describes the default functionality of directory listing mode as well as optimizations based on lexical ordering of files.
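As a minimal sketch of what such a stream can look like in PySpark, the following helper builds the Auto Loader options and opens the stream. It assumes an active `spark` session on Databricks. The function names, the paths in the usage comment, and the JSON format are hypothetical, chosen only for illustration. Because directory listing is the default, `cloudFiles.useNotifications` is set to `false` only to make the mode explicit.

```python
def directory_listing_options(file_format, schema_location):
    """Build Auto Loader options for a directory listing stream."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Directory listing is the default; set explicitly here only for clarity.
        "cloudFiles.useNotifications": "false",
    }


def read_directory_listing_stream(spark, source_path, file_format, schema_location):
    """Return a streaming DataFrame that discovers files by listing source_path."""
    options = directory_listing_options(file_format, schema_location)
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )


# Hypothetical usage in a Databricks notebook, where `spark` already exists:
# df = read_directory_listing_stream(
#     spark, "/Volumes/main/raw/events", "json", "/Volumes/main/raw/_schemas/events"
# )
```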

note

Databricks recommends file notification mode using file events on external locations instead of directory listing mode for most workloads, particularly with continuous triggers such as Trigger.ProcessingTime, where directory listing mode continuously lists the entire directory and can significantly increase LIST API costs. If you are using Auto Loader in directory listing mode today, Databricks recommends that you migrate to file notification mode using file events to see significant performance improvements and decreased costs. See Configure Auto Loader streams in file notification mode.

How does directory listing mode work?

Databricks has optimized directory listing mode for Auto Loader to discover files in cloud storage more efficiently than other Apache Spark options.

For example, suppose files are uploaded every 5 minutes as /some/path/YYYY/MM/DD/HH/fileName. To find all the files in these directories, the Apache Spark file source lists all subdirectories in parallel. For one year of data, the following calculation estimates the total number of LIST API calls to object storage:

1 (base directory) + 365 (per day) * 24 (per hour) = 8761 calls

By receiving a flattened response from storage, Auto Loader reduces the number of API calls to the number of files in storage divided by the number of results returned by each API call, greatly reducing your cloud costs. The following table shows the number of files returned by each API call for common object storage:

Results returned per call	Object storage
1000	S3
5000	ADLS
1024	GCS
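To make the difference concrete, the two estimates can be compared with a short calculation. This sketch assumes that exactly one file lands every 5 minutes for a full year, which the example above does not state precisely. The function names are hypothetical, and the page sizes come from the table above.

```python
import math

# Results returned per LIST call, taken from the table above.
RESULTS_PER_CALL = {"S3": 1000, "ADLS": 5000, "GCS": 1024}


def per_directory_list_calls(days=365, hours_per_day=24):
    """LIST calls when every hourly directory is listed separately."""
    return 1 + days * hours_per_day


def flattened_list_calls(total_files, results_per_call):
    """LIST calls when storage returns a flattened, paginated response."""
    return math.ceil(total_files / results_per_call)


# Assumption: exactly one file lands every 5 minutes for a full year.
files_per_year = 12 * 24 * 365  # 105120 files

print("per-directory listing:", per_directory_list_calls())  # 8761
for storage, page_size in RESULTS_PER_CALL.items():
    # S3 -> 106, ADLS -> 22, GCS -> 103
    print(storage, flattened_list_calls(files_per_year, page_size))
```

Under these assumptions, the flattened listing needs roughly 100 calls or fewer where per-directory listing needs 8761.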

Incremental listing (deprecated)

important

Incremental listing is deprecated. Databricks recommends using file notification mode with file events instead. Incremental listing does not guarantee file processing order. Do not use it for ordered file ingestion.

note

Available in Databricks Runtime 9.1 LTS and above.

Incremental listing is available for Azure Data Lake Storage (abfss://), S3 (s3://) and GCS (gs://).

For files that are generated in lexical order, Auto Loader uses the lexical file ordering and optimized listing APIs to improve the efficiency of directory listing: it lists from recently ingested files rather than listing the contents of the entire directory.

When cloudFiles.useIncrementalListing is set to auto, Auto Loader automatically detects whether a given directory is applicable for incremental listing by checking and comparing file paths of previously completed directory listings. To ensure eventual completeness of data in auto mode, Auto Loader automatically triggers a full directory list after completing 7 consecutive incremental lists. You can control the frequency of full directory lists by setting cloudFiles.backfillInterval to trigger asynchronous backfills at a given interval.
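The two options named above could be set as follows. The `1 day` interval is an example value, not a recommendation. The `listing_schedule` function is a hypothetical toy model of the "full list after 7 consecutive incremental lists" rule described above; it is not Auto Loader's actual implementation, and it ignores backfills and the initial listing.

```python
# Options for the (deprecated) incremental listing mode.
incremental_listing_options = {
    "cloudFiles.useIncrementalListing": "auto",
    # Example value: trigger an asynchronous backfill once a day.
    "cloudFiles.backfillInterval": "1 day",
}


def listing_schedule(num_listings, incremental_per_full=7):
    """Toy model: label each listing as 'incremental' or 'full'."""
    schedule = []
    consecutive_incremental = 0
    for _ in range(num_listings):
        if consecutive_incremental == incremental_per_full:
            # A full list follows 7 consecutive incremental lists.
            schedule.append("full")
            consecutive_incremental = 0
        else:
            schedule.append("incremental")
            consecutive_incremental += 1
    return schedule


print(listing_schedule(9))
```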

Lexical ordering of files

For files to be lexically ordered, each newly uploaded file must have a prefix that is lexicographically greater than the prefixes of existing files. Some examples of lexically ordered directories are shown below.

Lexical ordering improves file discovery efficiency, but Auto Loader does not guarantee the order in which files are discovered or processed. Design your pipelines to handle out-of-order file arrivals. For guidance, see Handle out-of-order data.

Versioned files

Delta Lake makes commits to table transaction logs in a lexical order.

<path-to-table>/_delta_log/00000000000000000000.json
<path-to-table>/_delta_log/00000000000000000001.json <- guaranteed to be written after version 0
<path-to-table>/_delta_log/00000000000000000002.json <- guaranteed to be written after version 1
...

AWS DMS uploads CDC files to AWS S3 in a versioned manner.

database_schema_name/table_name/LOAD00000001.csv
database_schema_name/table_name/LOAD00000002.csv
...
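Versioned names like these stay lexically ordered only because the version number is zero-padded. The following sketch checks that property on a list of names in upload order. Both helper functions are hypothetical, and the 20-digit width mirrors the _delta_log example above.

```python
def is_lexically_ordered(paths_in_upload_order):
    """True if each path is strictly greater than the one uploaded before it."""
    return all(
        earlier < later
        for earlier, later in zip(paths_in_upload_order, paths_in_upload_order[1:])
    )


def delta_log_file(version):
    """Zero-pad the version to 20 digits, as in the _delta_log example above."""
    return f"{version:020d}.json"


padded = [delta_log_file(v) for v in (9, 10, 11)]
unpadded = [f"{v}.json" for v in (9, 10, 11)]

print(is_lexically_ordered(padded))    # True
print(is_lexically_ordered(unpadded))  # False: "9.json" sorts after "10.json"
```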

Date partitioned files

Files can be uploaded in a date partitioned format. Some examples of this are:

// <base-path>/yyyy/MM/dd/HH:mm:ss-randomString
<base-path>/2021/12/01/10:11:23-b1662ecd-e05e-4bb7-a125-ad81f6e859b4.json
<base-path>/2021/12/01/10:11:23-b9794cf3-3f60-4b8d-ae11-8ea320fad9d1.json
...

// <base-path>/year=yyyy/month=MM/day=dd/hour=HH/minute=mm/randomString
<base-path>/year=2021/month=12/day=04/hour=08/minute=22/442463e5-f6fe-458a-8f69-a06aa970fc69.csv
<base-path>/year=2021/month=12/day=04/hour=08/minute=22/8f00988b-46be-4112-808d-6a35aead0d44.csv <- this may be uploaded before the file above as long as processing happens less frequently than a minute

When files are uploaded with date partitioning, some things to keep in mind are:

Months, days, hours, and minutes must be left-padded with zeros to ensure lexical ordering (for example, upload as hour=03 instead of hour=3, or as 2021/05/03 instead of 2021/5/3).
Files don't necessarily have to be uploaded in lexical order in the deepest directory as long as processing happens less frequently than the parent directory's time granularity.
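A producer can get the zero padding for free by formatting timestamps with fixed-width directives. The sketch below builds paths in the year=/month=/day=/hour=/minute= layout shown earlier. The `partition_path` helper, the base path, and the file names are hypothetical.

```python
from datetime import datetime


def partition_path(base_path, ts, file_name):
    """Build <base>/year=yyyy/month=MM/day=dd/hour=HH/minute=mm/<file>, zero-padded."""
    return (
        f"{base_path}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"
        f"/hour={ts:%H}/minute={ts:%M}/{file_name}"
    )


early = datetime(2021, 5, 3, 9, 5)
late = datetime(2021, 5, 3, 10, 5)

# Zero-padded: the 09:05 path sorts before the 10:05 path.
assert partition_path("/base", early, "a.csv") < partition_path("/base", late, "b.csv")

# Unpadded hours break the ordering: "hour=9" sorts after "hour=10".
assert not ("hour=9" < "hour=10")
```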

Some services that can upload files in a date partitioned lexical ordering are:

Azure Data Factory can be configured to upload files in a lexical order. See an example here.
Kinesis Firehose

Change source path for Auto Loader

In Databricks Runtime 11.3 LTS and above, you can change the directory input path for Auto Loader configured with directory listing mode without having to choose a new checkpoint directory.

warning

This functionality is not supported for file notification mode. If file notification mode is used and the path is changed, you might fail to ingest files that are already present in the new directory at the time of the directory update.

For example, if you wish to run a daily ingestion job that loads all data from a directory structure organized by day, such as /YYYYMMDD/, you can use the same checkpoint to track ingestion state information across a different source directory each day while maintaining state information for files ingested from all previously used source directories.
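A daily job of this kind might be sketched as follows. This assumes an active `spark` session on Databricks Runtime 11.3 LTS or above. The function names, the JSON format, the reuse of the checkpoint path as the schema location, and all paths and table names in the usage comment are hypothetical. Only the source directory changes from day to day; the checkpoint location stays the same.

```python
from datetime import date


def daily_source_path(base_path, day):
    """Return the /YYYYMMDD/ source directory for the given day."""
    return f"{base_path}/{day:%Y%m%d}/"


def run_daily_ingestion(spark, base_path, day, checkpoint_path, target_table):
    """Ingest one day's directory, reusing the same checkpoint on every run."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")  # assumed file format
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(daily_source_path(base_path, day))
        .writeStream
        .option("checkpointLocation", checkpoint_path)  # unchanged across days
        .trigger(availableNow=True)  # process available files, then stop
        .toTable(target_table)
    )


# Hypothetical usage, where `spark` already exists:
# run_daily_ingestion(spark, "/Volumes/main/raw/landing", date.today(),
#                     "/Volumes/main/raw/_checkpoints/landing", "main.bronze.events")
```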
