Auto Loader FAQ

Find answers to frequently asked questions about Databricks Auto Loader.

Does Auto Loader process the file again when the file gets appended or overwritten?

With the default setting (cloudFiles.allowOverwrites = false), files are processed exactly once. When a file is appended to or overwritten, Auto Loader cannot guarantee which file version will be processed. To allow Auto Loader to process the file again when it is appended to or overwritten, you can set cloudFiles.allowOverwrites to true. In this case, Auto Loader is guaranteed to process the latest version of the file. However, Auto Loader cannot guarantee which intermediate version is processed.

Use caution if you enable cloudFiles.allowOverwrites in file notification mode. In file notification mode, Auto Loader might identify new files through both file notifications and directory listing. Because file notification event time and file modification time can differ, Auto Loader could receive two different timestamps and ingest the same file twice, even if the file hasn't been updated.

With cloudFiles.allowOverwrites enabled, you must handle duplicate records yourself. Auto Loader will reprocess the entire file even when it is appended to or partially updated. In general, Databricks recommends using Auto Loader to ingest immutable files only and using the default setting cloudFiles.allowOverwrites = false. If you have further questions, contact your Databricks account team.

How does Auto Loader determine whether a file has been ingested or not?

Auto Loader normally ingests each file only once based on its file path. However, if you set the cloudFiles.allowOverwrites option to true, Auto Loader also uses the file's last-modified timestamp to determine whether a file is new or has been updated and needs to be re-ingested. See Does Auto Loader process the file again when the file gets appended or overwritten?
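The decision rule described above can be illustrated with a small toy model. This is only a sketch of the described behavior, not Auto Loader's actual implementation: the IngestionLedger class, its fields, and the integer timestamps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IngestionLedger:
    """Toy stand-in (hypothetical) for the ingestion state a stream tracks."""
    allow_overwrites: bool = False
    # Maps file path -> last-modified timestamp recorded at ingestion time.
    seen: dict = field(default_factory=dict)

    def should_ingest(self, path: str, modified_at: int) -> bool:
        previous = self.seen.get(path)
        if previous is None:
            is_new = True                    # path never seen before
        elif self.allow_overwrites:
            is_new = modified_at > previous  # same path, newer timestamp
        else:
            is_new = False                   # default: the path alone decides
        if is_new:
            self.seen[path] = modified_at
        return is_new


# The same file is seen three times; the third sighting has a newer timestamp.
events = [("a.json", 100), ("a.json", 100), ("a.json", 250)]

default_ledger = IngestionLedger()
overwrite_ledger = IngestionLedger(allow_overwrites=True)

default_decisions = [default_ledger.should_ingest(p, t) for p, t in events]
overwrite_decisions = [overwrite_ledger.should_ingest(p, t) for p, t in events]
```

In this model, the default ledger ingests a.json once, while the ledger with overwrites allowed ingests it again when the timestamp advances. That is also why mismatched timestamps for an unchanged file can lead to duplicate ingestion.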

If my data files do not arrive continuously, but in regular intervals, for example, once a day, should I still use this source and are there any benefits?

In this case, you can set up a Structured Streaming job that uses Trigger.AvailableNow (available in Databricks Runtime 10.4 LTS and above) and schedule it to run after the anticipated file arrival time. Auto Loader works well with both infrequent and frequent updates. Even if the eventual updates are very large, Auto Loader scales well to the input size. Auto Loader's efficient file discovery techniques and schema evolution capabilities make it the recommended method for incremental data ingestion.
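A minimal PySpark sketch of such a scheduled job might look like the following. The function names, the path and table parameters, and the choice to reuse the checkpoint path as the schema location are assumptions for illustration. The job scheduling itself (for example, a daily job run) is configured outside this code.

```python
def reader_options(file_format, schema_location):
    """Auto Loader options for the read side of the stream."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def run_scheduled_ingest(spark, source_path, checkpoint_path, target_table,
                         file_format="json"):
    """Process every file that is currently available, then stop.

    `spark` is an active SparkSession; the remaining arguments are
    hypothetical placeholders supplied by the caller.
    """
    incoming = (
        spark.readStream.format("cloudFiles")
        # Assumption: the checkpoint path doubles as the schema location.
        .options(**reader_options(file_format, checkpoint_path))
        .load(source_path)
    )
    query = (
        incoming.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # drain the backlog, then shut down
        .toTable(target_table)
    )
    query.awaitTermination()
    return query
```

Because the checkpoint records which files were already ingested, each scheduled run only picks up files that arrived since the previous run.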

How does Auto Loader infer schema?

When the DataFrame is first defined, Auto Loader lists your source directory and chooses the most recent (by file modification time) 50 GB of data or 1000 files, and uses those to infer your data schema.

Auto Loader also infers partition columns by examining the source directory structure and looks for file paths that contain the /key=value/ structure. If the source directory has an inconsistent structure, for example:

base/path/partition=1/date=2020-12-31/file1.json
// inconsistent because date and partition directories are in different orders
base/path/date=2020-12-31/partition=2/file2.json
// inconsistent because the date directory is missing
base/path/partition=3/file3.json

Auto Loader infers the partition columns as empty. Use cloudFiles.partitionColumns to explicitly parse columns from the directory structure.
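To make the notion of an inconsistent structure concrete, the following sketch checks whether a set of file paths shares one /key=value/ layout. It is an illustration of the idea only, not Auto Loader's inference code. The helper names are hypothetical, and the partitionColumns value assumes you want both partition and date parsed.

```python
def partition_keys(path, base):
    """Return the ordered key names of key=value directories under base."""
    relative = path[len(base):].strip("/")
    directories = relative.split("/")[:-1]  # drop the file name
    return [d.split("=", 1)[0] for d in directories if "=" in d]


def has_consistent_partitions(paths, base):
    """True when every path uses the same keys in the same order."""
    layouts = {tuple(partition_keys(p, base)) for p in paths}
    return len(layouts) <= 1


paths = [
    "base/path/partition=1/date=2020-12-31/file1.json",
    "base/path/date=2020-12-31/partition=2/file2.json",
    "base/path/partition=3/file3.json",
]
consistent = has_consistent_partitions(paths, "base/path")

# With an inconsistent layout, name the columns to parse explicitly.
explicit_partition_options = {
    "cloudFiles.partitionColumns": "partition,date",
}
```

The options dictionary would be passed to the stream reader alongside the other cloudFiles options.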

How does Auto Loader behave when the source folder is empty?

If the source directory is empty, Auto Loader requires you to provide a schema as there is no data to perform inference.
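One way to supply a schema is as a DDL string, sketched below. The column names and types, the to_ddl helper, and the function parameters are hypothetical placeholders rather than anything Auto Loader prescribes.

```python
def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {data_type}" for name, data_type in columns)


# Hypothetical columns for the files that are expected to arrive later.
EXPECTED_COLUMNS = [
    ("id", "BIGINT"),
    ("event_time", "TIMESTAMP"),
    ("payload", "STRING"),
]


def read_with_explicit_schema(spark, source_path, file_format="json"):
    """Define the stream without inference so an empty directory is fine."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .schema(to_ddl(EXPECTED_COLUMNS))  # explicit schema, no sampling
        .load(source_path)
    )
```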

When does Auto Loader infer schema? Does it evolve automatically after every micro-batch?

The schema is inferred when the DataFrame is first defined in your code. During each micro-batch, schema changes are evaluated on the fly; therefore, you don't need to worry about performance hits. When the stream restarts, it picks up the evolved schema from the schema location and starts executing without any overhead from inference.

What's the performance impact on ingesting the data when using Auto Loader schema inference?

During initial schema inference, expect inference to take a couple of minutes for very large source directories. Otherwise, you shouldn't observe significant performance hits during stream execution. If you run your code in a Databricks notebook, you can see status updates that indicate when Auto Loader is listing your directory for sampling and inferring your data schema.

Due to a bug, a bad file has changed my schema drastically. What should I do to roll back a schema change?

Contact Databricks support for help.

What happens if I change the checkpoint location when restarting the stream?

A checkpoint location maintains important identifying information of a stream. Changing the checkpoint location effectively means that you have abandoned the previous stream and started a new stream.

Do I need to create event notification services beforehand?

No. If you choose file notification mode and provide the required permissions, Auto Loader can create file notification services for you. See Manage file notification queues for each Auto Loader stream separately (classic).

If file events are enabled on the external location in Unity Catalog, the file events service can create the file events in your cloud provider, and you don't need to configure Auto Loader to create them for each stream. See Use file notification mode with file events.
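As a rough sketch, the two approaches correspond to different stream options. The discovery_options helper and its mode names are hypothetical. The option keys are cloudFiles.useNotifications for the classic per-stream mode and cloudFiles.useManagedFileEvents for file events. Check the options reference for the additional cloud-specific settings that your environment requires.

```python
def discovery_options(mode):
    """Return the cloudFiles option that selects a file discovery mode.

    The mode names are labels invented for this sketch.
    """
    if mode == "file_events":
        # File events enabled on the Unity Catalog external location.
        return {"cloudFiles.useManagedFileEvents": "true"}
    if mode == "classic_notifications":
        # Auto Loader manages notification resources for this stream.
        return {"cloudFiles.useNotifications": "true"}
    if mode == "directory_listing":
        # Default mode: no extra option is needed.
        return {}
    raise ValueError(f"unknown discovery mode: {mode}")


file_events = discovery_options("file_events")
classic = discovery_options("classic_notifications")
```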

Can I run multiple streaming queries from different input directories on the same bucket/container?

Yes, as long as they are not parent-child directories. For example, prod-logs/ and prod-logs/usage/ would not work because prod-logs/usage/ is a child directory of prod-logs/.
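Before starting several streams, you could check the planned input directories for parent-child overlaps with a small helper such as the one below. The function names and the example directories other than prod-logs/ and prod-logs/usage/ are hypothetical.

```python
from itertools import combinations
from pathlib import PurePosixPath


def is_parent_or_same(a, b):
    """True when directory a equals b or is an ancestor of b."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents


def conflicting_pairs(directories):
    """Return every pair of directories that overlap as parent and child."""
    return [
        (a, b)
        for a, b in combinations(directories, 2)
        if is_parent_or_same(a, b) or is_parent_or_same(b, a)
    ]


conflicts = conflicting_pairs(["prod-logs/", "prod-logs/usage/", "dev-logs/"])
no_conflicts = conflicting_pairs(["prod-logs/", "prod-logs-archive/"])
```

Comparing whole path components, rather than string prefixes, avoids flagging siblings such as prod-logs/ and prod-logs-archive/.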

Can I use this feature when there are existing file notifications on my bucket or container?

Yes, as long as your input directory does not conflict with the existing notification prefix (for example, through a parent-child directory relationship like the one described above).

Can I share an SQS queue between Auto Loader and other applications?

Databricks does not recommend sharing an SQS queue between Auto Loader and other applications. Instead, forward your S3 event notifications to an SNS topic, then subscribe a separate SQS queue for each application to that topic. Use an SNS subscription filter policy to ensure that only the relevant messages are forwarded to each queue. Then provide the dedicated queue to Auto Loader.

How do I confirm that file events are set up correctly?

Click the Test Connection button on the external location page. If you set up file events correctly, you'll see a green checkmark for the File Events Read item. If you just created the external location and enabled file events in Automatic mode, the test shows Skipped while Databricks sets up notifications for the external location. Wait a few minutes, then click Test Connection again. If Databricks doesn't have the required permissions to set up or read from file events, you'll see an error for the File Events Read item.

Can I avoid a full directory listing during the initial run?

No. Even if includeExistingFiles is set to false, Auto Loader performs a directory listing to discover files created after the stream start and get current with the file events cache (secure a valid read position in the cache and store it in the stream's checkpoint).

Do I need to set cloudFiles.backfillInterval to avoid missing files?

No. Databricks previously recommended this setting for the classic file notification mode because cloud storage notification systems could result in missed or late-arriving files. Now, Databricks performs full directory listings on the external location. The first full directory listing begins as soon as file events are enabled on the external location. Each subsequent listing occurs 24 hours after the last full scan as long as there is at least one Auto Loader stream using file events to ingest data.

I set up file events with a provided storage queue, but the queue was misconfigured and I missed files. How do I make sure that Auto Loader ingests the files missed when my queue was misconfigured?

First, verify that the provided queue misconfiguration is fixed. To check, click the Test Connection button on the external location page. If you set up file events correctly, a green checkmark appears for the File Events Read item.

Databricks performs a full directory listing for external locations with file events enabled. This directory listing discovers any files that were missed during the misconfiguration period and stores them in the file events cache.

After the misconfiguration is fixed and Databricks completes the directory listing, Auto Loader continues to read from the file events cache and automatically ingests any files missed during the misconfiguration period.

How do I recover from a CF_MANAGED_FILE_EVENTS_INVALID_CONTINUATION_TOKEN error?

This error occurs when the continuation token stored in the Auto Loader checkpoint for the file events service has become invalid.

Some common causes:

Turning cloudFiles.useManagedFileEvents off and then on again.
Modifying the source's external location or volume.
Modifying the provided queue.
Changing the cloudFiles.allowOverwrites or cloudFiles.readChangeFeed option.

To recover:

Set .option("cloudFiles.listOnStart", "true") and .option("cloudFiles.validateOptions", "false") on your streaming query.
Restart the stream. Auto Loader performs a full directory listing on start and bypasses the invalid continuation token.
After a successful micro-batch, remove both options and restart the stream.
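The recovery steps above can be sketched as a temporary set of options that is merged into the stream's normal options and dropped again afterward. The stream_options helper and the base options shown are hypothetical. Only the two recovery option keys come from the steps above.

```python
# Options that are only present while recovering from the invalid token.
RECOVERY_OPTIONS = {
    "cloudFiles.listOnStart": "true",
    "cloudFiles.validateOptions": "false",
}


def stream_options(base_options, recovering):
    """Return the options for this run without mutating base_options."""
    options = dict(base_options)
    if recovering:
        options.update(RECOVERY_OPTIONS)
    return options


# Hypothetical normal options for the stream.
base = {
    "cloudFiles.format": "json",
    "cloudFiles.useManagedFileEvents": "true",
}

recovery_run = stream_options(base, recovering=True)   # steps 1 and 2
normal_run = stream_options(base, recovering=False)    # step 3
```

Passing the result to spark.readStream.format("cloudFiles").options(**recovery_run) for one restart, and then reverting to normal_run after a successful micro-batch, mirrors the sequence described above.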

For more information about the cloudFiles.listOnStart option, see File notification.

How do I clean up the event notification resources created by Auto Loader?

You can use the cloud resource manager to list and tear down resources. You can also delete these resources manually using the cloud provider's UI or APIs.

How do I monitor my Auto Loader pipeline?

Auto Loader exposes key metrics through StreamingQueryListener and file-level ingestion state through cloud_files_state(). For guidance on monitoring metrics, querying ingestion state, building observability dashboards, and troubleshooting common issues, see Monitor and observe Auto Loader.
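As a small illustration, file-level ingestion state can be queried by passing the stream's checkpoint path to cloud_files_state(). The helper names and the checkpoint path below are hypothetical, and the sketch simply refuses paths that contain quotes rather than attempting to escape them.

```python
def ingestion_state_query(checkpoint_path):
    """Build a SQL query over cloud_files_state for one checkpoint."""
    if "'" in checkpoint_path or "\\" in checkpoint_path:
        raise ValueError("checkpoint path must not contain quotes or backslashes")
    return f"SELECT * FROM cloud_files_state('{checkpoint_path}')"


def load_ingestion_state(spark, checkpoint_path):
    """Run the query with an active SparkSession and return a DataFrame."""
    return spark.sql(ingestion_state_query(checkpoint_path))


# Hypothetical checkpoint location.
query_text = ingestion_state_query("/Volumes/main/raw/checkpoints/events")
```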
