Auto Loader FAQ

Commonly asked questions about Databricks Auto Loader.

Does Auto Loader process the file again when the file gets appended or overwritten?

Files are processed exactly once unless cloudFiles.allowOverwrites is enabled. When a file is appended to or overwritten, Databricks cannot guarantee which version of the file will be processed. You should also use caution when enabling cloudFiles.allowOverwrites in file notification mode, where Auto Loader might identify new files through both file notifications and directory listing. Due to the discrepancy between file notification event time and file modification time, Auto Loader might obtain two different timestamps and therefore ingest the same file twice, even when the file is only written once.
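For reference, here is a minimal sketch of a stream with the option enabled; the format and paths are hypothetical:

// Opt in to reprocessing files that change after ingestion.
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.allowOverwrites", "true")
  .option("cloudFiles.schemaLocation", "/checkpoints/raw/schema") // hypothetical path
  .load("/landing/raw/") // hypothetical path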

In general, Databricks recommends you use Auto Loader to ingest only immutable files and avoid setting cloudFiles.allowOverwrites. If this does not meet your requirements, contact your Databricks account team.

If my data files do not arrive continuously, but at regular intervals, for example, once a day, should I still use this source and are there any benefits?

In this case, you can set up a Trigger.AvailableNow (available in Databricks Runtime 10.4 LTS and above) Structured Streaming job and schedule it to run after the anticipated file arrival time. Auto Loader works well with both infrequent and frequent updates. Even when individual updates are very large, Auto Loader scales well to the input size. Auto Loader’s efficient file discovery techniques and schema evolution capabilities make Auto Loader the recommended method for incremental data ingestion.
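As an illustration, here is a minimal sketch of an Auto Loader stream that runs with Trigger.AvailableNow and stops once all available files are processed; the paths and table name are hypothetical:

import org.apache.spark.sql.streaming.Trigger

// Process everything that arrived since the last run, then stop.
spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/daily/schema") // hypothetical path
  .load("/landing/daily/") // hypothetical source directory
  .writeStream
  .option("checkpointLocation", "/checkpoints/daily") // hypothetical path
  .trigger(Trigger.AvailableNow)
  .toTable("daily_ingest") // hypothetical target table

You can then schedule this code as a daily job so it runs shortly after the files are expected to land.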

What happens if I change the checkpoint location when restarting the stream?

A checkpoint location maintains important identifying information for a stream. Changing the checkpoint location effectively means that you have abandoned the previous stream and started a new one.
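In code, the checkpoint location is the checkpointLocation option on the stream writer; keeping it stable across restarts is what lets the stream resume. A sketch, with hypothetical names:

// The identity of the stream lives in this directory; pointing the
// writer at a different directory starts a brand-new stream.
df.writeStream // df is an Auto Loader streaming DataFrame as in the examples above
  .option("checkpointLocation", "/checkpoints/my-stream") // hypothetical path
  .toTable("my_table") // hypothetical target table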

Do I need to create event notification services beforehand?

No. If you choose file notification mode and provide the required permissions, Auto Loader can create file notification services for you. See What is Auto Loader file notification mode?
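File notification mode is enabled with the cloudFiles.useNotifications option; a minimal sketch with hypothetical paths:

// With useNotifications set, Auto Loader creates the queue and event
// subscription on your behalf, given sufficient permissions.
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.schemaLocation", "/checkpoints/notify/schema") // hypothetical path
  .load("s3://my-bucket/input/") // hypothetical path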

How do I clean up the event notification resources created by Auto Loader?

You can use the cloud resource manager to list and tear down resources. You can also delete these resources manually using the cloud provider’s UI or APIs.
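For example, on AWS the Scala cloud resource manager can list and tear down the notification services that Auto Loader created; the region and stream ID below are placeholders, and the setup options differ slightly per cloud provider:

import com.databricks.sql.CloudFilesAWSResourceManager

val manager = CloudFilesAWSResourceManager
  .newManager
  .option("cloudFiles.region", "us-west-2") // hypothetical region
  .create()

// List the notification services created by Auto Loader, then tear
// down the services associated with a specific stream ID.
manager.listNotificationServices()
manager.tearDownNotificationServices("<stream-id>")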

Can I run multiple streaming queries from different input directories on the same bucket/container?

Yes, as long as they are not parent-child directories; for example, prod-logs/ and prod-logs/usage/ would not work, because prod-logs/usage/ is a child directory of prod-logs/.
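For instance, two independent streams reading from sibling directories of the same bucket work, because neither input path is a parent of the other; the paths below are hypothetical:

// Each stream gets its own schema location (and, on write, its own checkpoint).
val usage = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/usage/schema") // hypothetical path
  .load("s3://my-bucket/prod-logs/usage/")

val audit = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/audit/schema") // hypothetical path
  .load("s3://my-bucket/prod-logs/audit/")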

Can I use this feature when there are existing file notifications on my bucket or container?

Yes, as long as your input directory does not conflict with the existing notification prefix (for example, through the parent-child directory relationship described above).

How does Auto Loader infer schema?

When the DataFrame is first defined, Auto Loader lists your source directory and chooses the most recent (by file modification time) 50 GB of data or 1000 files, and uses those to infer your data schema.
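If the defaults do not suit your data, the sample size can be tuned with Spark configs; a sketch, with the values below chosen arbitrarily:

// Bound the sample Auto Loader reads for schema inference
// (the defaults correspond to 50 GB and 1000 files).
spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes", "10gb")
spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles", "200")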

Auto Loader also infers partition columns by examining the source directory structure and looks for file paths that contain the /key=value/ structure. If the source directory has an inconsistent structure, for example:

base/path/partition=1/date=2020-12-31/file1.json
// inconsistent because date and partition directories are in different orders
base/path/date=2020-12-31/partition=2/file2.json
// inconsistent because the date directory is missing
base/path/partition=3/file3.json

Auto Loader infers the partition columns as empty. Use cloudFiles.partitionColumns to explicitly parse columns from the directory structure.
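A sketch of setting cloudFiles.partitionColumns explicitly for a layout like the one above; the format and schema location are hypothetical:

// Explicitly parse these key=value directory names as partition columns
// instead of relying on inference over an inconsistent layout.
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.partitionColumns", "partition,date")
  .option("cloudFiles.schemaLocation", "/checkpoints/part/schema") // hypothetical path
  .load("base/path/")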

How does Auto Loader behave when the source folder is empty?

If the source directory is empty, Auto Loader requires you to provide a schema, because there is no data to infer it from.
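A sketch of providing an explicit schema so the stream can start against an empty directory; the columns and path are hypothetical:

import org.apache.spark.sql.types._

// With no files to sample, supply the schema up front.
val schema = new StructType()
  .add("id", LongType)
  .add("event", StringType)

val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .schema(schema)
  .load("/landing/empty-for-now/") // hypothetical path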

When does Auto Loader infer schema? Does it evolve automatically after every micro-batch?

The schema is inferred when the DataFrame is first defined in your code. During each micro-batch, schema changes are evaluated on the fly; therefore, you don’t need to worry about performance hits. When the stream restarts, it picks up the evolved schema from the schema location and starts executing without any overhead from inference.
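Schema tracking across restarts is driven by cloudFiles.schemaLocation, and how changes are handled by cloudFiles.schemaEvolutionMode; a sketch with hypothetical paths:

// The inferred schema is persisted to schemaLocation; on restart the
// stream reads the evolved schema from there instead of re-inferring.
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/events/schema") // hypothetical path
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns") // the default mode
  .load("/landing/events/") // hypothetical path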

What’s the performance impact on ingesting the data when using Auto Loader schema inference?

You should expect initial schema inference to take a couple of minutes for very large source directories. You shouldn’t observe significant performance hits otherwise during stream execution. If you run your code in a Databricks notebook, you can see status updates that specify when Auto Loader is listing your directory to sample files and infer your data schema.

Due to a bug, a bad file has changed my schema drastically. What should I do to roll back a schema change?

Contact Databricks support for help.