Commonly asked questions about Databricks Auto Loader.
Files are processed exactly once unless cloudFiles.allowOverwrites is enabled. When a file is appended to or overwritten, Databricks cannot guarantee which version of the file will be processed. You should also use caution when enabling cloudFiles.allowOverwrites in file notification mode, where Auto Loader might identify new files through both file notifications and directory listing. Due to the discrepancy between file notification event time and file modification time, Auto Loader might obtain two different timestamps and therefore ingest the same file twice, even when the file is only written once.
In general, Databricks recommends you use Auto Loader to ingest only immutable files and avoid setting cloudFiles.allowOverwrites. If this does not meet your requirements, contact your Databricks representative.
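The double-ingestion risk can be illustrated with a toy model in Python. This is not Auto Loader's implementation. It assumes, purely for illustration, that with overwrites allowed a file counts as new whenever its (path, timestamp) pair has not been seen before; the names ToyTracker and discover are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracker:
    """Toy model only: remembers which files it has already ingested."""
    allow_overwrites: bool = False
    seen: set = field(default_factory=set)
    ingested: list = field(default_factory=list)

    def discover(self, path: str, timestamp: int) -> bool:
        # Assumption: with overwrites disallowed, the path alone identifies a file.
        # With overwrites allowed, a different timestamp looks like a new version.
        key = (path, timestamp) if self.allow_overwrites else (path, None)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.ingested.append(path)
        return True


# One physical write, reported with two different timestamps:
# the notification event time and the file modification time.
event_time, modification_time = 1_700_000_005, 1_700_000_000

strict = ToyTracker(allow_overwrites=False)
strict.discover("logs/a.json", event_time)
strict.discover("logs/a.json", modification_time)

lenient = ToyTracker(allow_overwrites=True)
lenient.discover("logs/a.json", event_time)
lenient.discover("logs/a.json", modification_time)

print(len(strict.ingested), len(lenient.ingested))
```

In this model the strict tracker ingests the file once, while the lenient tracker ingests it twice even though it was written only once, which mirrors the behavior described above.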
If my data files do not arrive continuously, but at regular intervals, for example, once a day, should I still use this source, and are there any benefits?
In this case, you can set up a Trigger.AvailableNow (available in Databricks Runtime 10.2 and later) Structured Streaming job and schedule it to run after the anticipated file arrival time. Auto Loader works well with both infrequent and frequent updates. Even if the eventual updates are very large, Auto Loader scales well to the input size. Auto Loader’s efficient file discovery techniques and schema evolution capabilities make it the recommended method for incremental data ingestion.
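A scheduled job of this kind might look like the following PySpark sketch. It assumes a Databricks runtime where spark is the session provided by the environment; the function names, paths, and table name are hypothetical placeholders, and availableNow=True is the Python spelling of Trigger.AvailableNow.

```python
def autoloader_options(file_format: str, schema_location: str) -> dict:
    """Options for the cloudFiles source; the values passed in are placeholders."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def run_scheduled_ingest(spark, source_path, target_table,
                         checkpoint_location, schema_location,
                         file_format="json"):
    """Process everything that arrived since the last run, then stop."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in autoloader_options(file_format, schema_location).items():
        reader = reader.option(key, value)
    return (
        reader.load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_location)
        .trigger(availableNow=True)  # drain the available files, then terminate
        .toTable(target_table)
    )
```

Scheduling this function as a job shortly after the expected arrival time gives batch-like runs while keeping incremental processing, provided every run reuses the same checkpoint location.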
A checkpoint location maintains important identifying information of a stream. Changing the checkpoint location effectively means that you have abandoned the previous stream and started a new stream.
No. If you choose file notification mode and provide the required permissions, Auto Loader can create file notification services for you. See What is Auto Loader file notification mode?
You can use the cloud resource manager to list and tear down resources. You can also delete these resources manually using the cloud provider’s UI or APIs.
Yes, as long as they are not parent-child directories; for example,
prod-logs/usage/ would not work because
/usage is a child directory of
Yes, as long as your input directory does not conflict with the existing notification prefix (for example, the above parent-child directories).
When the DataFrame is first defined, Auto Loader lists your source directory and chooses the most recent (by file modification time) 50 GB of data or 1000 files, and uses those to infer your data schema.
Auto Loader also infers partition columns by examining the source directory structure and looks for file paths that contain the /key=value/ structure. If the source directory has an inconsistent structure, for example:
base/path/partition=1/date=2020-12-31/file1.json
// inconsistent because date and partition directories are in different orders
base/path/date=2020-12-31/partition=2/file2.json
// inconsistent because the date directory is missing
base/path/partition=3/file3.json
Auto Loader infers the partition columns as empty. Use cloudFiles.partitionColumns to explicitly parse columns from the directory structure.
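The inconsistency in the example paths can be checked with a few lines of plain Python. This is only an illustration of what "inconsistent structure" means, not Auto Loader's inference algorithm; partition_keys is a hypothetical helper, and the option value "partition,date" assumes you want both columns parsed from the path.

```python
def partition_keys(path: str, base: str = "base/path/") -> list:
    """Return the key names of /key=value/ directories, in path order."""
    relative = path[len(base):] if path.startswith(base) else path
    directories = relative.split("/")[:-1]  # drop the file name
    return [d.split("=", 1)[0] for d in directories if "=" in d]


paths = [
    "base/path/partition=1/date=2020-12-31/file1.json",
    "base/path/date=2020-12-31/partition=2/file2.json",
    "base/path/partition=3/file3.json",
]

# A consistent layout would yield exactly one distinct key sequence.
layouts = {tuple(partition_keys(p)) for p in paths}
is_consistent = len(layouts) == 1

# Declaring the columns explicitly avoids relying on inference.
explicit_options = {"cloudFiles.partitionColumns": "partition,date"}
```

Here the three paths produce three different key sequences, so is_consistent is False; passing explicit_options to the cloudFiles reader names the partition columns directly instead.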
If the source directory is empty, Auto Loader requires you to provide a schema as there is no data to perform inference.
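For an empty source directory, a schema can be supplied up front. The sketch below assumes a Databricks runtime where spark is the provided session; the function name, the column names in EXAMPLE_SCHEMA, and the source path are hypothetical. It relies on the reader's schema method accepting a DDL-formatted string.

```python
# Hypothetical columns; replace with the fields your files actually contain.
EXAMPLE_SCHEMA = "id BIGINT, event_time TIMESTAMP, payload STRING"


def read_with_explicit_schema(spark, source_path, schema_ddl, file_format="json"):
    """Define an Auto Loader stream without relying on schema inference."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .schema(schema_ddl)  # explicit schema, so no files are needed for sampling
        .load(source_path)
    )
```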
The schema is inferred when the DataFrame is first defined in your code. During each micro-batch, schema changes are evaluated on the fly; therefore, you don’t need to worry about performance hits. When the stream restarts, it picks up the evolved schema from the schema location and starts executing without any overhead from inference.
You should expect initial schema inference to take a couple of minutes for very large source directories. Otherwise, you shouldn’t observe significant performance hits during stream execution. If you run your code in a Databricks notebook, you can see status updates that indicate when Auto Loader is listing your directory for sampling and inferring your data schema.