Configure Structured Streaming batch size on Databricks
Limiting the input rate for Structured Streaming queries helps to maintain a consistent batch size and prevents large batches from leading to spill and cascading micro-batch processing delays.
Databricks provides the same options to control Structured Streaming batch sizes for both Delta Lake and Auto Loader.
Limit input rate with maxFilesPerTrigger
cloudFiles.maxFilesPerTrigger for Auto Loader) specifies an upper-bound for the number of files processed in each micro-batch. For both Delta Lake and Auto Loader the default is 1000. (Note that this option is also present in Apache Spark for other file sources, where there is no max by default.)
Limit input rate with maxBytesPerTrigger
cloudFiles.maxBytesPerTrigger for Auto Loader) sets a “soft max” for the amount of data processed in each micro-batch. This means that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit. There is no default for this setting.
For example, if you specify a byte string such as
10g to limit each microbatch to 10 GB of data and you have files that are 3 GB each, Databricks processes 12 GB in a microbatch.
Setting multiple input rates together
If you use
maxBytesPerTrigger in conjunction with
maxFilesPerTrigger, the micro-batch processes data until reaching the lower limit of either
Limiting input rates for other Structured Streaming sources
Streaming sources such as Apache Kafka each have custom input limits, such as
maxOffsetsPerTrigger. For more details, see Work with streaming data sources on Databricks.