Configure Structured Streaming batch size on Databricks

Limiting the input rate for Structured Streaming queries helps to maintain a consistent batch size and prevents large batches from leading to spill and cascading micro-batch processing delays.

Databricks provides the same options to control Structured Streaming batch sizes for both Delta Lake and Auto Loader.

Limit input rate with maxFilesPerTrigger

Setting maxFilesPerTrigger (or cloudFiles.maxFilesPerTrigger for Auto Loader) specifies an upper-bound for the number of files processed in each micro-batch. For both Delta Lake and Auto Loader the default is 1000. (Note that this option is also present in Apache Spark for other file sources, where there is no max by default.)

Limit input rate with maxBytesPerTrigger

Setting maxBytesPerTrigger (or cloudFiles.maxBytesPerTrigger for Auto Loader) sets a “soft max” for the amount of data processed in each micro-batch. This means that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit. There is no default for this setting.

For example, if you specify a byte string such as 10g to limit each microbatch to 10 GB of data and you have files that are 3 GB each, Databricks processes 12 GB in a microbatch.

Setting multiple input rates together

If you use maxBytesPerTrigger in conjunction with maxFilesPerTrigger, the micro-batch processes data until reaching the lower limit of either maxFilesPerTrigger or maxBytesPerTrigger.

Limiting input rates for other Structured Streaming sources

Streaming sources such as Apache Kafka each have custom input limits, such as maxOffsetsPerTrigger. For more details, see Working with pub/sub and message queues on Databricks.