Configure Structured Streaming batch size on Databricks

This article explains how to use admission controls to maintain a consistent batch size for streaming queries.

Admission controls limit the input rate for Structured Streaming queries, which can help maintain a consistent batch size and prevent large batches from causing spill and cascading micro-batch processing delays.

Databricks provides the same options to control Structured Streaming batch sizes for both Delta Lake and Auto Loader.

note

You can modify admission control settings without resetting the checkpoint for a streaming query. See Recover after changes in a Structured Streaming query.

Changing admission control settings to increase or decrease batch size has performance implications. To optimize your workload, you might need to adjust your compute configurations.

Limit input rate with maxFilesPerTrigger

Setting maxFilesPerTrigger (or cloudFiles.maxFilesPerTrigger for Auto Loader) specifies an upper-bound for the number of files processed in each micro-batch. For both Delta Lake and Auto Loader the default is 1000. (Note that this option is also present in Apache Spark for other file sources, where there is no max by default.)

Limit input rate with maxBytesPerTrigger

Setting maxBytesPerTrigger (or cloudFiles.maxBytesPerTrigger for Auto Loader) sets a “soft max” for the amount of data processed in each micro-batch. This means that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit. There is no default for this setting.

For example, if you specify a byte string such as 10g to limit each microbatch to 10 GB of data and you have files that are 3 GB each, Databricks processes 12 GB in a microbatch.

Setting multiple input rates together

If you use maxBytesPerTrigger in conjunction with maxFilesPerTrigger, the micro-batch processes data until reaching the lower limit of either maxFilesPerTrigger or maxBytesPerTrigger.

Limiting input rates for other Structured Streaming sources

Streaming sources such as Apache Kafka each have custom input limits, such as maxOffsetsPerTrigger. For more details, see Standard connectors in Lakeflow Connect.

Limit input rate with maxFilesPerTrigger​

Limit input rate with maxBytesPerTrigger​

Setting multiple input rates together​

Limiting input rates for other Structured Streaming sources​

Limit input rate with maxFilesPerTrigger

Limit input rate with maxBytesPerTrigger

Setting multiple input rates together

Limiting input rates for other Structured Streaming sources