Configure Structured Streaming trigger intervals
Apache Spark Structured Streaming processes data incrementally. Controlling the trigger interval for batch processing lets you use Structured Streaming for workloads such as near real-time processing, refreshing databases every 5 minutes or once per hour, or batch processing all new data for a day or week.
Because Databricks Auto Loader uses Structured Streaming to load data, understanding how triggers work provides you with the greatest flexibility to control costs while ingesting data with the desired frequency.
Specifying time-based trigger intervals
Structured Streaming refers to time-based trigger intervals as “fixed interval micro-batches”. Using the processingTime keyword, specify a time duration as a string. When you specify a trigger interval that is too small (less than tens of seconds), the system may perform unnecessary checks to see if new data has arrived. Configure your processing time to balance latency requirements and the rate at which data arrives in the source.
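The following is a minimal PySpark sketch of a fixed interval trigger, not a definitive implementation. The function name, the `delta` sink format, the paths, and the default interval of one minute are illustrative assumptions; `streaming_df` is assumed to be a streaming DataFrame you have already created.

```python
def start_fixed_interval_query(streaming_df, target_path, checkpoint_path,
                               interval="1 minute"):
    """Start a streaming write that fires one micro-batch per `interval`.

    All arguments are hypothetical placeholders; adjust them to your workload.
    """
    return (
        streaming_df.writeStream
        .format("delta")                                # assumed sink format
        .option("checkpointLocation", checkpoint_path)  # tracks stream progress
        .trigger(processingTime=interval)               # fixed interval micro-batches
        .start(target_path)
    )
```

A longer interval such as `"5 minutes"` or `"1 hour"` trades latency for fewer checks against the source.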
Configuring incremental batch processing
In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads.
Apache Spark provides the .trigger(once=True) option to process all new data from the source directory as a single micro-batch. This trigger once pattern ignores all settings that control streaming input size, which can lead to massive spill or out-of-memory errors.
Databricks supports trigger(availableNow=True) in Databricks Runtime 10.2 and above for Delta Lake and Auto Loader sources. This functionality combines the batch processing approach of trigger once with the ability to configure batch size, resulting in multiple parallelized batches that give greater control for right-sizing batches and the resultant files.
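As a hedged illustration of incremental batch processing with trigger(availableNow=True), the sketch below reads from a Delta table, caps the files per micro-batch, and stops once all data available at start time has been processed. The function name, the paths, and the limit of 1000 files are assumptions made for this example. `maxFilesPerTrigger` is the Delta Lake source option; Auto Loader uses `cloudFiles.maxFilesPerTrigger` instead.

```python
def run_incremental_batch(spark, source_path, target_path, checkpoint_path,
                          max_files_per_batch=1000):
    """Process all currently available data in size-limited micro-batches.

    `spark` is an existing SparkSession; the paths and the file limit are
    hypothetical values chosen for illustration.
    """
    source = (
        spark.readStream
        .format("delta")
        .option("maxFilesPerTrigger", max_files_per_batch)  # batch size control
        .load(source_path)
    )
    query = (
        source.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # honors the batch size option, then stops
        .start(target_path)
    )
    query.awaitTermination()  # returns after the available data is processed
    return query
```

Because the checkpoint records progress, scheduling this function as a periodic job picks up only the data that arrived since the previous run.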
What is the default trigger interval?
Structured Streaming defaults to fixed interval micro-batches of 500ms. Databricks recommends you always specify a tailored trigger to minimize costs associated with checking if new data has arrived and processing undersized batches.
What is continuous processing mode?
Apache Spark supports an additional trigger interval known as Continuous Processing. This mode has been classified as experimental since Spark 2.3; consult with your Databricks representative to make sure you understand the trade-offs of this processing model.
Note that this continuous processing mode does not relate at all to continuous processing as applied in Delta Live Tables.
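For reference, open source Apache Spark selects this mode with the `continuous` keyword of the trigger, where the duration string sets the checkpoint interval rather than a batch interval. The sketch below is a minimal illustration only. The function name, the built-in `rate` test source, the `console` sink, and the one second interval are assumptions for the example, and continuous mode supports only a limited set of sources and sinks.

```python
def start_continuous_query(spark, checkpoint_interval="1 second"):
    """Start an experimental continuous processing query.

    `spark` is an existing SparkSession. The rate source and console sink
    are used here only because they are simple to demonstrate.
    """
    events = (
        spark.readStream
        .format("rate")               # built-in source that generates test rows
        .option("rowsPerSecond", 10)
        .load()
    )
    return (
        events.writeStream
        .format("console")
        .trigger(continuous=checkpoint_interval)  # checkpoint interval, not batch interval
        .start()
    )
```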