Structured Streaming concepts

Apache Spark Structured Streaming is a near real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.

For a step-by-step tutorial, see Run your first Structured Streaming workload.

Read from a data stream

Use Structured Streaming to incrementally ingest data from supported data sources.

- Auto Loader
- Incrementally and efficiently process new data files as they arrive in cloud storage.
- Delta Lake table streaming reads and writes
- Use Delta Lake tables as streaming sources and sinks with exactly-once processing guarantees.
- Standard connectors
- Connect to message buses, queues, and enterprise applications using standard connectors.
- Micro-batch size
- Limit input rates to maintain consistent batch sizes and prevent processing delays.

Write to a data sink

Configure how Structured Streaming delivers data to target systems.

- Checkpoints
- Store processing state to enable fault tolerance and exactly-once delivery semantics.
- Output mode
- Choose between append, update, and complete modes for stateful streaming queries.
- Trigger intervals
- Set trigger intervals to balance latency and cost for your processing requirements.
- Real-time mode in Structured Streaming
- Process data for real-time workloads with end-to-end latency as low as five milliseconds.

Stateful and stateless processing

Stateless queries process rows without retaining state. Stateful queries maintain intermediate state for aggregations, joins, and deduplication.

- Stateless streaming queries
- Optimize queries that process data without maintaining intermediate state.
- Watermarks
- Control how long Structured Streaming waits for late-arriving data in stateful operations.
- Stateful streaming
- Manage aggregations, stream-stream joins, and deduplication using stateful operators.

Monitor and manage

Track query performance, apply optimizations, and govern data access for production Structured Streaming workloads.

- Monitor with StreamingQueryListener
- Track query progress and performance metrics using the Spark UI and listener API.
- Govern with Unity Catalog
- Configure Unity Catalog for streaming workloads with governance and access control.

Read from a data stream​

Write to a data sink​

Stateful and stateless processing​

Monitor and manage​

Read from a data stream

Write to a data sink

Stateful and stateless processing

Monitor and manage