Real-time mode limitations
This page describes known limitations for real-time mode in Structured Streaming.
Source limitations
For Kinesis, real-time mode doesn't support polling mode. Also, frequent repartitions might negatively impact latency.
Union limitations
The Union operator has some limitations:
- Self-union is not supported:
  - For Kafka, you can't take a single source DataFrame object and union DataFrames derived from it. As a workaround, use different DataFrames that read from the same source.
  - For Kinesis, you can't union DataFrames derived from the same Kinesis source with the same configuration. As a workaround, instead of using different DataFrames, you can assign a different consumerName option to each DataFrame.
- Stateful operators (for example, aggregate, deduplicate, transformWithState) can't be defined before the Union.
- Union with batch sources is not supported.
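The two self-union workarounds above can be sketched in PySpark as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names (read_kafka, union_kafka, read_kinesis, union_kinesis), the filter expressions, and the consumer names are hypothetical, and the Kinesis option names streamName, region, and consumerMode are assumed from the Databricks Kinesis connector rather than taken from this page. Each helper takes an existing SparkSession as spark; enabling real-time mode itself is not shown.

```python
def read_kafka(spark, bootstrap_servers, topic):
    """Build a new streaming DataFrame over a Kafka topic on every call."""
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )


def union_kafka(spark, bootstrap_servers, topic):
    """Union two filtered views of one topic without a self-union."""
    # Not supported in real-time mode: deriving both branches from one object,
    #   src = read_kafka(...); src.filter(a).union(src.filter(b))
    # Workaround: each branch gets its own DataFrame over the same topic.
    errors = read_kafka(spark, bootstrap_servers, topic).filter(
        "CAST(value AS STRING) LIKE 'ERROR%'"  # hypothetical predicate
    )
    warnings = read_kafka(spark, bootstrap_servers, topic).filter(
        "CAST(value AS STRING) LIKE 'WARN%'"  # hypothetical predicate
    )
    return errors.union(warnings)


def read_kinesis(spark, stream_name, region, consumer_name):
    """Build a streaming DataFrame over a Kinesis stream with its own consumer."""
    return (
        spark.readStream.format("kinesis")
        .option("streamName", stream_name)      # assumed option name
        .option("region", region)               # assumed option name
        .option("consumerMode", "efo")          # assumed: the non-polling mode
        .option("consumerName", consumer_name)  # must differ per DataFrame
        .load()
    )


def union_kinesis(spark, stream_name, region):
    """Union two reads of one Kinesis stream, each with a distinct consumerName."""
    first = read_kinesis(spark, stream_name, region, "rtm-consumer-a")
    second = read_kinesis(spark, stream_name, region, "rtm-consumer-b")
    return first.union(second)
```

In both cases the key point is that the source is instantiated once per branch of the union; for Kinesis, the differing consumerName is what makes the two configurations distinct.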
mapPartitions limitation
mapPartitions in Scala and the similar Python APIs (mapInPandas, mapInArrow) take an iterator over the entire input partition and produce an iterator over the entire output, with an arbitrary mapping between input and output. These APIs can cause performance issues in real-time mode by blocking the entire output, which increases latency. In addition, the semantics of these APIs don't support watermark propagation well.
To achieve similar functionality, use scalar UDFs instead, combined with the techniques described in Transform complex data types or with filter.
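As an illustration of this recommendation, the following PySpark sketch expresses a transformation that might otherwise live inside mapInPandas as a scalar UDF followed by a filter, so the logic is written per row rather than per partition iterator. It is a sketch under assumptions: the sku column, the normalize_sku function, and the with_normalized_sku helper are hypothetical, and filter is interpreted here as DataFrame.filter.

```python
from typing import Optional


def normalize_sku(raw: Optional[str]) -> Optional[str]:
    """Per-row logic: trim and upper-case a SKU; return None for missing or blank input."""
    if raw is None:
        return None
    cleaned = raw.strip().upper()
    return cleaned or None


def with_normalized_sku(df):
    """Rewrite the (hypothetical) `sku` column row by row, then drop rejected rows."""
    # Imported lazily so normalize_sku can be unit-tested without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    normalize = udf(normalize_sku, StringType())  # scalar UDF: one value in, one value out
    return (
        df.withColumn("sku", normalize(col("sku")))
        .filter(col("sku").isNotNull())  # row-dropping moves out of the UDF into filter
    )
```

Calling with_normalized_sku(stream_df) on a streaming DataFrame before writeStream keeps the one-to-one mapping in the UDF and leaves row removal to filter, which is the part of mapInPandas behavior a scalar UDF alone can't express.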