Optimize performance of stateful Structured Streaming queries on Databricks

Managing the intermediate state information of stateful Structured Streaming queries can help prevent unexpected latency and production problems.

Preventing slow down from garbage collection (GC) pause in stateful streaming

If you have stateful operations in your streaming query (such as streaming aggregation) and you want to maintain millions of keys in the state, then you may face issues related to large JVM garbage collection (GC) pauses. This causes high variations in the micro-batch processing times. This occurs because your JVM’s memory maintains your state data by default. Having a large number of state objects puts pressure on your JVM memory, which causes high GC pauses.

In such cases, you can choose to use a more optimized state management solution based on RocksDB. This solution is available in Databricks Runtime. Rather than keeping the state in the JVM memory, this solution uses RocksDB to efficiently manage the state in the native memory and the local SSD. Furthermore, any changes to this state are automatically saved by Structured Streaming to the checkpoint location you have provided, thus providing full fault-tolerance guarantees (the same as default state management). For instructions for configuring RocksDB as state store, see Configure RocksDB state store on Databricks.