Set up scheduler pools for multiple Structured Streaming workloads

To enable multiple streaming queries to execute jobs concurrently on a shared cluster, you can configure queries to execute in separate scheduler pools.

How do scheduler pools work?

By default, all queries started in a notebook run in the same fair scheduling pool. Jobs generated by triggers from all of the streaming queries in a notebook run one after another in first in, first out (FIFO) order. This can cause unnecessary delays in the queries, because they are not efficiently sharing the cluster resources.

Scheduler pools allow you to declare which Structured Streaming queries share compute resources.

The following example assigns query1 to a dedicated pool, while query2 and query3 share a scheduler pool.

# Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("delta").start(path1)

# Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("delta").start(path2)

# Run streaming query3 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query3").format("delta").start(path3)

Note

The local property configuration must be in the same notebook cell where you start your streaming query.

See Apache fair scheduler documentation for more details.