
Stream-Stream Joins using Structured Streaming (Python)

This notebook illustrates different ways of joining streams.

We are going to use the canonical example of ad monetization, where we want to find out which ad impressions led to user clicks. Typically, in such scenarios, there are two streams of data from different sources - ad impressions and ad clicks. Both types of events have a common ad identifier (say, adId), and we want to match clicks with impressions based on the adId. In addition, each event also has a timestamp, which we will use to specify additional conditions in the query to limit the streaming state.

In the absence of actual data streams, we are going to generate fake data streams using the built-in "rate" source, which generates data at a given fixed rate.

from pyspark.sql.functions import rand

spark.conf.set("spark.sql.shuffle.partitions", "1")

impressions = (
  spark
    .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load()
    .selectExpr("value AS adId", "timestamp AS impressionTime")
)

clicks = (
  spark
  .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load()
  .where((rand() * 100).cast("integer") < 10)      # 10 out of every 100 impressions result in a click
  .selectExpr("(value - 50) AS adId", "timestamp AS clickTime")      # subtract 50 so that a click with the same adId as an impression is generated later (i.e. delayed data)
  .where("adId > 0")
)

Let's see what data these two streaming DataFrames generate.

display(impressions)

adId  impressionTime
0     2018-03-06T04:00:40.161+0000
1     2018-03-06T04:00:40.361+0000
2     2018-03-06T04:00:40.561+0000
3     2018-03-06T04:00:40.761+0000
4     2018-03-06T04:00:40.961+0000
5     2018-03-06T04:00:41.161+0000
6     2018-03-06T04:00:41.361+0000
7     2018-03-06T04:00:41.561+0000
8     2018-03-06T04:00:41.761+0000
9     2018-03-06T04:00:41.961+0000
10    2018-03-06T04:00:42.161+0000
11    2018-03-06T04:00:42.361+0000
12    2018-03-06T04:00:42.561+0000
13    2018-03-06T04:00:42.761+0000
14    2018-03-06T04:00:42.961+0000
15    2018-03-06T04:00:43.161+0000
16    2018-03-06T04:00:43.361+0000
17    2018-03-06T04:00:43.561+0000

Showing all 35 rows.

display(clicks)

adId  clickTime
1     2018-03-06T04:01:05.920+0000
10    2018-03-06T04:01:07.720+0000
26    2018-03-06T04:01:10.920+0000
36    2018-03-06T04:01:12.920+0000
51    2018-03-06T04:01:15.920+0000
61    2018-03-06T04:01:17.920+0000
76    2018-03-06T04:01:20.920+0000
86    2018-03-06T04:01:22.920+0000
101   2018-03-06T04:01:25.920+0000
116   2018-03-06T04:01:28.920+0000
126   2018-03-06T04:01:30.920+0000
136   2018-03-06T04:01:32.920+0000

Showing all 12 rows.

Note:

  • If you get an error saying the join is not supported, the problem may be that you are running this notebook on an older version of Spark (stream-stream joins require Spark 2.3 or later).
  • If you are running on Community Edition, click Cancel above to stop the streams, as you do not have enough cores to run many streams simultaneously.

Inner Join

Let's join these two data streams. This is exactly the same as joining two batch DataFrames/Datasets by their common key adId.

display(impressions.join(clicks, "adId"))

adId  impressionTime                clickTime
1     2018-03-06T04:01:50.442+0000  2018-03-06T04:02:01.564+0000
10    2018-03-06T04:01:52.242+0000  2018-03-06T04:02:03.364+0000
21    2018-03-06T04:01:54.442+0000  2018-03-06T04:02:05.564+0000
30    2018-03-06T04:01:56.242+0000  2018-03-06T04:02:07.364+0000
41    2018-03-06T04:01:58.442+0000  2018-03-06T04:02:09.564+0000
50    2018-03-06T04:02:00.242+0000  2018-03-06T04:02:11.364+0000
61    2018-03-06T04:02:02.442+0000  2018-03-06T04:02:13.564+0000
76    2018-03-06T04:02:05.442+0000  2018-03-06T04:02:16.564+0000
86    2018-03-06T04:02:07.442+0000  2018-03-06T04:02:18.564+0000
101   2018-03-06T04:02:10.442+0000  2018-03-06T04:02:21.564+0000
116   2018-03-06T04:02:13.442+0000  2018-03-06T04:02:24.564+0000
131   2018-03-06T04:02:16.442+0000  2018-03-06T04:02:27.564+0000
146   2018-03-06T04:02:19.442+0000  2018-03-06T04:02:30.564+0000
161   2018-03-06T04:02:22.442+0000  2018-03-06T04:02:33.564+0000
176   2018-03-06T04:02:25.442+0000  2018-03-06T04:02:36.564+0000
185   2018-03-06T04:02:27.242+0000  2018-03-06T04:02:38.364+0000
196   2018-03-06T04:02:29.442+0000  2018-03-06T04:02:40.564+0000
205   2018-03-06T04:02:31.242+0000  2018-03-06T04:02:42.364+0000

Showing all 31 rows.

After you start this query, you will start getting joined impressions and clicks within a minute. The delay of about a minute is because the clicks are generated with a delay relative to the corresponding impressions.

In addition, if you expand the details of the query above, you will find a few timelines of query metrics: the processing rates, the micro-batch durations, and the size of the state. If you keep running this query, you will notice that the state keeps growing in an unbounded manner. This is because the query must buffer all past input, as any new input can match with any input from the past.
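This buffering behavior can be illustrated with a toy, plain-Python simulation (the helper run_join and its eviction rule are hypothetical; this is not how Spark's state store actually works):

```python
# Toy illustration (plain Python, not Spark): why a stream-stream join
# without a time constraint must buffer all past input.

def run_join(impressions, clicks, max_click_delay=None):
    """Join (adId, time) impressions with (adId, time) clicks.
    If max_click_delay is set, impressions too old to match any
    future click can be evicted from the buffered state."""
    state = {}          # adId -> impressionTime (the streaming state)
    results = []
    for ad_id, imp_time in impressions:
        state[ad_id] = imp_time
    for ad_id, click_time in clicks:
        if ad_id in state:
            results.append((ad_id, state[ad_id], click_time))
        if max_click_delay is not None:
            # Evict impressions that can no longer match any future click.
            cutoff = click_time - max_click_delay
            state = {a: t for a, t in state.items() if t >= cutoff}
    return results, len(state)

imps = [(i, i) for i in range(100)]               # impression i arrives at time i
clks = [(i, i + 10) for i in range(0, 100, 10)]   # every 10th ad is clicked 10s later

_, unbounded = run_join(imps, clks)                     # no constraint: keep everything
_, bounded = run_join(imps, clks, max_click_delay=20)   # constraint: state stays small
print(unbounded, bounded)
```

Without a bound, all 100 impressions stay in state forever; with one, only the impressions that could still match a future click are kept.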

Inner Join with Watermarking

To avoid unbounded state, you have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and can therefore be cleared from the state. In other words, you have to do the following additional steps in the join.

  1. Define watermark delays on both inputs such that the engine knows how delayed the input can be.

  2. Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input are not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of two ways.

    a. Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),

    b. Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow).
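Option (b) is not demonstrated in this notebook, but the idea behind it can be sketched in plain Python (this is only an analogue of the tumbling-window bucketing that Spark's window() function performs; the helper name is hypothetical):

```python
from datetime import datetime, timedelta

def tumbling_window(ts: datetime, size: timedelta) -> datetime:
    """Start of the tumbling window containing ts (a plain-Python
    analogue of Spark's window(col, "1 minute") bucketing)."""
    epoch = datetime(1970, 1, 1)
    n = (ts - epoch) // size          # which window bucket ts falls into
    return epoch + n * size

w = timedelta(minutes=1)
imp_time = datetime(2018, 3, 6, 4, 0, 40)
click_time = datetime(2018, 3, 6, 4, 0, 55)   # 15s later, same minute
late_click = datetime(2018, 3, 6, 4, 1, 5)    # next minute

# Rows join when their event-time windows are equal.
print(tumbling_window(imp_time, w) == tumbling_window(click_time, w))
print(tumbling_window(imp_time, w) == tumbling_window(late_click, w))
```

Because equality on window start is a single discrete value per row, the engine can drop a window's state as soon as the watermark moves past that window.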

Let's apply these steps to our use case.

  1. Watermark delays: Say, the impressions and the corresponding clicks can be delayed/late in event-time by at most "10 seconds" and "20 seconds", respectively. This is specified in the query as watermark delays using withWatermark.

  2. Event-time range condition: Say, a click can occur within a time range of 0 seconds to 1 minute after the corresponding impression. This is specified in the query as a join condition between impressionTime and clickTime.
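The state-cleanup arithmetic implied by these two settings can be worked through in plain Python (a sketch of the reasoning only; Spark's internal watermark bookkeeping differs in its details, and impression_droppable is a hypothetical helper):

```python
from datetime import datetime, timedelta

click_delay = timedelta(seconds=20)    # withWatermark("clickTime", "20 seconds")
max_click_gap = timedelta(minutes=1)   # clickTime <= impressionTime + 1 minute

# Watermark: no click with event-time older than this will be accepted.
max_click_time_seen = datetime(2018, 3, 6, 4, 4, 0)
click_watermark = max_click_time_seen - click_delay

def impression_droppable(impression_time: datetime) -> bool:
    # The latest click that could still match this impression has
    # event-time impressionTime + 1 minute; once the click watermark
    # passes that, the impression can safely be evicted from state.
    return impression_time + max_click_gap < click_watermark

print(impression_droppable(datetime(2018, 3, 6, 4, 2, 30)))  # old enough to drop
print(impression_droppable(datetime(2018, 3, 6, 4, 3, 0)))   # still needed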

from pyspark.sql.functions import expr

# Define watermarks
impressionsWithWatermark = impressions \
  .selectExpr("adId AS impressionAdId", "impressionTime") \
  .withWatermark("impressionTime", "10 seconds")   # max 10 seconds late
clicksWithWatermark = clicks \
  .selectExpr("adId AS clickAdId", "clickTime") \
  .withWatermark("clickTime", "20 seconds")        # max 20 seconds late


# Inner join with time range conditions
display(
  impressionsWithWatermark.join(
    clicksWithWatermark,
    expr(""" 
      clickAdId = impressionAdId AND 
      clickTime >= impressionTime AND 
      clickTime <= impressionTime + interval 1 minutes    
      """
    )
  )
)

impressionAdId  impressionTime                clickAdId  clickTime
5               2018-03-06T04:03:50.266+0000  5          2018-03-06T04:04:01.613+0000
16              2018-03-06T04:03:52.466+0000  16         2018-03-06T04:04:03.813+0000
25              2018-03-06T04:03:54.266+0000  25         2018-03-06T04:04:05.613+0000
36              2018-03-06T04:03:56.466+0000  36         2018-03-06T04:04:07.813+0000
51              2018-03-06T04:03:59.466+0000  51         2018-03-06T04:04:10.813+0000
60              2018-03-06T04:04:01.266+0000  60         2018-03-06T04:04:12.613+0000
71              2018-03-06T04:04:03.466+0000  71         2018-03-06T04:04:14.813+0000
86              2018-03-06T04:04:06.466+0000  86         2018-03-06T04:04:17.813+0000
95              2018-03-06T04:04:08.266+0000  95         2018-03-06T04:04:19.613+0000
106             2018-03-06T04:04:10.466+0000  106        2018-03-06T04:04:21.813+0000
121             2018-03-06T04:04:13.466+0000  121        2018-03-06T04:04:24.813+0000
136             2018-03-06T04:04:16.466+0000  136        2018-03-06T04:04:27.813+0000
151             2018-03-06T04:04:19.466+0000  151        2018-03-06T04:04:30.813+0000
166             2018-03-06T04:04:22.466+0000  166        2018-03-06T04:04:33.813+0000
181             2018-03-06T04:04:25.466+0000  181        2018-03-06T04:04:36.813+0000
191             2018-03-06T04:04:27.466+0000  191        2018-03-06T04:04:38.813+0000
206             2018-03-06T04:04:30.466+0000  206        2018-03-06T04:04:41.813+0000
221             2018-03-06T04:04:33.466+0000  221        2018-03-06T04:04:44.813+0000

Showing all 41 rows.

We get similar results as with the previous simple join query. However, if you look at the query metrics now, you will find that after a couple of minutes of running the query, the size of the state stabilizes, as the old buffered events start getting cleared up.

Outer Joins with Watermarking

Let's extend this use case to illustrate outer joins. Not all ad impressions will lead to clicks, and you may want to keep track of impressions that did not produce clicks. This can be done by applying a left outer join on the impressions and clicks. The joined output will then contain not only the matched clicks, but also the unmatched impressions (with the click columns being NULL).

While the watermark + event-time constraints are optional for inner joins, they must be specified for left and right outer joins. This is because, to generate the NULL results in an outer join, the engine must know when an input row is not going to match with anything in the future. Hence, the watermark + event-time constraints must be specified to generate correct results.
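The decision rule the engine needs can be sketched as a toy plain-Python function (hypothetical helper, not Spark's implementation): a NULL row for an unmatched impression can only be emitted once the click watermark has passed the last event-time at which a matching click could still arrive.

```python
from datetime import datetime, timedelta

def left_outer_result(impression_time, matched, click_watermark, max_click_gap):
    """Toy decision rule: what can a left outer join do with a buffered impression?"""
    if matched:
        return "emit joined row"
    if impression_time + max_click_gap < click_watermark:
        # No future click can match any more -> safe to emit the NULL result.
        return "emit row with NULL click"
    return "keep buffering"            # a matching click may still arrive

wm = datetime(2018, 3, 6, 4, 33, 0)    # current click watermark
gap = timedelta(minutes=1)             # clickTime <= impressionTime + 1 minute
print(left_outer_result(datetime(2018, 3, 6, 4, 31, 0), False, wm, gap))
print(left_outer_result(datetime(2018, 3, 6, 4, 32, 30), False, wm, gap))
```

This is also why the outer NULL results appear later than the inner results: the engine must wait for the watermark before declaring an impression unmatched.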

from pyspark.sql.functions import expr

# Left outer join with time range conditions
display(
  impressionsWithWatermark.join(
    clicksWithWatermark,
    expr(""" 
      clickAdId = impressionAdId AND 
      clickTime >= impressionTime AND 
      clickTime <= impressionTime + interval 1 minutes    
      """
    ),
    "leftOuter"
  )
)

impressionAdId  impressionTime                clickAdId  clickTime
6               2018-03-06T04:30:53.384+0000  6          2018-03-06T04:31:04.334+0000
15              2018-03-06T04:30:55.184+0000  15         2018-03-06T04:31:06.134+0000
26              2018-03-06T04:30:57.384+0000  26         2018-03-06T04:31:08.334+0000
35              2018-03-06T04:30:59.184+0000  35         2018-03-06T04:31:10.134+0000
46              2018-03-06T04:31:01.384+0000  46         2018-03-06T04:31:12.334+0000
61              2018-03-06T04:31:04.384+0000  61         2018-03-06T04:31:15.334+0000
76              2018-03-06T04:31:07.384+0000  76         2018-03-06T04:31:18.334+0000
91              2018-03-06T04:31:10.384+0000  91         2018-03-06T04:31:21.334+0000
106             2018-03-06T04:31:13.384+0000  106        2018-03-06T04:31:24.334+0000
121             2018-03-06T04:31:16.384+0000  121        2018-03-06T04:31:27.334+0000
130             2018-03-06T04:31:18.184+0000  130        2018-03-06T04:31:29.134+0000
141             2018-03-06T04:31:20.384+0000  141        2018-03-06T04:31:31.334+0000
156             2018-03-06T04:31:23.384+0000  156        2018-03-06T04:31:34.334+0000
171             2018-03-06T04:31:26.384+0000  171        2018-03-06T04:31:37.334+0000
180             2018-03-06T04:31:28.184+0000  180        2018-03-06T04:31:39.134+0000
191             2018-03-06T04:31:30.384+0000  191        2018-03-06T04:31:41.334+0000
206             2018-03-06T04:31:33.384+0000  206        2018-03-06T04:31:44.334+0000
215             2018-03-06T04:31:35.184+0000  215        2018-03-06T04:31:46.134+0000
226             2018-03-06T04:31:37.384+0000  226        2018-03-06T04:31:48.334+0000
235             2018-03-06T04:31:39.184+0000  235        2018-03-06T04:31:50.134+0000
246             2018-03-06T04:31:41.384+0000  246        2018-03-06T04:31:52.334+0000
255             2018-03-06T04:31:43.184+0000  255        2018-03-06T04:31:54.134+0000
266             2018-03-06T04:31:45.384+0000  266        2018-03-06T04:31:56.334+0000
275             2018-03-06T04:31:47.184+0000  275        2018-03-06T04:31:58.134+0000
280             2018-03-06T04:31:48.184+0000  280        2018-03-06T04:31:59.134+0000
282             2018-03-06T04:31:48.584+0000  282        2018-03-06T04:31:59.534+0000
291             2018-03-06T04:31:50.384+0000  291        2018-03-06T04:32:01.334+0000
300             2018-03-06T04:31:52.184+0000  300        2018-03-06T04:32:03.134+0000
305             2018-03-06T04:31:53.184+0000  305        2018-03-06T04:32:04.134+0000

Showing all 163 rows.

After starting this query, you will start getting the inner join results within a minute. After a couple of minutes, you will also start getting the outer NULL results.

Further Information

You can read more about stream-stream joins in the following places: