
Stream-Stream Joins using Structured Streaming (Scala)

This notebook illustrates different ways of joining streams.

We are going to use the canonical example of ad monetization, where we want to find out which ad impressions led to user clicks. Typically, in such scenarios, there are two streams of data from different sources - ad impressions and ad clicks. Both types of events have a common ad identifier (say, adId), and we want to match clicks with impressions based on the adId. In addition, each event also has a timestamp, which we will use to specify additional conditions in the query to limit the streaming state.

In the absence of actual data streams, we are going to generate fake data streams using the built-in "rate" source, which generates data at a given fixed rate.

import org.apache.spark.sql.functions._

spark.conf.set("spark.sql.shuffle.partitions", "1")

val impressions = spark
  .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load()
  .select($"value".as("adId"), $"timestamp".as("impressionTime"))
  
val clicks = spark
  .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load()
  .where((rand() * 100).cast("integer") < 10)       // 10 out of every 100 impressions result in a click
  .select(($"value" - 50).as("adId"), $"timestamp".as("clickTime"))   // -100 so that a click with same id as impression is generated much later.
  .where("adId > 0")


import org.apache.spark.sql.functions._
impressions: org.apache.spark.sql.DataFrame = [adId: bigint, impressionTime: timestamp]
clicks: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [adId: bigint, clickTime: timestamp]

Let's see what data these two streaming DataFrames generate.

display(impressions)
adId  impressionTime
0     2018-03-06T04:32:09.076+0000
1     2018-03-06T04:32:09.276+0000
2     2018-03-06T04:32:09.476+0000
3     2018-03-06T04:32:09.676+0000
4     2018-03-06T04:32:09.876+0000
5     2018-03-06T04:32:10.076+0000
6     2018-03-06T04:32:10.276+0000
7     2018-03-06T04:32:10.476+0000
8     2018-03-06T04:32:10.676+0000
9     2018-03-06T04:32:10.876+0000
10    2018-03-06T04:32:11.076+0000
11    2018-03-06T04:32:11.276+0000
12    2018-03-06T04:32:11.476+0000
13    2018-03-06T04:32:11.676+0000
14    2018-03-06T04:32:11.876+0000
15    2018-03-06T04:32:12.076+0000
16    2018-03-06T04:32:12.276+0000
17    2018-03-06T04:32:12.476+0000

Showing all 25 rows.

display(clicks)
adId  clickTime
3     2018-03-06T04:32:31.941+0000
5     2018-03-06T04:32:32.341+0000
8     2018-03-06T04:32:32.941+0000
10    2018-03-06T04:32:33.341+0000
13    2018-03-06T04:32:33.941+0000
15    2018-03-06T04:32:34.341+0000
18    2018-03-06T04:32:34.941+0000
23    2018-03-06T04:32:35.941+0000
25    2018-03-06T04:32:36.341+0000
28    2018-03-06T04:32:36.941+0000
33    2018-03-06T04:32:37.941+0000
35    2018-03-06T04:32:38.341+0000
38    2018-03-06T04:32:38.941+0000
43    2018-03-06T04:32:39.941+0000
45    2018-03-06T04:32:40.341+0000
48    2018-03-06T04:32:40.941+0000
50    2018-03-06T04:32:41.341+0000
53    2018-03-06T04:32:41.941+0000

Showing all 74 rows.
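
The display(...) calls above are a Databricks notebook feature. If you are running this code outside Databricks, a rough equivalent is to write a stream to the console sink for a short while. This is a minimal sketch under that assumption; the value name and the 10-second timeout are just illustrative choices.

// Minimal sketch: print a few micro-batches of the impressions stream to the console.
val previewQuery = impressions.writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode("append")
  .start()

previewQuery.awaitTermination(10 * 1000)   // let the stream run for about 10 seconds
previewQuery.stop()                        // then stop it so it does not keep consuming cores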

Note:

  • If you get an error saying the join is not supported, you may be running this notebook on an older version of Spark; stream-stream joins require Spark 2.3.0 or later (see the quick check after this list).
  • If you are running on Community Edition, click Cancel above to stop the streams, as you do not have enough cores to run many streams simultaneously.
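
As a quick check for the first note above, you can print the Spark version of the cluster this notebook is attached to.

// Stream-stream joins were introduced in Spark 2.3.0; verify the attached cluster's version.
println(s"Running Spark ${spark.version}")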

Inner Join

Let's join these two data streams. This is exactly the same as joining two batch DataFrames/Datasets by their common key adId.

display(impressions.join(clicks, "adId"))
adId  impressionTime                clickTime
3     2018-03-06T04:33:30.031+0000  2018-03-06T04:33:41.372+0000
8     2018-03-06T04:33:31.031+0000  2018-03-06T04:33:42.372+0000
11    2018-03-06T04:33:31.631+0000  2018-03-06T04:33:42.972+0000
13    2018-03-06T04:33:32.031+0000  2018-03-06T04:33:43.372+0000
15    2018-03-06T04:33:32.431+0000  2018-03-06T04:33:43.772+0000
18    2018-03-06T04:33:33.031+0000  2018-03-06T04:33:44.372+0000
23    2018-03-06T04:33:34.031+0000  2018-03-06T04:33:45.372+0000
26    2018-03-06T04:33:34.631+0000  2018-03-06T04:33:45.972+0000
28    2018-03-06T04:33:35.031+0000  2018-03-06T04:33:46.372+0000
30    2018-03-06T04:33:35.431+0000  2018-03-06T04:33:46.772+0000
33    2018-03-06T04:33:36.031+0000  2018-03-06T04:33:47.372+0000
38    2018-03-06T04:33:37.031+0000  2018-03-06T04:33:48.372+0000
41    2018-03-06T04:33:37.631+0000  2018-03-06T04:33:48.972+0000
43    2018-03-06T04:33:38.031+0000  2018-03-06T04:33:49.372+0000
45    2018-03-06T04:33:38.431+0000  2018-03-06T04:33:49.772+0000
48    2018-03-06T04:33:39.031+0000  2018-03-06T04:33:50.372+0000
53    2018-03-06T04:33:40.031+0000  2018-03-06T04:33:51.372+0000
56    2018-03-06T04:33:40.631+0000  2018-03-06T04:33:51.972+0000

Showing all 94 rows.

Note the matched impressions and clicks (matched timestamps, to be specific) that continuously appear in the result table above.

In addition, if you expand the details of the query above, you will find a few timelines of query metrics - the processing rates, the micro-batch durations, and the size of the state. If you keep running this query, you will notice that the state keeps growing in an unbounded manner. This is because the query must buffer all past input, since any new input can match with any input from the past.
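
One standard way to bound this state is to declare watermarks on both streams and add an event-time range condition to the join, so that Spark can drop state that is too old to ever produce a match. The sketch below illustrates the idea; the renamed columns, watermark delays, and the one-minute join window are illustrative choices for this fake data, not prescriptive values.

// Minimal sketch (illustrative delays and window): watermarks tell Spark how late events
// can be, and the time-range condition tells it how long an impression can wait for a click.
val impressionsWithWatermark = impressions
  .withColumnRenamed("adId", "impressionAdId")
  .withWatermark("impressionTime", "10 seconds")

val clicksWithWatermark = clicks
  .withColumnRenamed("adId", "clickAdId")
  .withWatermark("clickTime", "20 seconds")

val limitedStateJoin = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 minute
    """))

display(limitedStateJoin)

With these constraints, an impression older than the watermark plus the join window can no longer match any future click, so its buffered state can be cleaned up instead of accumulating forever.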