Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream. Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including:
- Coalescing small files produced by low latency ingest
- Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
- Efficiently discovering which files are new when using files as the source for a stream
When you load a Delta table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started.
You can load both paths and tables as a stream.
You can also:
- Control the maximum size of any micro-batch that Delta Lake gives to streaming by setting the maxFilesPerTrigger option. This specifies the maximum number of new files to be considered in every trigger. The default is 1000.
- Rate-limit how much data gets processed in each micro-batch by setting the maxBytesPerTrigger option. This sets a “soft max,” meaning that a batch processes approximately this amount of data and may process more than the limit. If you use Trigger.Once for your streaming, this option is ignored. If you use this option in conjunction with maxFilesPerTrigger, the micro-batch processes data until either the maxFilesPerTrigger or the maxBytesPerTrigger limit is reached.
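As a rough illustration of the two rate-limiting options above, the following Python sketch wraps the streaming reader in a small helper. The helper name `read_rate_limited` and its parameters are hypothetical; only the `maxFilesPerTrigger` and `maxBytesPerTrigger` option names come from the text. It assumes an existing SparkSession is passed in as `spark`.

```python
def read_rate_limited(spark, path, max_files=None, max_bytes=None):
    """Return a streaming DataFrame over the Delta table at `path`.

    `spark` is assumed to be an existing SparkSession. Both limits are
    optional; when both are set, a micro-batch stops growing as soon as
    either limit is reached.
    """
    reader = spark.readStream.format("delta")
    if max_files is not None:
        # Upper bound on the number of new files considered per trigger.
        reader = reader.option("maxFilesPerTrigger", str(max_files))
    if max_bytes is not None:
        # Soft maximum: a batch may process somewhat more than this amount.
        reader = reader.option("maxBytesPerTrigger", str(max_bytes))
    return reader.load(path)
```

A call such as `read_rate_limited(spark, "/mnt/delta/user_events", max_files=100)` would then cap each micro-batch at 100 new files while leaving the byte limit unset.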
Structured Streaming does not handle input that is not an append and throws an exception if any modifications occur on the table being used as a source. There are two main strategies for dealing with changes that cannot be automatically propagated downstream:
- You can delete the output and checkpoint and restart the stream from the beginning.
- You can set either of these two options:
  - ignoreDeletes: ignore transactions that delete data at partition boundaries.
  - ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, so your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore, if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.
For example, suppose you have a table user_events with date, user_email, and action columns that is partitioned by date. You stream out of the user_events table and you need to delete data from it due to GDPR.
When you delete at partition boundaries (that is, the WHERE is on a partition column), the files are already segmented by value, so the delete just drops those files from the metadata. Thus, if you just want to delete data from some partitions, you can use:
spark.readStream.format("delta").option("ignoreDeletes", "true").load("/mnt/delta/user_events")
However, if you have to delete data based on user_email, then you will need to use:
spark.readStream.format("delta").option("ignoreChanges", "true").load("/mnt/delta/user_events")
If you update a user_email with the UPDATE statement, the file containing the user_email in question is rewritten. When you use ignoreChanges, the new record is propagated downstream with all other unchanged records that were in the same file. Your logic should be able to handle these incoming duplicate records.
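To make the duplicate behavior concrete, here is a toy Python model. It is not Delta Lake code: it assumes a file is simply a list of rows, that an update rewrites the whole file, and that the stream then re-emits every row of the rewritten file. The helpers `rewrite_file` and `drop_seen` and the sample rows are illustrative assumptions, and dropping exact repeats is only one possible way for a consumer to cope.

```python
def rewrite_file(rows, old_email, new_email):
    """Model an UPDATE: return the rewritten file with the email replaced."""
    return [
        {**row, "user_email": new_email} if row["user_email"] == old_email else row
        for row in rows
    ]


def drop_seen(batch, seen):
    """Keep only rows not emitted before; `seen` is updated in place."""
    fresh = []
    for row in batch:
        key = (row["date"], row["user_email"], row["action"])
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh


original_file = [
    {"date": "2018-10-18", "user_email": "a@example.com", "action": "click"},
    {"date": "2018-10-18", "user_email": "b@example.com", "action": "view"},
]

seen = set()
# Initial load: both rows are new to the consumer.
first_batch = drop_seen(original_file, seen)
# The update rewrites the file, so both rows are emitted again.
rewritten = rewrite_file(original_file, "a@example.com", "a2@example.com")
# Only the updated row survives; the unchanged row is recognized as a repeat.
second_batch = drop_seen(rewritten, seen)
```

Note that this simple filter would also discard legitimately repeated identical events, and it does not retract the old version of the updated row, so a real consumer would likely need a proper key to merge on.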
This feature is available on Databricks Runtime 7.3 and above.
You can use the following options to specify the starting point of the Delta Lake streaming source without processing the entire table.
- startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the `DESCRIBE HISTORY events` command output.
- startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. It can be any one of:
  - '2018-10-18T22:15:12.013Z', that is, a string that can be cast to a timestamp
  - cast('2018-10-18 13:36:32 CEST' as timestamp)
  - '2018-10-18', that is, a date string
  - Any other expression that is or can be cast to a timestamp, such as current_timestamp() - interval 12 hours
You cannot set both options at the same time; you need to use only one of them. They take effect only when starting a new streaming query. If a streaming query has started and the progress has been recorded in its checkpoint, these options are ignored.
Although you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise the streaming source may return incorrect results when reading the data with an incorrect schema.
For example, suppose you have a table user_events. If you want to read changes since version 5, you can use:
spark.readStream.format("delta").option("startingVersion", "5").load("/mnt/delta/user_events")
If you want to read changes since 2018-10-18, you can use:
spark.readStream.format("delta").option("startingTimestamp", "2018-10-18").load("/mnt/delta/user_events")
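Because the two starting-point options cannot be combined, it may help to check this before the query is built. The Python sketch below shows one way to do that; the helper name `read_from_position` and its parameters are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def read_from_position(spark, path, starting_version=None, starting_timestamp=None):
    """Open a Delta stream at a given version or timestamp, but not both."""
    if starting_version is not None and starting_timestamp is not None:
        # Fail early with a clear message instead of building an invalid query.
        raise ValueError("Set only one of starting_version or starting_timestamp")
    reader = spark.readStream.format("delta")
    if starting_version is not None:
        reader = reader.option("startingVersion", str(starting_version))
    if starting_timestamp is not None:
        reader = reader.option("startingTimestamp", starting_timestamp)
    return reader.load(path)
```

Keep in mind that, as described above, these options only matter for a new query; once progress is recorded in the checkpoint, they are ignored.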
You can also write data into a Delta table using Structured Streaming. The transaction log enables Delta Lake to guarantee exactly-once processing, even when there are other streams or batch queries running concurrently against the table.
By default, streams run in append mode, which adds new records to the table.
You can use the path method:
events.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/mnt/delta/events/_checkpoints/etl-from-json").start("/mnt/delta/events")
or the table method:
events.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/mnt/delta/events/_checkpoints/etl-from-json").table("events")
You can also use Structured Streaming to replace the entire table with every batch. One example use case is to compute a summary using aggregation:
spark.readStream.format("delta").load("/mnt/delta/events").groupBy("customerId").count().writeStream.format("delta").outputMode("complete").option("checkpointLocation", "/mnt/delta/eventsByCustomer/_checkpoints/streaming-agg").start("/mnt/delta/eventsByCustomer")
The preceding example continuously updates a table that contains the aggregate number of events by customer.
For applications with more lenient latency requirements, you can save computing resources with one-time triggers. Use these to update summary aggregation tables on a given schedule, processing only new data that has arrived since the last update.
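A one-time trigger could be sketched in Python as follows. The helper name `run_once` and its parameters are hypothetical; `stream_df` is assumed to be a streaming DataFrame such as the aggregation shown earlier, and `trigger(once=True)` is the PySpark form of Trigger.Once.

```python
def run_once(stream_df, target_path, checkpoint_path, output_mode="append"):
    """Process the data that arrived since the last run, then stop."""
    return (
        stream_df.writeStream
        .format("delta")
        .outputMode(output_mode)
        # The checkpoint records progress, so each run picks up only new data.
        .option("checkpointLocation", checkpoint_path)
        # Run a single batch instead of keeping the query alive.
        .trigger(once=True)
        .start(target_path)
    )
```

A scheduled job might call this helper and then wait on the returned query with `awaitTermination()` before exiting, so that compute is used only while the update is running.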