Table Streaming Reads and Writes

Databricks Delta is deeply integrated with Spark Structured Streaming through readStream and writeStream. Databricks Delta overcomes many of the limitations typically associated with streaming systems and files, including:

  • Coalescing small files produced by low-latency ingest.
  • Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs).
  • Efficiently discovering which files are new when using files as the source for a stream.

As a source

When you load a Databricks Delta table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started. You can load either a path or a table as a stream:

spark.readStream.format("delta").load("/delta/events")

or

spark.readStream.format("delta").table("events")

You can also control the maximum size of any micro-batch that Databricks Delta gives to streaming by setting the maxFilesPerTrigger option, which specifies the maximum number of new files to be considered in every trigger. The default is 1000.
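
For example, a minimal sketch of setting this option on the stream source (the value 100 is an arbitrary illustrative choice):

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", 100) // consider at most 100 new files per trigger
  .load("/delta/events")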

Ignoring updates and deletes

Structured Streaming does not handle input that is not an append and throws an exception if any modifications occur on the table being used as a source. There are two main strategies for dealing with changes that cannot be propagated downstream automatically:

  • Since Databricks Delta tables retain all history by default, in many cases you can delete the output and checkpoint and restart the stream from the beginning.
  • You can set either of these two options (see the example after this list):
    • ignoreDeletes skips any transaction that deletes data from the source table. For example, you can set this option if you delete data from the source table and you do not want these deletions to propagate downstream.
    • ignoreChanges skips any transaction that updates data in the source table. For example, you can set this option if you use UPDATE, MERGE, or OVERWRITE and these changes do not need to be propagated downstream.
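
For instance, a minimal sketch of reading the source while skipping delete transactions; set ignoreChanges in the same way to skip updates. The path reuses the /delta/events table from the earlier examples:

spark.readStream
  .format("delta")
  .option("ignoreDeletes", "true") // do not fail on transactions that delete source data
  .load("/delta/events")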

As a sink

You can also write data into a Databricks Delta table using Structured Streaming. The transaction log enables Databricks Delta to guarantee exactly-once processing, even when there are other streams or batch queries running concurrently against the table.

Append mode

By default, streams run in append mode, which adds new records to the table. You can use the path or table method:

events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/delta/events/_checkpoints/etl-from-json")
  .start("/delta/events") // as a path

or

events.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/delta/events/_checkpoints/etl-from-json")
  .table("events")

Complete mode

You can also use Structured Streaming to replace the entire table with every batch. One example use case is to compute a summary using aggregation:

spark.readStream
  .table("events")
  .groupBy("customerId")
  .count()
  .writeStream
  .format("delta")
  .outputMode("complete")
  .option("checkpointLocation", "/delta/eventsByCustomer/_checkpoints/streaming-agg")
  .start("/delta/eventsByCustomer")

The preceding example continuously updates a table that contains the aggregate number of events by customer.

For applications with more lenient latency requirements, you can save computing resources with one-time triggers. Use these to update summary aggregation tables on a given schedule, processing only new data that has arrived since the last update.
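
As a sketch, the preceding aggregation could be run from a scheduled job with a one-time trigger; because progress is tracked in the checkpoint, each run reads only the data that arrived since the previous run:

import org.apache.spark.sql.streaming.Trigger

spark.readStream
  .table("events")
  .groupBy("customerId")
  .count()
  .writeStream
  .format("delta")
  .outputMode("complete")
  .option("checkpointLocation", "/delta/eventsByCustomer/_checkpoints/streaming-agg")
  .trigger(Trigger.Once()) // process all available new data once, then stop
  .start("/delta/eventsByCustomer")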