Structured Streaming provides fault-tolerance and data consistency for streaming queries; using Databricks workflows, you can easily configure your Structured Streaming queries to automatically restart on failure. By enabling checkpointing for a streaming query, you can restart the query after a failure. The restarted query continues where the failed one left off.
Databricks recommends that you always set the checkpointLocation option to a cloud storage path before you start the query. For example:
streamingDataFrame.writeStream
  .format("parquet")
  .option("path", "/path/to/table")
  .option("checkpointLocation", "/path/to/table/_checkpoint")
  .start()
This checkpoint location preserves all of the essential information that identifies a query. Each query must have its own checkpoint location; multiple queries should never share one. For more information, see the Structured Streaming Programming Guide.
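The one-checkpoint-per-query rule can be checked before any query starts. The following is a minimal sketch in plain Python, not part of any Databricks or Spark API: the helper names `checkpoint_path` and `find_shared_checkpoints`, the query names, and the table paths are all hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import Dict, List


def checkpoint_path(table_path: str) -> str:
    """Derive a checkpoint directory from a table path, mirroring the
    /path/to/table/_checkpoint layout used in the example above."""
    return table_path.rstrip("/") + "/_checkpoint"


def find_shared_checkpoints(checkpoints: Dict[str, str]) -> List[str]:
    """Return checkpoint locations that are assigned to more than one query."""
    counts = Counter(path.rstrip("/") for path in checkpoints.values())
    return sorted(path for path, n in counts.items() if n > 1)


# Hypothetical query names mapped to their checkpoint locations.
queries = {
    "orders_to_parquet": checkpoint_path("/path/to/orders"),
    "clicks_to_parquet": checkpoint_path("/path/to/clicks/"),
}

# Fail fast if two queries would write to the same checkpoint location.
shared = find_shared_checkpoints(queries)
if shared:
    raise ValueError(f"Checkpoint locations shared by several queries: {shared}")
```

Each value in `queries` could then be passed as the checkpointLocation option of the corresponding query.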
checkpointLocation is required for most types of output sinks. Some sinks, such as the memory sink, may automatically generate a temporary checkpoint location when you do not provide checkpointLocation. These temporary checkpoint locations do not ensure any fault tolerance or data consistency guarantees and may not get cleaned up properly. Avoid potential pitfalls by always specifying a checkpointLocation.
You can create a Databricks job with the notebook or JAR that has your streaming queries and configure it to:
Always use a new cluster.
Always retry on failure.
Automatically restarting on job failure is especially important when configuring streaming workloads with schema evolution. Schema evolution works on Databricks by raising an expected error when a schema change is detected, and then properly processing data using the new schema when the job restarts. Databricks recommends always configuring streaming tasks that contain queries with schema evolution to restart automatically in Databricks workflows.
Jobs have tight integration with Structured Streaming APIs and can monitor all streaming queries active in a run. This configuration ensures that if any part of the query fails, jobs automatically terminate the run (along with all the other queries) and start a new run in a new cluster. This re-runs the notebook or JAR code and restarts all of the queries again. This is the safest way to return to a good state.
Failure in any of the active streaming queries causes the active run to fail and terminate all the other streaming queries.
You do not need to use spark.streams.awaitAnyTermination() at the end of your notebook. Jobs automatically prevent a run from completing when a streaming query is active.
Databricks recommends using jobs instead of dbutils.notebook.run() when orchestrating Structured Streaming notebooks. See Run a Databricks notebook from another notebook.
The following is an example of a recommended job configuration.
Cluster: Always use a new cluster with the latest Spark version (or at least version 2.1). Queries started in Spark 2.1 and above are recoverable after query and Spark version upgrades.
Notifications: Set this if you want email notification on failures.
Schedule: Do not set a schedule.
Timeout: Do not set a timeout. Streaming queries run for an indefinitely long time.
Maximum concurrent runs: Set to 1. There must be only one instance of each query concurrently active.
Retries: Set to Unlimited.
See Create and run Databricks Jobs to understand these configurations.
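The settings above might be expressed as a job definition along the following lines. This is a sketch that assumes the field names of the Databricks Jobs API 2.1 (`max_concurrent_runs`, `max_retries`, `timeout_seconds`, `email_notifications`); verify them against the API reference before use. The job name, notebook path, e-mail address, runtime version, node type, and worker count are placeholders, not recommendations.

```python
import json

# Hypothetical job definition for a notebook that starts streaming queries.
job_settings = {
    "name": "structured-streaming-job",          # placeholder name
    "max_concurrent_runs": 1,                    # one active instance of each query
    # No "schedule" key: the job is started once and keeps running.
    "email_notifications": {"on_failure": ["oncall@example.com"]},  # placeholder
    "tasks": [
        {
            "task_key": "run_streaming_queries",
            "notebook_task": {"notebook_path": "/path/to/streaming_notebook"},
            "new_cluster": {                     # always run on a new cluster
                "spark_version": "13.3.x-scala2.12",  # placeholder runtime
                "node_type_id": "i3.xlarge",          # placeholder node type
                "num_workers": 2,                     # placeholder size
            },
            "max_retries": -1,                   # assumed: -1 retries indefinitely
            "timeout_seconds": 0,                # assumed: 0 means no timeout
        }
    ],
}

# Serialize the definition, for example to submit it as a request body.
payload = json.dumps(job_settings, indent=2)
print(payload)
```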
There are limitations on what changes in a streaming query are allowed between restarts from the same checkpoint location. Here are a few changes that are either not allowed or whose effect is not well-defined. For all of them:
The term allowed means you can make the specified change, but whether the semantics of its effect are well-defined depends on the query and the change.
The term not allowed means you should not do the specified change as the restarted query is likely to fail with unpredictable errors.
sdf represents a streaming DataFrame/Dataset generated with sparkSession.readStream.
Changes in the number or type (that is, different source) of input sources: This is not allowed.
Changes in the parameters of input sources: Whether this is allowed and whether the semantics of the change are well-defined depends on the source and the query. Here are a few examples.
Addition, deletion, and modification of rate limits is allowed:
spark.readStream.format("kafka").option("subscribe", "article").option("maxOffsetsPerTrigger", ...)
Changes to subscribed topics and files are generally not allowed as the results are unpredictable.
Changes in the trigger interval: You can change triggers between incremental batches and time intervals. See Changing trigger intervals between runs.
Changes in the type of output sink: Changes between a few specific combinations of sinks are allowed. This needs to be verified on a case-by-case basis. Here are a few examples.
File sink to Kafka sink is allowed. Kafka will see only the new data.
Kafka sink to file sink is not allowed.
Changing a Kafka sink to foreach, or vice versa, is allowed.
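The sink transitions listed above can be captured in a small lookup table, for instance as a pre-deployment check. This is an illustrative sketch only: the sink labels and the helper `sink_change_allowed` are hypothetical, and the table contains nothing beyond the combinations stated in the list.

```python
from typing import Dict, Optional, Tuple

# (old sink, new sink) -> whether the change is allowed, as stated above.
SINK_CHANGES: Dict[Tuple[str, str], bool] = {
    ("file", "kafka"): True,     # Kafka will see only the new data
    ("kafka", "file"): False,
    ("kafka", "foreach"): True,
    ("foreach", "kafka"): True,
}


def sink_change_allowed(old_sink: str, new_sink: str) -> Optional[bool]:
    """Return True or False for a listed transition, or None when the
    combination is not listed and must be verified case by case."""
    return SINK_CHANGES.get((old_sink, new_sink))


print(sink_change_allowed("file", "kafka"))
print(sink_change_allowed("kafka", "file"))
print(sink_change_allowed("file", "foreach"))
```

A `None` result is deliberately distinct from `False`: it signals a combination that the list does not cover.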
Changes in the parameters of output sink: Whether this is allowed and whether the semantics of the change are well-defined depends on the sink and the query. Here are a few examples.
Changing the output directory of a file sink is not allowed.
Changing the output topic of a Kafka sink is allowed.
Changes to the user-defined foreach sink (that is, the ForeachWriter code) are allowed, but the semantics of the change depend on the code.
Changes in projection / filter / map-like operations: Some cases are allowed. For example:
Addition and deletion of filters are allowed.
Changes in projections with the same output schema are allowed, for example changing sdf.selectExpr("stringColumn AS json").writeStream to select a different string column under the same alias.
Changes in projections with a different output schema are conditionally allowed: a change to sdf.selectExpr("b").writeStream is allowed only if the output sink allows the resulting schema change.
Changes in stateful operations: Some operations in streaming queries need to maintain state data in order to continuously update the result. Structured Streaming automatically checkpoints the state data to fault-tolerant storage (for example, DBFS, AWS S3, Azure Blob storage) and restores it after restart. However, this assumes that the schema of the state data remains the same across restarts. This means that any changes (that is, additions, deletions, or schema modifications) to the stateful operations of a streaming query are not allowed between restarts. The following stateful operations must keep the same schema between restarts to ensure state recovery:
Streaming aggregation: For example,
sdf.groupBy("a").agg(...). Any change in number or type of grouping keys or aggregates is not allowed.
Streaming deduplication: For example,
sdf.dropDuplicates("a"). Any change in number or type of grouping keys or aggregates is not allowed.
Stream-stream join: For example,
sdf1.join(sdf2, ...) (that is, both inputs are generated with
sparkSession.readStream). Changes in the schema or equi-joining columns are not allowed. Changes in join type (outer or inner) are not allowed. Other changes in the join condition are ill-defined.
Arbitrary stateful operation: For example,
sdf.groupByKey(...).flatMapGroupsWithState(...). Any change to the schema of the user-defined state or the type of timeout is not allowed. Any change within the user-defined state-mapping function is allowed, but the semantic effect of the change depends on the user-defined logic. If you really want to support state schema changes, you can explicitly encode and decode your complex state data structures into bytes using an encoding/decoding scheme that supports schema migration. For example, if you save your state as Avro-encoded bytes, you can change the Avro state schema between query restarts, because the binary state can still be restored.
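The encode-to-bytes approach can be sketched without Spark. The example below uses versioned JSON instead of Avro purely to stay dependency-free; an Avro schema with default values would play the same role. The state class, its fields, and the `version` tag are hypothetical. The point is that the stored state stays a byte array, so its schema as seen by the checkpoint never changes, while the decoder handles older layouts.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SessionStateV2:
    """Current state layout; last_event_ms was added after version 1."""
    event_count: int
    last_event_ms: Optional[int] = None


def encode_state(state: SessionStateV2) -> bytes:
    """Serialize the state with an explicit layout version."""
    payload = {"version": 2, **asdict(state)}
    return json.dumps(payload).encode("utf-8")


def decode_state(raw: bytes) -> SessionStateV2:
    """Deserialize state written by either layout version."""
    payload = json.loads(raw.decode("utf-8"))
    version = payload.pop("version", 1)
    if version == 1:
        # Version 1 had no last_event_ms; fall back to the default.
        return SessionStateV2(event_count=payload["event_count"])
    if version == 2:
        return SessionStateV2(**payload)
    raise ValueError(f"Unsupported state version: {version}")


# Bytes as an older version of the query would have written them.
old_bytes = json.dumps({"version": 1, "event_count": 7}).encode("utf-8")
restored = decode_state(old_bytes)
print(restored)
```

In a real query, the state-mapping function would call helpers like these when reading and updating its state, keeping the declared state type binary.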