Skip to main content

Load and process data incrementally with Lakeflow Spark Declarative Pipelines flows

Data is processed in pipelines through flows. Each flow consists of a query and, typically, a target. The flow processes the query, either as a batch, or incrementally as a stream of data into the target. A flow lives within a pipeline in Lakeflow Spark Declarative Pipelines.

Typically, flows are defined automatically when you create a query in a pipeline that updates a target, but you can also explicitly define additional flows for more complex processing, such as appending to a single target from multiple sources.

Updates

A flow is run each time that its defining pipeline is updated. The flow will create or update tables with the most recent data available. Depending on the type of flow and the state of the changes to the data, the update may perform an incremental refresh, which processes only new records, or perform a full refresh, that reprocesses all records from the data source.

Default flows and append flows

When you create a query in a pipeline that updates a target, a default flow is defined automatically. For a streaming table, the default flow is an append flow that adds new rows with each update, and it has the same name as the target. Creating a flow and its target in a single step is the most common way to use pipelines, and you can use it to ingest or transform data.

You can also define flows separately from a target, which lets multiple flows append data to a single target. This is useful when you need to:

  • Add streaming sources that append to an existing streaming table without requiring a full refresh.
  • Backfill a streaming table with missing historical data.
  • Combine data from multiple sources without using a UNION clause.

For examples of creating default and explicit flows, see Use flows in Lakeflow Spark Declarative Pipelines.

Types of flows

The default flows for streaming tables and materialized views are append flows. You can also create flows to read from change data capture data sources. The following table describes the different types of flows.

Flow type

Description

Append

Append flows are the most common type of flow, where new records in the source are written to the target with each update. They correspond to append mode in structured streaming. You can add the ONCE flag, indicating a batch query whose data should be inserted into the target only once, unless the target is fully refreshed. Any number of append flows can write to a particular target.

Default flows (created with the target streaming table or materialized view) will have the same name as the target. Other targets do not have default flows.

Auto CDC (previously apply changes)

An Auto CDC flow ingests a query containing change data capture (CDC) data. Auto CDC flows can only target streaming tables, and the source must be a streaming source (even in the case of ONCE flows). Multiple auto CDC flows can target a single streaming table. A streaming table that acts as a target for an auto CDC flow can only be targeted by other auto CDC flows.

For more information about CDC data, see The AUTO CDC APIs: Simplify change data capture with pipelines.

Update (Public Preview)

Update flows output global, non-watermarked streaming aggregates to a sink, emitting only the records that changed in each batch.

Update flows are only available in Python. See update_flow.

Additional resources

For more information on flows and their use, see the following topics: