How do pipelines refresh?

When a pipeline update runs, it refreshes the materialized views and streaming tables defined in the pipeline so their results reflect the current state of the source data. How a dataset refreshes depends on the dataset type and the type of refresh. This page explains the refresh concepts shared across Lakeflow Spark Declarative Pipelines. For how to trigger and manage updates, see Run a pipeline update.

Refresh types

By default, every materialized view and streaming table in a pipeline refreshes with each update. The following table summarizes how each refresh type behaves:

Update type | Materialized view | Streaming table
--- | --- | ---
Refresh (default) | Updates results to reflect the current results of the defining query. Databricks examines the cost and performs an incremental refresh when it is more efficient. | Processes new records through the logic defined in streaming tables and flows.
Full refresh | Recomputes results to reflect the current results of the defining query. | Clears data from streaming tables, clears checkpoints from flows, and reprocesses all records from the data source.
Reset streaming flow checkpoints | Not applicable to materialized views. | Clears checkpoints from flows but does not clear data from streaming tables, then reprocesses all records from the data source.

Refresh (default)

A default refresh updates a dataset to reflect the current results of its defining query.

Streaming tables are inherently incremental. A streaming table refresh evaluates only the records that arrived since the last update and appends them, using the current definition of the table. Older records are not reprocessed, so changes that would affect already-written data are not applied. In other words, a default refresh of a streaming table trades data correctness for lower time and resource costs. To reprocess older data, run a full refresh or reset the flow checkpoints.
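The append-only behavior described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not pipeline code: `ToyStreamingTable`, its `checkpoint` counter, and the in-memory `source` list are all hypothetical stand-ins chosen for illustration. It shows that changing the definition between updates affects only records that arrive afterward.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyStreamingTable:
    """Hypothetical in-memory stand-in for a streaming table fed by one flow."""

    rows: List[str] = field(default_factory=list)
    checkpoint: int = 0  # number of source records already processed

    def refresh(self, source: List[str], transform: Callable[[str], str]) -> None:
        """Default refresh: process only records past the checkpoint and append them."""
        new_records = source[self.checkpoint:]
        self.rows.extend(transform(record) for record in new_records)
        self.checkpoint = len(source)


source = ["a", "b"]
table = ToyStreamingTable()
table.refresh(source, str.upper)  # first update processes "a" and "b"

source.append("c")
# The definition changes before the second update; only "c" sees the new logic.
table.refresh(source, lambda record: record + "!")
print(table.rows)  # ['A', 'B', 'c!']
```

The rows written by the first update keep the old logic, which is the correctness trade-off the paragraph above describes.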

Materialized views attempt an incremental refresh but reprocess all records when necessary to keep the table fully accurate. A materialized view is refreshed using one of two methods:

  • Incremental refresh identifies the changes since the last update and merges only the new or modified data.
  • Full refresh runs the entire query and replaces the existing data when an incremental refresh isn't possible or isn't cost-effective.

By default, Databricks uses a cost model to choose the more cost-effective method. You can override this choice with a refresh policy. For the semantics, requirements, and supported SQL for incremental refresh, see Incremental refresh for materialized views.
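The two refresh methods can be compared with a small sketch. It assumes a hypothetical materialized view that sums order amounts per customer and a source that only receives inserts. The function names and data are invented for illustration, and the real incremental refresh has its own requirements and also handles modified data. The point is only that merging the changes and recomputing from scratch arrive at the same result.

```python
from typing import Dict, Iterable, Tuple

Order = Tuple[str, int]  # (customer, amount); hypothetical source schema


def full_refresh(orders: Iterable[Order]) -> Dict[str, int]:
    """Run the entire 'query' (sum of amounts per customer) over all source rows."""
    totals: Dict[str, int] = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


def incremental_refresh(previous: Dict[str, int], new_orders: Iterable[Order]) -> Dict[str, int]:
    """Merge only the rows that arrived since the last update into the prior result."""
    totals = dict(previous)
    for customer, amount in new_orders:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


base_orders = [("ann", 10), ("bob", 5)]
new_orders = [("ann", 7), ("cy", 3)]

view = full_refresh(base_orders)                       # initial computation
incremental = incremental_refresh(view, new_orders)    # touches 2 rows
recomputed = full_refresh(base_orders + new_orders)    # touches all 4 rows

print(incremental == recomputed)  # True
```

In this toy case the incremental path reads fewer rows for the same answer, which is the kind of difference a cost model would weigh.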

Full refresh

A full refresh reprocesses all records from the source data through the logic that defines the dataset:

  • For a materialized view, a full refresh recomputes the entire result. Because materialized views always return the same result as a batch query, a default refresh and a full refresh produce identical data.
  • For a streaming table, a full refresh truncates the table, clears the streaming checkpoints for its flows, and reprocesses every record from the source.

Because a full refresh reprocesses all source data, its time and cost scale with the size of that data. Databricks recommends running a full refresh only when necessary, such as when a definition or schema change is not compatible with the existing data. A full refresh of a streaming table can also lose records if the source no longer retains the original data, for example, when records in a Kafka topic have aged out past its retention window: the table is truncated, but the expired records cannot be read again.

For when and how to run a full refresh of a streaming table, see Full refresh for streaming tables.

Reset checkpoints

Resetting checkpoints applies only to streaming tables. It clears the streaming checkpoints for selected flows without clearing the data already written to the streaming table, then reprocesses all records from the source through those flows. Unlike a full refresh, the existing table data is retained.

Reset checkpoints when you want to reprocess a streaming source for selected flows without truncating the table, for example, after changing a flow's logic.
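The difference between a full refresh and a checkpoint reset can be sketched with a toy append-only flow. Everything here (`ToyFlow`, its methods, the integer source) is hypothetical and runs in memory. It models only the clearing behavior described above, not how a real flow writes or deduplicates data.

```python
from typing import Callable, List


class ToyFlow:
    """Hypothetical append-only flow writing into an in-memory 'streaming table'."""

    def __init__(self, source: List[int]) -> None:
        self.source = source
        self.table: List[int] = []
        self.checkpoint = 0  # index of the next unprocessed source record

    def update(self, logic: Callable[[int], int]) -> None:
        """Default refresh: process records past the checkpoint and append the results."""
        self.table.extend(logic(x) for x in self.source[self.checkpoint:])
        self.checkpoint = len(self.source)

    def full_refresh(self, logic: Callable[[int], int]) -> None:
        """Clear the table data and the checkpoint, then reprocess everything."""
        self.table.clear()
        self.checkpoint = 0
        self.update(logic)

    def reset_checkpoint(self, logic: Callable[[int], int]) -> None:
        """Clear only the checkpoint; existing rows stay in the table."""
        self.checkpoint = 0
        self.update(logic)


def double(x: int) -> int:
    return x * 2


def triple(x: int) -> int:
    return x * 3


full = ToyFlow([1, 2, 3])
full.update(double)
full.full_refresh(triple)
print(full.table)   # [3, 6, 9]

reset = ToyFlow([1, 2, 3])
reset.update(double)
reset.reset_checkpoint(triple)
print(reset.table)  # [2, 4, 6, 3, 6, 9]
```

In this append-only model the reset leaves the earlier rows next to the reprocessed ones, so it is worth checking how a given flow treats records it has already written before resetting its checkpoint.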

Resetting checkpoints is triggered through the Lakeflow Spark Declarative Pipelines REST API. For the steps, see Start a pipeline update to clear selective streaming flows' checkpoints.
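As a rough sketch of what such a request might look like, the following builds (but does not send) a start-update call using only the Python standard library. The endpoint path `/api/2.0/pipelines/{pipeline_id}/updates` and the `reset_checkpoint_selection` field are assumptions based on the pipelines REST API and should be verified against the linked steps. The workspace URL, pipeline ID, token, and flow names are placeholders.

```python
import json
import urllib.request
from typing import List


def build_reset_checkpoints_request(
    host: str, pipeline_id: str, token: str, flows: List[str]
) -> urllib.request.Request:
    """Build a POST request that starts an update and resets the given flows' checkpoints."""
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates"
    # Assumed request field: the list of flow names whose checkpoints are cleared.
    body = json.dumps({"reset_checkpoint_selection": flows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; substitute a real workspace URL, pipeline ID, and token.
request = build_reset_checkpoints_request(
    "https://example.cloud.databricks.com",
    "1234-abcd",
    "PLACEHOLDER_TOKEN",
    ["flow_a", "flow_b"],
)
print(request.get_method(), request.full_url)
# Sending is left out on purpose: urllib.request.urlopen(request) would start the update.
```

Building the request separately from sending it makes the payload easy to inspect before a reset is actually triggered.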

Additional resources