REPLACE WHERE flows in Databricks SQL

Beta

REPLACE WHERE flows for streaming tables in Databricks SQL are in Beta.

This page describes how to use REPLACE WHERE flows to recompute and overwrite a targeted subset of a streaming table in Databricks SQL without reprocessing your entire table history. REPLACE WHERE flows handle late-arriving data, upstream reprocessing, schema evolution, and backfills.

With a REPLACE WHERE flow, you define a predicate on the target table. All rows matching the predicate are deleted, then the source query is re-evaluated for that predicate range and the new results are inserted. Rows that don't match the predicate are left untouched.
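Conceptually, each refresh of a REPLACE WHERE flow behaves like a delete followed by an insert over the same predicate range. The following sketch is illustrative only (the table and column names are hypothetical, and this is not the literal mechanism the engine uses):

SQL
-- Illustrative approximation of one refresh of a REPLACE WHERE flow
DELETE FROM sales_agg
WHERE date >= date_add(current_date(), -7);

INSERT INTO sales_agg
SELECT date, region, SUM(amount) AS total
FROM sales_raw
WHERE date >= date_add(current_date(), -7)
GROUP BY ALL;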

For the full feature reference, including Python syntax and predicate overrides for backfills, see REPLACE WHERE flows.

Requirements

Before you create a REPLACE WHERE flow in Databricks SQL, confirm the following:

  • Your streaming table must use the PREVIEW channel. Set the channel using the pipelines.channel table property:

    SQL
    CREATE STREAMING TABLE st_preview
    TBLPROPERTIES (pipelines.channel = 'PREVIEW')
    ...
  • Databricks recommends Unity Catalog and serverless compute for the best experience.

When to use REPLACE WHERE flows

REPLACE WHERE flows are well-suited for the following scenarios:

  • Incremental batch processing without streaming semantics: Process new rows in batches without managing streaming concepts such as watermarks.
  • Selective reprocessing: Recompute only rows that match a predicate while leaving all other rows untouched.
  • Scenarios beyond standard materialized view capabilities:
    • Target tables with longer retention than the source
    • Preventing recomputation when a dimension table changes
    • Schema evolution without recomputing entire history

Predicate design guidelines

Avoid REPLACE WHERE predicates on aggregates or derived columns. For example, a predicate like total_sales > 100000 where total_sales is a SUM() requires the engine to recompute the aggregation for all partitions on every run. Use predicates on base columns such as date or region so the engine can push down the filter to the source and process only the relevant data.
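As an illustrative sketch of the difference (table and column names are hypothetical), compare the two predicate styles:

SQL
-- Avoid: predicate on a derived aggregate forces recomputation across partitions
-- FLOW REPLACE WHERE total_sales > 100000 BY NAME ...

-- Prefer: predicate on a base column the engine can push down to the source
CREATE OR REFRESH STREAMING TABLE sales_by_region
FLOW REPLACE WHERE date >= date_add(current_date(), -7) BY NAME
SELECT date, region, SUM(amount) AS total_sales
FROM sales_raw
GROUP BY ALL;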

Create a REPLACE WHERE flow

Use the FLOW REPLACE WHERE clause inline with CREATE OR REFRESH STREAMING TABLE:

SQL
CREATE OR REFRESH STREAMING TABLE orders_enriched
SCHEDULE EVERY 1 DAY
FLOW REPLACE WHERE date >= date_add(current_date(), -7) BY NAME
SELECT
  o.order_id,
  o.date,
  o.region,
  p.product_name,
  o.qty,
  o.price
FROM orders_fct o
JOIN product_dim p
  ON o.product_id = p.product_id;

During refresh, all rows in the target table that match the predicate are deleted, the source query is recomputed for that same predicate range, and the new results are inserted. In this example, all rows from the last 7 days are deleted from orders_enriched and recomputed from the source query.

You don't need to add the predicate to the source query. The pipeline engine automatically applies it when reading from the source.

note

BY NAME is required. It matches columns by name rather than by position.

note

The CREATE FLOW syntax and Python syntax are not supported for streaming tables created in Databricks SQL. Define the REPLACE WHERE clause inline in the CREATE OR REFRESH STREAMING TABLE statement.

Backfill with DML statements

Predicate overrides are not supported for streaming tables created in Databricks SQL. To perform backfills, such as loading historical data, correcting a column for a specific period, or loading from a legacy table, run DML statements directly on the target table.

SQL
INSERT INTO orders_enriched
SELECT *
FROM orders_enriched_legacy
WHERE date < '2025-01-01';

Rows inserted through DML are not subject to the REPLACE WHERE predicate and persist across scheduled refreshes unless they fall within the predicate range of a future run.

For pipelines that support predicate overrides, see Perform a backfill with predicate overrides.

Full refresh behavior

A full refresh on a REPLACE WHERE flow clears the table and re-executes the source query only for the predicate range, not for the table's full history. Review the warning below before running a full refresh.

warning

A full refresh clears all existing data and re-executes the flow using only its defined predicate. If a table has been refreshing for a year with a 7-day predicate, a full refresh results in the table containing only the last 7 days of data. All older rows are permanently deleted.

SQL
REFRESH STREAMING TABLE orders_enriched FULL;

To prevent full refreshes on a table, set the table property pipelines.reset.allowed to false:

SQL
CREATE OR REFRESH STREAMING TABLE orders_enriched
TBLPROPERTIES (pipelines.reset.allowed = 'false')
FLOW REPLACE WHERE date >= date_add(current_date(), -7) BY NAME
...

Examples

The following examples show common REPLACE WHERE flow patterns in Databricks SQL.

Keep historical aggregates from a limited-retention source

This example maintains daily aggregates indefinitely, even after raw data ages out of the source table, which has a 3-day retention period:

SQL
CREATE OR REFRESH STREAMING TABLE events_agg
FLOW REPLACE WHERE date >= date_add(current_date(), -3) BY NAME
SELECT
  date,
  key,
  SUM(val) AS agg
FROM events_raw
GROUP BY ALL;

Prevent recomputation when a dimension table changes

This example keeps historical fact rows unchanged when dimension attributes change:

SQL
CREATE OR REFRESH STREAMING TABLE fact_dim_join
FLOW REPLACE WHERE f.date >= date_add(current_date(), -1) BY NAME
SELECT
  f.date,
  f.user_id,
  d.region,
  f.revenue
FROM fact_table f
JOIN dim_users d
  ON f.user_id = d.user_id;

If a user's region changes, only recent rows are recomputed. Historical rows retain the region value at the time they were written.

Add a new metric without recomputing full history

This example shows how to evolve a table definition and backfill only a targeted range:

  1. Define the initial table:

    SQL
    CREATE OR REFRESH STREAMING TABLE clickstream_daily
    FLOW REPLACE WHERE event_date >= date_add(current_date(), -7) BY NAME
    SELECT
      event_date,
      page_id,
      COUNT(*) AS clicks
    FROM clickstream_raw
    GROUP BY ALL;
  2. Update the query to add uniq_users:

    SQL
    CREATE OR REFRESH STREAMING TABLE clickstream_daily
    FLOW REPLACE WHERE event_date >= date_add(current_date(), -7) BY NAME
    SELECT
      event_date,
      page_id,
      COUNT(*) AS clicks,
      COUNT(DISTINCT user_id) AS uniq_users
    FROM clickstream_raw
    GROUP BY ALL;
  3. Backfill the new metric using DML:

    SQL
    INSERT INTO clickstream_daily
    SELECT
      event_date,
      page_id,
      COUNT(*) AS clicks,
      COUNT(DISTINCT user_id) AS uniq_users
    FROM clickstream_raw
    WHERE event_date BETWEEN '2026-01-01' AND '2026-01-30'
    GROUP BY ALL;

    Rows older than the backfilled range contain NULL for uniq_users.