Run an update in Lakeflow Declarative Pipelines

This article explains pipeline updates and provides details on how to trigger an update.

What is a pipeline update?

After you create a pipeline and are ready to run it, you start an update. A pipeline update does the following:

Starts a cluster with the correct configuration.
Discovers all the defined tables and views and checks for analysis errors such as invalid column names, missing dependencies, and syntax errors.
Creates or updates tables and views with the most recent data available.

Using a dry run, you can check for problems in a pipeline's source code without waiting for tables to be created or updated. This feature is useful when developing or testing pipelines because it lets you quickly find and fix errors in your pipeline, such as incorrect table or column names.

How are pipeline updates triggered?

Use one of the following options to start pipeline updates:

Update trigger	Details
Manual	You can manually trigger pipeline updates from the Lakeflow Pipelines Editor or the pipelines list. See Manually trigger a pipeline update.
Scheduled	You can schedule updates for pipelines using jobs. See Pipeline task for jobs.
Programmatic	You can programmatically trigger updates using third-party tools, APIs, and CLIs. See Run Lakeflow Declarative Pipelines in a workflow and Pipeline API.
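
As a minimal sketch of the programmatic option, the following Bash functions start an update by calling the updates request of the Pipelines REST API with curl. The environment variables DATABRICKS_HOST, DATABRICKS_TOKEN, and PIPELINE_ID and the function names are placeholders assumed for this example. The full_refresh field comes from the Pipeline API reference rather than from this article, so check it against the API version in your workspace.

```shell
#!/usr/bin/env bash
# Sketch: start a pipeline update through the REST API.
# DATABRICKS_HOST, DATABRICKS_TOKEN, and PIPELINE_ID are placeholder
# environment variables assumed for this example.

# Build the JSON body for the updates request.
# $1 is "true" for a full refresh; any other value gives a default refresh.
update_payload() {
  if [ "$1" = "true" ]; then
    printf '{"full_refresh": true}'
  else
    printf '{"full_refresh": false}'
  fi
}

# Send the request to the updates endpoint of the pipeline.
start_update() {
  curl -sS -X POST \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(update_payload "$1")" \
    "https://${DATABRICKS_HOST}/api/2.0/pipelines/${PIPELINE_ID}/updates"
}
```

With the three variables exported, `start_update false` would request a default refresh and `start_update true` a full refresh.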

Manually trigger a pipeline update

Use one of the following options to manually trigger a pipeline update:

Run the full pipeline, or a subset of the pipeline (a single source file, or a single table), from the Lakeflow Pipelines Editor. For more information, see Run pipeline code.
Run the full pipeline from the Jobs & Pipelines list. Click in the same row as the pipeline in the list.
From the pipeline monitoring page, click the button.

note

The default behavior for manually triggered pipeline updates is to refresh all datasets defined in the pipeline.

Pipeline refresh semantics

The following table describes the default refresh, full refresh, and reset checkpoints behavior for materialized views and streaming tables:

Update type	Materialized view	Streaming table
Refresh (default)	Updates results to reflect the current results for the defining query. The pipeline evaluates the cost of each approach and performs an incremental refresh if that is more cost-efficient.	Processes new records through logic defined in streaming tables and flows.
Full refresh	Updates results to reflect the current results for the defining query.	Clears data from streaming tables, clears state information (checkpoints) from flows, and reprocesses all records from the data source.
Reset streaming flow checkpoints	Not applicable to materialized views.	Clears state information (checkpoints) from flows but does not clear data from streaming tables, and reprocesses all records from the data source.

By default, all materialized views and streaming tables in a pipeline refresh with each update. You can optionally omit tables from updates using the following features:

Select tables for refresh: Use this UI to add or remove materialized views and streaming tables before running an update. See Start a pipeline update for selected tables.
Refresh failed tables: Start an update for failed materialized views and streaming tables, including downstream dependencies. See Start a pipeline update for failed tables.

Both of these features support default refresh semantics or full refresh. You can optionally use the Select tables for refresh dialog to exclude additional tables when running a refresh for failed tables.

For streaming tables, you can choose to clear the streaming checkpoints for selected flows and not the data from the associated streaming tables. To clear the checkpoints for selected flows, use the Databricks REST API to start a refresh. See Start a pipeline update to clear selective streaming flows' checkpoints.

Should I use a full refresh?

Databricks recommends running full refreshes only when necessary. A full refresh always reprocesses all records from the specified data sources through the logic that defines the dataset. The time and resources needed to complete a full refresh scale with the size of the source data.

Materialized views return the same results whether a default or a full refresh is used. Using a full refresh with streaming tables resets all processing state and checkpoint information, and can result in dropped records if input data is no longer available.

Databricks recommends a full refresh only when the input data sources contain the data needed to recreate the desired state of the table or view. Consider the following scenarios where input source data is no longer available, and the outcome of running a full refresh in each:

Data source	Reason input data is absent	Outcome of full refresh
Kafka	Short retention threshold	Records no longer present in the Kafka source are dropped from the target table.
Files in object storage	Lifecycle policy	Data files no longer present in the source directory are dropped from the target table.
Records in a table	Deleted for compliance	Only records present in the source table are processed.

To prevent full refreshes from being run on a table or view, set the table property pipelines.reset.allowed to false. See Lakeflow Declarative Pipelines table properties. You can also use an append flow to append data to an existing streaming table without requiring a full refresh.

Start a pipeline update for selected tables

You can optionally reprocess data for only selected tables in your pipeline. For example, during development you might change only a single table and want to reduce testing time, or a pipeline update might fail and you want to refresh only the failed tables.

The Lakeflow Pipelines Editor has options for reprocessing a source file, selected tables, or a single table. For details, see Run pipeline code.

Start a pipeline update for failed tables

If a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies.

note

Excluded tables are not refreshed, even if they depend on a failed table.

To update failed tables, on the pipeline monitoring page, click Refresh failed tables.

To update only selected failed tables from the pipeline monitoring page:

Click next to the Refresh failed tables button and click Select tables for refresh. The Select tables for refresh dialog appears.
To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.
Click Refresh selection.

note
The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data already ingested for the selected tables, click next to the Refresh selection button and click Full Refresh selection.

Start a pipeline update to clear selective streaming flows' checkpoints

You can optionally reprocess data for selected streaming flows in your pipeline without clearing any already ingested data.

note

Flows that are not selected are run using a REFRESH update. You can also specify full_refresh_selection or refresh_selection to selectively refresh other tables.

To start an update that clears the checkpoints for selected streaming flows, use the updates request in the Lakeflow Declarative Pipelines REST API. The following example uses the curl command to call the updates request to start a pipeline update:

Bash
# Replace each <...> placeholder before running the command.
curl -X POST \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "reset_checkpoint_selection": ["<streaming-flow-1>", "<streaming-flow-2>", ...]
  }' \
  "https://<your-databricks-instance>/api/2.0/pipelines/<your-pipeline-id>/updates"
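
The note in this section says that refresh_selection or full_refresh_selection can accompany the checkpoint reset. The following sketch, under the assumption that both fields are passed as JSON arrays of names in the same request body, builds such a body in Bash. The helper names and the flow and table names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build one updates request body that clears checkpoints for some
# flows and refreshes selected tables with default semantics.

# Join names into a JSON array of strings.
# Assumes the names contain no double quotes, backslashes, or whitespace.
json_array() {
  local out=""
  local name
  for name in "$@"; do
    out="${out:+$out, }\"$name\""
  done
  printf '[%s]' "$out"
}

# $1: space-separated flow names whose checkpoints are cleared.
# $2: space-separated table names to refresh.
reset_and_refresh_payload() {
  # Word splitting on $1 and $2 is intentional: each word is one name.
  printf '{"reset_checkpoint_selection": %s, "refresh_selection": %s}' \
    "$(json_array $1)" "$(json_array $2)"
}
```

For example, `reset_and_refresh_payload "orders_flow" "customers daily_sales"` produces a body that can be passed to the `-d` option of the curl command shown earlier in this section.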

Check a pipeline for errors without waiting for tables to update

Preview

The Lakeflow Declarative Pipelines Dry run feature is in Public Preview.

To check whether a pipeline's source code is valid without running a full update, use a dry run. A dry run resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during the dry run, such as incorrect table or column names, are reported in the UI.

To start a dry run, click on the pipeline details page next to Start and click Dry run.

After the dry run is complete, any errors are shown in the event tray in the bottom panel. Click the event tray to display the issues that were found. The event log shows only events related to the dry run, including details for any errors, and no metrics are displayed in the DAG.

You can see results for only the most recent dry run. If the dry run was the most recently run update, you can see the results by selecting it in the update history. If another update is run after the dry run, the results are no longer available in the UI.
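
This article describes starting a dry run from the UI. As a hedged sketch, a dry run might also be requested through the REST API, assuming the updates request accepts a validate_only field as described in the Pipeline API reference. The environment variables and function names are placeholders for this example.

```shell
#!/usr/bin/env bash
# Sketch: request a dry run through the REST API instead of the UI.
# Assumption: the updates request accepts "validate_only"; confirm this in
# the Pipeline API reference for your workspace.
# DATABRICKS_HOST, DATABRICKS_TOKEN, and PIPELINE_ID are placeholders.

# Build the JSON body that asks for validation only.
dry_run_payload() {
  printf '{"validate_only": true}'
}

# Send the request to the updates endpoint of the pipeline.
start_dry_run() {
  curl -sS -X POST \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(dry_run_payload)" \
    "https://${DATABRICKS_HOST}/api/2.0/pipelines/${PIPELINE_ID}/updates"
}
```

Calling `start_dry_run` with the three variables exported would submit the request. Any errors would still be reviewed in the event log.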

Development mode

Pipelines run from the Lakeflow Pipelines Editor run with development mode turned on. Scheduled pipelines run with development mode turned off by default. To test how the pipeline will run in production, you can turn development mode on or off interactively by choosing Run with different settings from the drop-down in the editor.

note

Pipelines created with the legacy notebook editor default to using development mode. You can check or change the setting by choosing Settings in the pipeline monitoring page. The monitoring page is available from the Jobs & Pipelines button on the left side of your workspace. You can also jump directly to the monitoring page from the pipeline editor by clicking the run results in the pipeline assets browser.

When you run your pipeline in development mode, the Lakeflow Declarative Pipelines system does the following:

Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting. See Configure classic compute for Lakeflow Declarative Pipelines.
Disables pipeline retries so you can immediately detect and fix errors.
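
As an illustration of the settings mentioned in the list above, the following Bash sketch prints a fragment of pipeline settings JSON. It assumes that development mode corresponds to a development field in the pipeline settings and that pipelines.clusterShutdown.delay is set under configuration with a duration string such as 60s. The function name is hypothetical, and the fragment is not a complete settings document.

```shell
#!/usr/bin/env bash
# Sketch: print a pipeline settings fragment that turns on development mode
# and sets a cluster shutdown delay.
# Assumptions: "development" and "configuration" are the relevant settings
# fields; verify both against the pipeline settings reference.

# $1: shutdown delay as a duration string, for example "60s".
dev_settings_fragment() {
  cat <<EOF
{
  "development": true,
  "configuration": {
    "pipelines.clusterShutdown.delay": "$1"
  }
}
EOF
}
```

For example, `dev_settings_fragment 60s` prints a fragment that could be merged into the JSON settings of a pipeline.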

With development mode turned off, the Lakeflow Declarative Pipelines system does the following:

Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
Retries execution in the event of specific errors, such as a failure to start a cluster.

note

Switching development mode on and off only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.
