What are pipelines?

A pipeline is the main unit of development and execution of Apache Spark™ Declarative Pipelines (SDP) in Lakeflow. A pipeline is a collection of source code files and a configuration. The source files declare datasets (streaming tables, materialized views, and views) along with the queries and flows that produce them. The configuration specifies how the pipeline runs and where data is stored.

A pipeline is the container for the flows, streaming tables, materialized views, and sinks that you define. While the pipeline runs, it analyzes the dependencies between these objects and orchestrates their order of execution and parallelization automatically. For details on the objects that a pipeline contains, see What are Lakeflow pipelines?. For a comparison of Lakeflow pipelines and Apache Spark™ Declarative Pipelines, see Apache Spark Declarative Pipelines.

Pipeline source code

Pipeline source code is written in Python or SQL. A single pipeline can mix Python and SQL source files, but each file can contain only one language. Because the pipeline analyzes dataset dependencies across all of its source files, you can organize source code across files in any order.

For language-specific development guidance, see Develop pipeline code with Python and Develop Lakeflow pipelines code with SQL.

Pipeline graph

Pipelines automatically infer dependencies between datasets and arrange them in a directed acyclic graph (DAG). The graph determines evaluation order: upstream datasets are computed before downstream ones. You can view and interact with the pipeline graph in the Lakeflow Pipelines Editor.
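
The ordering rule can be illustrated outside of a pipeline with a plain topological sort. The following is a minimal sketch that uses Python's standard `graphlib` module; it is not how Lakeflow builds its graph internally, and the dataset names (`raw_orders`, `customers`, `clean_orders`, `daily_revenue`) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical datasets, each mapped to the upstream datasets it reads from.
dependencies = {
    "raw_orders": set(),
    "customers": set(),
    "clean_orders": {"raw_orders"},
    "daily_revenue": {"clean_orders", "customers"},
}


def evaluation_order(deps):
    """Return one valid order in which upstream datasets precede downstream ones."""
    return list(TopologicalSorter(deps).static_order())


def parallel_stages(deps):
    """Group datasets into stages whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


order = evaluation_order(dependencies)
stages = parallel_stages(dependencies)
print(order)
print(stages)
```

In this sketch, datasets in the same stage could be computed in parallel, and the order in which the dictionary entries are written has no effect on the result, which mirrors why source files can be organized in any order.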

Pipeline updates

A pipeline update computes the current state of each dataset by:

Starting a cluster with the correct configuration.
Analyzing source files and building the dependency graph.
Computing or incrementally updating each dataset in dependency order.

Pipelines run in two modes:

Triggered: The pipeline runs once and stops when all datasets are up to date.
Continuous: The pipeline runs indefinitely and processes new data as it arrives.
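
The difference between the two modes can be sketched with a toy in-memory queue. This is an illustration only, not Lakeflow code: the function names, the `source` and `target` containers, and the `arrivals` argument are all hypothetical.

```python
from collections import deque


def process_available(source, target):
    """Move every record currently waiting in `source` into `target`."""
    moved = 0
    while source:
        target.append(source.popleft())
        moved += 1
    return moved


def run_triggered(source, target):
    """Triggered mode: process what is available now, then stop."""
    return process_available(source, target)


def run_continuous(source, target, arrivals):
    """Continuous mode: keep processing as new data arrives.

    A real continuous pipeline runs until it is stopped. This sketch ends when
    the hypothetical `arrivals` iterable is exhausted so that it terminates.
    """
    moved = process_available(source, target)
    for batch in arrivals:
        source.extend(batch)
        moved += process_available(source, target)
    return moved


# Triggered: a record that arrives after the run waits for the next update.
source, target = deque([1, 2]), []
run_triggered(source, target)
source.append(3)
triggered_result = list(target)
waiting = list(source)

# Continuous: later arrivals are picked up by the same run.
source2, target2 = deque([1, 2]), []
run_continuous(source2, target2, arrivals=[[3], [4, 5]])
continuous_result = list(target2)
```

After the triggered run, record `3` is still waiting in `source`, whereas the continuous run has already processed every batch that arrived.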

Updates that you trigger interactively from the editor are optimized for fast iteration: they reuse the cluster and disable automatic retries. See Update run behavior.

Pipeline types

The Jobs & Pipelines list includes more than the pipelines that you define as Lakeflow pipelines. Databricks runs several types of pipelines, and both the Jobs & Pipelines list and the pipeline monitoring page label each one with its type so that you can tell them apart. The following table maps each pipeline type to the `pipeline_type` value recorded in the event log:

Type in Jobs & Pipelines	`pipeline_type` in event log	Description
ETL	`WORKSPACE`	A Lakeflow pipeline. See Spark Declarative Pipelines.
Ingestion	`MANAGED_INGESTION`	A managed ingestion pipeline created with Lakeflow Connect. See Managed connectors in Lakeflow Connect.
MV/ST	`DBSQL`	A standalone pipeline. See Standalone pipelines.
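
When you process event log records programmatically, the mapping in the table above can be kept as a small lookup. The sketch below assumes each record is a dictionary with a `pipeline_type` key; that record shape and the sample records are hypothetical, and only the three type values and labels come from the table.

```python
from collections import Counter

# Event log value -> label shown in Jobs & Pipelines (taken from the table above).
PIPELINE_TYPE_LABELS = {
    "WORKSPACE": "ETL",
    "MANAGED_INGESTION": "Ingestion",
    "DBSQL": "MV/ST",
}


def label_for(pipeline_type):
    """Translate an event log pipeline_type into its Jobs & Pipelines label."""
    return PIPELINE_TYPE_LABELS.get(pipeline_type, "Unknown")


def count_by_label(records):
    """Count hypothetical event log records per Jobs & Pipelines label."""
    return Counter(label_for(r.get("pipeline_type")) for r in records)


# Hypothetical sample records, for illustration only.
sample = [
    {"pipeline_type": "WORKSPACE"},
    {"pipeline_type": "DBSQL"},
    {"pipeline_type": "WORKSPACE"},
    {"pipeline_type": "MANAGED_INGESTION"},
]
counts = count_by_label(sample)
print(dict(counts))
```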

Standalone pipelines

You can create and manage streaming tables and materialized views outside of a Lakeflow pipeline as standalone pipelines. You can use Databricks SQL or Python to create and refresh standalone streaming tables and materialized views. They run on the same Databricks infrastructure and have the same processing semantics as they do in a Lakeflow pipeline. When you define a standalone streaming table or materialized view, flows are defined implicitly as part of the streaming table or materialized view definition.

For details, see Standalone pipelines.

Lakeflow Pipelines Editor

The Lakeflow Pipelines Editor is an IDE built for pipeline development. It provides:

A multi-file code editor for Python and SQL source files
A pipeline assets browser for organizing files and folders
An interactive pipeline graph showing dataset dependencies and state
Data previews for streaming tables and materialized views
Execution insights and an issues pane showing results from the latest run
Selective execution to refresh individual files or tables without running the full pipeline

The editor integrates with the Databricks platform and supports version control via Git folders. For step-by-step guidance, see Develop and debug ETL pipelines with the Lakeflow Pipelines Editor.
