Develop DLT pipelines

Developing and testing pipeline code differs from other Apache Spark workloads. This article provides an overview of supported functionality, best practices, and considerations when developing pipeline code. For more recommendations and best practices, see Applying software development & DevOps best practices to DLT pipelines.

note

You must add source code to a pipeline configuration to validate code or run an update. See Configure a DLT pipeline.

What files are valid for pipeline source code?

DLT pipeline code can be Python or SQL. You can have a mix of Python and SQL source code files backing a single pipeline, but each file can only contain one language. See Develop pipeline code with Python and Develop pipeline code with SQL.

You can use notebooks and workspace files when specifying source code for a pipeline. Workspace files represent Python or SQL scripts authored in your preferred IDE or the Databricks file editor. See What are workspace files?.

If you develop Python code as modules or libraries, you must install and import the code and then call methods from a Python notebook or workspace file configured as source code. See Manage Python dependencies for DLT pipelines.

note

If you need to use arbitrary SQL commands in a Python notebook, you can use the syntax pattern spark.sql("<QUERY>") to run SQL from Python code.

Unity Catalog functions allow you to register arbitrary Python user-defined functions for use in SQL. See User-defined functions (UDFs) in Unity Catalog.
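
For example, a minimal sketch of the spark.sql pattern in Python pipeline source code; the dataset name and query below are hypothetical:

Python
import dlt

@dlt.table(comment="Dataset defined by running SQL from Python")
def sql_defined_dataset():
    # spark.sql returns a DataFrame, so its result can back a dataset definition.
    # The query and source table name are illustrative only.
    return spark.sql("SELECT id, amount FROM catalog.schema.source_table WHERE amount > 0")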

Overview of DLT development features

DLT extends and leverages many Databricks features, and introduces new features and concepts. The following list provides a brief overview of the concepts and features that support pipeline code development:

  • Development mode: New pipelines are configured to run in development mode by default. Databricks recommends using development mode for interactive development and testing. See Development and production modes.
  • Validate: A Validate update verifies the correctness of pipeline source code without running an update on any tables. See Check a pipeline for errors without waiting for tables to update.
  • Notebooks: Notebooks configured as source code for a DLT pipeline provide interactive options for validating code and running updates. See Develop and debug DLT pipelines in notebooks.
  • Parameters: Leverage parameters in source code and pipeline configurations to simplify testing and extensibility, as shown in the sketch after this list. See Use parameters with DLT pipelines.
  • Databricks Asset Bundles: Databricks Asset Bundles allow you to move pipeline configurations and source code between workspaces. See Convert a DLT pipeline into a Databricks Asset Bundle project.
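
As a sketch of the Parameters item above, pipeline configuration values can be read from Python source code with spark.conf.get; the configuration key, default value, and orders table below are hypothetical:

Python
import dlt

@dlt.table(comment="Orders filtered by a configurable start date")
def filtered_orders():
    # "mypipeline.start_date" is a hypothetical key set in the pipeline configuration;
    # the second argument is the default used when the key is not set.
    start_date = spark.conf.get("mypipeline.start_date", "2025-01-01")
    # "orders" is a hypothetical upstream table.
    return spark.read.table("orders").where(f"order_date >= '{start_date}'")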

Create sample datasets for development and testing

Databricks recommends creating development and test datasets to test pipeline logic with expected data and potentially malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:

  • Select a subset of data from a production dataset.
  • Use anonymized or artificially generated data for sources containing PII.
  • Create test data with well-defined outcomes based on downstream transformation logic.
  • Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations.

For example, if you have a notebook that defines a dataset using the following code:

SQL
CREATE OR REFRESH STREAMING TABLE input_data
AS SELECT * FROM STREAM read_files(
  "/production/data",
  format => "json"
)

You could create a sample dataset containing specific records using a query like the following:

SQL
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 AS sensor_reading

The following example demonstrates filtering published data to create a subset of the production data for development or testing:

SQL
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT * FROM prod.input_data
WHERE date > current_date() - INTERVAL 1 DAY

To use these different datasets, create multiple pipelines that include the notebooks implementing the shared transformation logic. Each pipeline reads data from the input_data dataset but includes the notebook that creates the dataset specific to its environment.
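
For example, a sketch of shared transformation logic that reads input_data without depending on which environment-specific definition the pipeline includes (the dataset and column names are illustrative):

Python
import dlt

@dlt.table(comment="Shared transformation logic used by the development, test, and production pipelines")
def cleaned_sensor_data():
    # input_data may be the production stream, a hand-written sample, or a filtered subset,
    # depending on which source code file the pipeline is configured to include.
    return spark.read.table("input_data").where("sensor_reading IS NOT NULL")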

How do DLT datasets process data?

The following list describes how materialized views, streaming tables, and views process records through their defined queries:

  • Streaming table: Each record is processed exactly once. This assumes an append-only source.
  • Materialized view: Records are processed as required to return accurate results for the current data state. Materialized views should be used for data processing tasks such as transformations, aggregations, or pre-computing slow queries and frequently used computations.
  • View: Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.

Declare your first datasets in DLT

DLT introduces new syntax for Python and SQL. To learn the basics of pipeline syntax, see Develop pipeline code with Python and Develop pipeline code with SQL.

note

DLT separates dataset definitions from update processing, and DLT notebooks are not intended for interactive execution.
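
As a minimal Python sketch of the three dataset types (the source table, dataset names, and columns are illustrative): a @dlt.table function that returns a streaming DataFrame defines a streaming table, one that returns a batch DataFrame defines a materialized view, and @dlt.view defines a view:

Python
import dlt

@dlt.table(comment="Streaming table: the function returns a streaming DataFrame")
def raw_events():
    # "catalog.schema.events_source" is a hypothetical upstream Delta table.
    return spark.readStream.table("catalog.schema.events_source")

@dlt.view(comment="View: recomputed each time another dataset in the pipeline queries it")
def recent_events():
    return spark.read.table("raw_events").where("event_date >= date_sub(current_date(), 7)")

@dlt.table(comment="Materialized view: the function returns a batch DataFrame")
def daily_event_counts():
    return spark.read.table("recent_events").groupBy("event_date").count()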

How do you configure DLT pipelines?

The settings for DLT pipelines fall into two broad categories:

  1. Configurations that define a collection of notebooks or files (known as source code) that use DLT syntax to declare datasets.
  2. Configurations that control pipeline infrastructure, dependency management, how updates are processed, and how tables are saved in the workspace.

Most configurations are optional, but some require careful attention, especially when configuring production pipelines. These include the following:

  • To make data available outside the pipeline, you must declare a target schema to publish to the Hive metastore, or a target catalog and target schema to publish to Unity Catalog.
  • Data access permissions are configured through the cluster used for execution. Ensure your cluster has appropriate permissions configured for data sources and the target storage location, if specified.

For details on using Python and SQL to write source code for pipelines, see DLT SQL language reference and DLT Python language reference.

For more on pipeline settings and configurations, see Configure a DLT pipeline.

Deploy your first pipeline and trigger updates

Before processing data with DLT, you must configure a pipeline. After a pipeline is configured, you can trigger an update to calculate results for each dataset in your pipeline. To get started using DLT pipelines, see Tutorial: Run your first DLT pipeline.

What is a pipeline update?

Pipelines deploy infrastructure and recompute data state when you start an update. An update does the following:

  • Starts a cluster with the correct configuration.
  • Discovers all the tables and views defined and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
  • Creates or updates tables and views with the most recent data available.

Pipelines can be run continuously or on a schedule depending on your use case’s cost and latency requirements. See Run an update on a DLT pipeline.

Ingest data with DLT

DLT supports all data sources available in Databricks.

Databricks recommends using streaming tables for most ingestion use cases. For files arriving in cloud object storage, Databricks recommends Auto Loader. You can directly ingest data with DLT from most message buses.
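
For example, a minimal sketch of Auto Loader ingestion into a streaming table; the source path is hypothetical:

Python
import dlt

@dlt.table(comment="JSON files ingested incrementally from cloud storage with Auto Loader")
def raw_orders():
    # "/Volumes/main/landing/orders" is a hypothetical path where new files arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/orders")
    )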

For more information about configuring access to cloud storage, see Cloud storage configuration.

For formats not supported by Auto Loader, you can use Python or SQL to query any format supported by Apache Spark. See Load data with DLT.

Monitor and enforce data quality

You can use expectations to specify data quality controls on the contents of a dataset. Unlike a CHECK constraint in a traditional database, which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store both data that you expect to be messy and data that must meet strict quality requirements. See Manage data quality with pipeline expectations.
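
For example, a minimal sketch of an expectation that drops records failing a quality check (the dataset and column names are illustrative):

Python
import dlt

@dlt.table(comment="Sensor readings that pass a basic quality check")
@dlt.expect_or_drop("valid_reading", "sensor_reading IS NOT NULL AND sensor_reading > 0")
def validated_sensor_data():
    return spark.read.table("input_data")

Related decorators such as dlt.expect (retain violating records and report metrics) and dlt.expect_or_fail (stop the update on violation) cover the other enforcement modes.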

DLT extends the functionality of Delta Lake. Because tables created and managed by DLT are Delta tables, they have the same guarantees and features provided by Delta Lake. See What is Delta Lake?.

DLT adds several table properties in addition to the many table properties that can be set in Delta Lake. See DLT properties reference and Delta table properties reference.

How tables are created and managed by DLT

Databricks automatically manages tables created with DLT, determining how updates need to be processed to correctly compute the current state of a table and performing a number of maintenance and optimization tasks.

For most operations, you should allow DLT to process all updates, inserts, and deletes to a target table. For details and limitations, see Retain manual deletes or updates.

Maintenance tasks performed by DLT

DLT performs maintenance tasks within 24 hours of a table being updated. Maintenance can improve query performance and reduce cost by removing old versions of tables. By default, the system performs a full OPTIMIZE operation followed by VACUUM. You can disable OPTIMIZE for a table by setting pipelines.autoOptimize.managed = false in the table properties for the table. Maintenance tasks are performed only if a pipeline update has run in the 24 hours before the maintenance tasks are scheduled.
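
For example, a sketch of disabling automatic OPTIMIZE for a table defined in Python; the dataset names are illustrative:

Python
import dlt

@dlt.table(
    comment="Table excluded from automatic OPTIMIZE maintenance",
    table_properties={"pipelines.autoOptimize.managed": "false"},
)
def archived_events():
    return spark.read.table("raw_events")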

Delta Live Tables is now DLT

The product formerly known as Delta Live Tables is now DLT.

Limitations

For a list of limitations, see DLT Limitations.

For a list of requirements and limitations that are specific to using DLT with Unity Catalog, see Use Unity Catalog with your DLT pipelines.
