Develop Lakeflow Declarative Pipelines
Developing and testing pipeline code differs from working with other Apache Spark workloads. This article provides an overview of supported functionality, best practices, and considerations when developing pipeline code. For more recommendations and best practices, see Applying software development & DevOps best practices to Lakeflow Declarative Pipelines.
You must add source code to a pipeline configuration to validate code or run an update. See Configure Lakeflow Declarative Pipelines.
What files are valid for pipeline source code?
Lakeflow Declarative Pipelines code can be Python or SQL. You can have a mix of Python and SQL source code files backing a single pipeline, but each file can only contain one language. See Develop pipeline code with Python and Develop pipeline code with SQL.
Source files for pipelines are stored in your workspace. Workspace files represent Python or SQL scripts authored in the Lakeflow Pipelines editor. You can also edit the files locally in your preferred IDE, and sync to the workspace. For information about files in the workspace, see What are workspace files?. For information about editing with the Lakeflow Pipelines Editor, see Develop and debug ETL pipelines with the Lakeflow Pipelines Editor. For information about authoring code in a local IDE, see Develop Lakeflow Declarative Pipelines code in your local development environment.
If you develop Python code as modules or libraries, you must install and import the code and then call methods from a Python file configured as source code. See Manage Python dependencies for Lakeflow Declarative Pipelines.
If you need to use arbitrary SQL commands in a Python notebook, you can use the syntax pattern `spark.sql("<QUERY>")` to run SQL as Python code.
Unity Catalog functions allow you to register arbitrary Python user-defined functions for use in SQL. See User-defined functions (UDFs) in Unity Catalog.
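For example, a Python UDF registered in Unity Catalog can be called from pipeline SQL source code. The following is a minimal sketch; the catalog, schema, function, and table names are placeholders:

```sql
-- Register a Python UDF in Unity Catalog (run once, outside the pipeline).
-- The catalog, schema, and function names are placeholders.
CREATE OR REPLACE FUNCTION main.default.redact_email(email STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
return email.split("@")[0] + "@***" if email and "@" in email else email
$$;

-- Call the UDF from pipeline SQL source code.
CREATE OR REFRESH MATERIALIZED VIEW masked_users AS
SELECT user_id, main.default.redact_email(email) AS email
FROM main.default.users;
```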
Overview of Lakeflow Declarative Pipelines development features
Lakeflow Declarative Pipelines extends and leverages many Databricks features, and introduces new features and concepts. The following table provides a brief overview of concepts and features that support pipeline code development:
| Feature | Description |
|---|---|
| Development mode | Pipelines run interactively (by starting an update from the Lakeflow Pipelines Editor) use development mode. New pipelines are configured to run with development mode off when they run automatically through a schedule or automated trigger. See Development mode. |
| Dry run | A dry run update verifies the correctness of pipeline source code without running an update on any tables. See Check a pipeline for errors without waiting for tables to update. |
| Lakeflow Pipelines Editor | Python and SQL files configured as source code for Lakeflow Declarative Pipelines provide interactive options for validating code and running updates. See Develop and debug ETL pipelines with the Lakeflow Pipelines Editor. |
| Parameters | Use parameters in source code and pipeline configurations to simplify testing and extensibility, as shown in the example after this table. See Use parameters with Lakeflow Declarative Pipelines. |
| Databricks Asset Bundles | Databricks Asset Bundles allow you to move pipeline configurations and source code between workspaces. See Convert Lakeflow Declarative Pipelines into a Databricks Asset Bundle project. |
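As an illustration of parameters in SQL source code, the following sketch assumes a configuration key named `source_catalog` has been defined in the pipeline settings; the key, schema, table, and column names are placeholders:

```sql
-- Reads from a catalog supplied as a pipeline configuration value, so the
-- same source code can point at development, test, or production data.
-- The configuration key source_catalog is a placeholder for this example.
CREATE OR REFRESH MATERIALIZED VIEW orders_cleaned AS
SELECT order_id, amount, order_date
FROM ${source_catalog}.sales.orders
WHERE amount IS NOT NULL;
```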
Create sample datasets for development and testing
Databricks recommends creating development and test datasets to test pipeline logic with expected data and potentially malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:
- Select a subset of data from a production dataset.
- Use anonymized or artificially generated data for sources containing PII. To see a tutorial that uses the `faker` library to generate data for testing, see Tutorial: Build an ETL pipeline using change data capture with Lakeflow Declarative Pipelines.
- Create test data with well-defined outcomes based on downstream transformation logic.
- Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations (see the malformed-records sketch after the examples below).
For example, if you have a file that defines a dataset using the following code:
```sql
CREATE OR REFRESH STREAMING TABLE input_data
AS SELECT * FROM STREAM read_files(
  "/production/data",
  format => "json")
```
You could create a sample dataset containing a subset of the records using a query like the following:
```sql
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 AS sensor_reading
```
The following example demonstrates filtering published data to create a subset of the production data for development or testing:
```sql
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT * FROM prod.input_data
WHERE date > current_date() - INTERVAL 1 DAY
```
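To exercise the last approach in the list above, you can also define a version of the dataset that deliberately includes malformed records. The following minimal sketch reuses the column names from the earlier examples; the specific bad values are illustrative:

```sql
-- Test data that intentionally violates schema and value expectations,
-- so downstream expectations and error handling can be verified.
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT NULL AS date, 21.5 AS sensor_reading UNION ALL          -- missing date
SELECT "not-a-date" AS date, -999.0 AS sensor_reading          -- malformed date, implausible reading
```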
To use these different datasets, create multiple pipelines with the source code implementing the transformation logic. Each pipeline can read data from the `input_data` dataset but is configured to include the file that creates the dataset specific to the environment.
How do Lakeflow Declarative Pipelines datasets process data?
The following table describes how materialized views, streaming tables, and views process data:
| Dataset type | How are records processed through defined queries? |
|---|---|
| Streaming table | Each record is processed exactly once. This assumes an append-only source. |
| Materialized view | Records are processed as required to return accurate results for the current data state. Materialized views should be used for data processing tasks such as transformations, aggregations, or pre-computing slow queries and frequently used computations. Results are cached between updates. |
| View | Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets. |
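The following is a minimal SQL sketch of the three dataset types, reusing the dataset and column names from the earlier examples; the exact view syntax can vary by release:

```sql
-- Streaming table: each arriving record is ingested exactly once from an append-only source.
CREATE OR REFRESH STREAMING TABLE raw_readings
AS SELECT * FROM STREAM read_files("/production/data", format => "json");

-- View: recomputed each time it is queried; useful for intermediate logic that
-- should not be published. (Older DLT syntax used CREATE TEMPORARY LIVE VIEW.)
CREATE TEMPORARY VIEW nonnull_readings AS
SELECT * FROM raw_readings WHERE sensor_reading IS NOT NULL;

-- Materialized view: recomputed as needed to reflect the current data state,
-- with results cached between updates.
CREATE OR REFRESH MATERIALIZED VIEW readings_per_day AS
SELECT date, COUNT(*) AS reading_count
FROM nonnull_readings
GROUP BY date;
```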
Declare your first datasets in Lakeflow Declarative Pipelines
Lakeflow Declarative Pipelines introduces new syntax for Python and SQL. To learn the basics of pipeline syntax, see Develop pipeline code with Python and Develop pipeline code with SQL.
Lakeflow Declarative Pipelines separates dataset definitions from update processing, and pipeline source code is not intended for interactive execution.
How do you configure Lakeflow Declarative Pipelines?
The settings for Lakeflow Declarative Pipelines fall into two broad categories:
- Configurations that define a collection of files (known as source code) that use Lakeflow Declarative Pipelines syntax to declare datasets.
- Configurations that control pipeline infrastructure, dependency management, how updates are processed, and how tables are saved in the workspace.
Most configurations are optional, but some require careful attention, especially when configuring production pipelines. These include the following:
- To make data available outside the pipeline, you must declare a target schema to publish to the Hive metastore, or a target catalog and target schema to publish to Unity Catalog.
- Data access permissions are configured through the cluster used for execution. Ensure your cluster has appropriate permissions configured for data sources and the target storage location, if specified.
For details on using Python and SQL to write source code for pipelines, see Lakeflow Declarative Pipelines SQL language reference and Lakeflow Declarative Pipelines Python language reference.
For more on pipeline settings and configurations, see Configure Lakeflow Declarative Pipelines.
Deploy your first pipeline and trigger updates
Before processing data with Lakeflow Declarative Pipelines, you must configure a pipeline. After a pipeline is configured, you can trigger an update to calculate results for each dataset in your pipeline. To get started using Lakeflow Declarative Pipelines, see Tutorial: Build an ETL pipeline using change data capture with Lakeflow Declarative Pipelines.
What is a pipeline update?
Pipelines deploy infrastructure and recompute data state when you start an update. An update does the following:
- Starts a cluster with the correct configuration.
- Discovers all the tables and views defined and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
- Creates or updates tables and views with the most recent data available.
Pipelines can be run continuously or on a schedule depending on your use case's cost and latency requirements. See Run an update in Lakeflow Declarative Pipelines.
Ingest data with Lakeflow Declarative Pipelines
Lakeflow Declarative Pipelines supports all data sources available in Databricks.
Databricks recommends using streaming tables for most ingestion use cases. For files arriving in cloud object storage, Databricks recommends Auto Loader. You can directly ingest data with Lakeflow Declarative Pipelines from most message buses.
For more information about configuring access to cloud storage, see Cloud storage configuration.
For formats not supported by Auto Loader, you can use Python or SQL to query any format supported by Apache Spark. See Load data with Lakeflow Declarative Pipelines.
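As an illustration of ingesting from a message bus in SQL source code, the following sketch uses the `read_kafka` table-valued function; the broker address, topic name, and selected columns are placeholders, and the options you need depend on your source:

```sql
-- Streaming table fed from a Kafka topic; the broker address and topic name are placeholders.
CREATE OR REFRESH STREAMING TABLE raw_events AS
SELECT
  CAST(value AS STRING) AS payload,
  timestamp
FROM STREAM read_kafka(
  bootstrapServers => "kafka-broker:9092",
  subscribe => "events"
);
```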
Monitor and enforce data quality
You can use expectations to specify data quality controls on the contents of a dataset. Unlike a `CHECK` constraint in a traditional database, which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store data that you expect to be messy and data that must meet strict quality requirements. See Manage data quality with pipeline expectations.
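For example, an expectation that drops records with missing or non-positive readings might look like the following sketch; the dataset, constraint, and column names continue the earlier examples:

```sql
-- Rows that fail the expectation are dropped, and violation counts are
-- recorded in the pipeline event log.
CREATE OR REFRESH MATERIALIZED VIEW clean_readings (
  CONSTRAINT valid_reading EXPECT (sensor_reading IS NOT NULL AND sensor_reading > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM input_data;
```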
How are Lakeflow Declarative Pipelines and Delta Lake related?
Lakeflow Declarative Pipelines extends the functionality of Delta Lake. Because tables created and managed by Lakeflow Declarative Pipelines are Delta tables, they have the same guarantees and features provided by Delta Lake. See What is Delta Lake in Databricks?.
Lakeflow Declarative Pipelines adds several table properties in addition to the many table properties that can be set in Delta Lake. See Lakeflow Declarative Pipelines properties reference and Delta table properties reference.
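For instance, pipeline-specific and Delta Lake table properties can be set together in a `TBLPROPERTIES` clause. This is a minimal sketch; the property values shown are illustrative, so confirm each property against the references above before relying on it:

```sql
-- Sets a Lakeflow Declarative Pipelines property (pipelines.reset.allowed)
-- alongside a standard Delta Lake property (delta.appendOnly).
CREATE OR REFRESH STREAMING TABLE raw_readings
TBLPROPERTIES (
  "pipelines.reset.allowed" = "false",  -- prevent a full refresh from clearing this table
  "delta.appendOnly" = "true"           -- enforce append-only writes at the Delta layer
)
AS SELECT * FROM STREAM read_files("/production/data", format => "json");
```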
How tables are created and managed by Lakeflow Declarative Pipelines
Databricks automatically manages tables created with Lakeflow Declarative Pipelines, determining how updates need to be processed to correctly compute the current state of a table and performing a number of maintenance and optimization tasks.
For most operations, you should allow Lakeflow Declarative Pipelines to process all updates, inserts, and deletes to a target table. For details and limitations, see Retain manual deletes or updates.
Maintenance tasks performed by Lakeflow Declarative Pipelines
Lakeflow Declarative Pipelines performs maintenance tasks, including full OPTIMIZE and VACUUM operations, on a cadence determined by predictive optimization. Maintenance can improve query performance and reduce cost by removing old versions of tables. Maintenance runs only if a pipeline update has run since the previous maintenance cycle.
To understand how often predictive optimization runs, and to understand maintenance costs, see Predictive optimization system table reference.
Limitations
For a list of limitations, see Lakeflow Declarative Pipelines Limitations.
For a list of requirements and limitations that are specific to using Lakeflow Declarative Pipelines with Unity Catalog, see Use Unity Catalog with your Lakeflow Declarative Pipelines.
Additional resources
- Lakeflow Declarative Pipelines has full support in the Databricks REST API. See Lakeflow Declarative Pipelines API.
- For pipeline and table settings, see Lakeflow Declarative Pipelines properties reference.
- Lakeflow Declarative Pipelines SQL language reference.
- Lakeflow Declarative Pipelines Python language reference.