Develop Lakeflow Declarative Pipelines
Developing and testing pipeline code differs from working with other Apache Spark workloads. This article provides an overview of supported functionality, best practices, and considerations when developing pipeline code. For more recommendations and best practices, see Applying software development & DevOps best practices to Lakeflow Declarative Pipelines.
You must add source code to a pipeline configuration to validate code or run an update. See Configure Lakeflow Declarative Pipelines.
What files are valid for pipeline source code?
Lakeflow Declarative Pipelines code can be Python or SQL. You can have a mix of Python and SQL source code files backing a single pipeline, but each file can only contain one language. See Develop pipeline code with Python and Develop pipeline code with SQL.
Source files for pipelines are stored in your workspace. Workspace files represent Python or SQL scripts authored in the Lakeflow Pipelines editor. You can also edit the files locally in your preferred IDE, and sync to the workspace. For information about files in the workspace, see What are workspace files?. For information about editing with the Lakeflow Pipelines Editor, see Develop and debug ETL pipelines with the Lakeflow Pipelines Editor. For information about authoring code in a local IDE, see Develop Lakeflow Declarative Pipelines code in your local development environment.
If you develop Python code as modules or libraries, you must install and import the code and then call methods from a Python file configured as source code. See Manage Python dependencies for Lakeflow Declarative Pipelines.
If you need to use arbitrary SQL commands in a Python notebook, you can use the syntax pattern `spark.sql("<QUERY>")` to run SQL as Python code.
Unity Catalog functions allow you to register arbitrary Python user-defined functions for use in SQL. See User-defined functions (UDFs) in Unity Catalog.
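For example, a Python UDF registered in Unity Catalog can be called from pipeline SQL source code. The following is a minimal sketch; the catalog, schema, function, and table names are placeholders:

```sql
-- Register a Python UDF in Unity Catalog (run once, outside the pipeline).
-- The catalog, schema, and function names are placeholders.
CREATE OR REPLACE FUNCTION main.default.redact_email(email STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
return email.split("@")[0] + "@***" if email and "@" in email else email
$$;

-- Call the UDF from pipeline SQL source code.
CREATE OR REFRESH MATERIALIZED VIEW masked_users AS
SELECT user_id, main.default.redact_email(email) AS email
FROM main.default.users;
```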
Overview of Lakeflow Declarative Pipelines development features
Lakeflow Declarative Pipelines extends and leverages many Databricks features, and introduces new features and concepts. The following table provides a brief overview of concepts and features that support pipeline code development:
| Feature | Description |
|---|---|
| Development mode | Pipelines run interactively (by starting an update from the Lakeflow Pipelines Editor) use development mode. New pipelines are configured to run with development mode off when they run automatically through a schedule or automated trigger. See Development mode. |
| Dry run | A dry run update verifies the correctness of pipeline source code without running an update on any tables. See Check a pipeline for errors without waiting for tables to update. |
| Lakeflow Pipelines Editor | Python and SQL files configured as source code for Lakeflow Declarative Pipelines provide interactive options for validating code and running updates. See Develop and debug ETL pipelines with the Lakeflow Pipelines Editor. |
| Parameters | Use parameters in source code and pipeline configurations to simplify testing and extensibility, as shown in the example after this table. See Use parameters with Lakeflow Declarative Pipelines. |
| Databricks Asset Bundles | Databricks Asset Bundles allow you to move pipeline configurations and source code between workspaces. See Convert Lakeflow Declarative Pipelines into a Databricks Asset Bundle project. |
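As an illustration of parameters in SQL source code, the following sketch assumes a configuration key named `source_catalog` has been defined in the pipeline settings; the key, schema, table, and column names are placeholders:

```sql
-- Reads from a catalog supplied as a pipeline configuration value, so the
-- same source code can point at development, test, or production data.
-- The configuration key source_catalog is a placeholder for this example.
CREATE OR REFRESH MATERIALIZED VIEW orders_cleaned AS
SELECT order_id, amount, order_date
FROM ${source_catalog}.sales.orders
WHERE amount IS NOT NULL;
```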
Create sample datasets for development and testing
Databricks recommends creating development and test datasets to test pipeline logic with expected data and potentially malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:
- Select a subset of data from a production dataset.
- Use anonymized or artificially generated data for sources containing PII. To see a tutorial that uses the `faker` library to generate data for testing, see Tutorial: Build an ETL pipeline using change data capture with Lakeflow Declarative Pipelines.
- Create test data with well-defined outcomes based on downstream transformation logic.
- Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations (see the malformed-records sketch after the examples below).
For example, if you have a file that defines a dataset using the following code:
```sql
CREATE OR REFRESH STREAMING TABLE input_data
AS SELECT * FROM STREAM read_files(
  "/production/data",
  format => "json")
```
You could create a sample dataset containing a subset of the records using a query like the following:
```sql
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 AS sensor_reading
```
The following example demonstrates filtering published data to create a subset of the production data for development or testing:
```sql
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT * FROM prod.input_data
WHERE date > current_date() - INTERVAL 1 DAY
```
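To exercise the last approach in the list above, you can also define a version of the dataset that deliberately includes malformed records. The following minimal sketch reuses the column names from the earlier examples; the specific bad values are illustrative:

```sql
-- Test data that intentionally violates schema and value expectations,
-- so downstream expectations and error handling can be verified.
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT NULL AS date, 21.5 AS sensor_reading UNION ALL          -- missing date
SELECT "not-a-date" AS date, -999.0 AS sensor_reading          -- malformed date, implausible reading
```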
To use these different datasets, create multiple pipelines with the source code implementing the transformation logic. Each pipeline can read data from the `input_data` dataset but is configured to include the file that creates the dataset specific to the environment.
How do Lakeflow Declarative Pipelines datasets process data?
The following table describes how materialized views, streaming tables, and views process data:
| Dataset type | How are records processed through defined queries? |
|---|---|
| Streaming table | Each record is processed exactly once. This assumes an append-only source. |
| Materialized view | Records are processed as required to return accurate results for the current data state. Materialized views should be used for data processing tasks such as transformations, aggregations, or pre-computing slow queries and frequently used computations. Results are cached between updates. |
| View | Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets. |
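The following is a minimal SQL sketch of the three dataset types, reusing the dataset and column names from the earlier examples; the exact view syntax can vary by release:

```sql
-- Streaming table: each arriving record is ingested exactly once from an append-only source.
CREATE OR REFRESH STREAMING TABLE raw_readings
AS SELECT * FROM STREAM read_files("/production/data", format => "json");

-- View: recomputed each time it is queried; useful for intermediate logic that
-- should not be published. (Older DLT syntax used CREATE TEMPORARY LIVE VIEW.)
CREATE TEMPORARY VIEW nonnull_readings AS
SELECT * FROM raw_readings WHERE sensor_reading IS NOT NULL;

-- Materialized view: recomputed as needed to reflect the current data state,
-- with results cached between updates.
CREATE OR REFRESH MATERIALIZED VIEW readings_per_day AS
SELECT date, COUNT(*) AS reading_count
FROM nonnull_readings
GROUP BY date;
```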
Declare your first datasets in Lakeflow Declarative Pipelines
Lakeflow Declarative Pipelines introduces new syntax for Python and SQL. To learn the basics of pipeline syntax, see Develop pipeline code with Python and Develop pipeline code with SQL.
Lakeflow Declarative Pipelines separates dataset definitions from update processing, and pipeline source code is not intended for interactive execution.
How do you configure Lakeflow Declarative Pipelines?
The settings for Lakeflow Declarative Pipelines fall into two broad categories:
- Configurations that define a collection of files (known as source code) that use Lakeflow Declarative Pipelines syntax to declare datasets.
- Configurations that control pipeline infrastructure, dependency management, how updates are processed, and how tables are saved in the workspace.
Most configurations are optional, but some require careful attention, especially when configuring production pipelines. These include the following:
- To make data available outside the pipeline, you must declare a target schema to publish to the Hive metastore, or a target catalog and target schema to publish to Unity Catalog.
- Data access permissions are configured through the cluster used for execution. Ensure your cluster has appropriate permissions configured for data sources and the target storage location, if specified.
For details on using Python and SQL to write source code for pipelines, see Lakeflow Declarative Pipelines SQL language reference and Lakeflow Declarative Pipelines Python language reference.
For more on pipeline settings and configurations, see Configure Lakeflow Declarative Pipelines.
Deploy your first pipeline and trigger updates
Before processing data with Lakeflow Declarative Pipelines, you must configure a pipeline. After a pipeline is configured, you can trigger an update to calculate results for each dataset in your pipeline. To get started using Lakeflow Declarative Pipelines, see Tutorial: Build an ETL pipeline using change data capture with Lakeflow Declarative Pipelines.
What is a pipeline update?
Pipelines deploy infrastructure and recompute data state when you start an update. An update does the following:
- Starts a cluster with the correct configuration.
- Discovers all the tables and views defined and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
- Creates or updates tables and views with the most recent data available.
Pipelines can be run continuously or on a schedule depending on your use case's cost and latency requirements. See Run an update in Lakeflow Declarative Pipelines.
Ingest data with Lakeflow Declarative Pipelines
Lakeflow Declarative Pipelines supports all data sources available in Databricks.
Databricks recommends using streaming tables for most ingestion use cases. For files arriving in cloud object storage, Databricks recommends Auto Loader. You can directly ingest data with Lakeflow Declarative Pipelines from most message buses.
For more information about configuring access to cloud storage, see Cloud storage configuration.
For formats not supported by Auto Loader, you can use Python or SQL to query any format supported by Apache Spark. See Load data with Lakeflow Declarative Pipelines.
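As an illustration of ingesting from a message bus in SQL source code, the following sketch uses the `read_kafka` table-valued function; the broker address, topic name, and selected columns are placeholders, and the options you need depend on your source:

```sql
-- Streaming table fed from a Kafka topic; the broker address and topic name are placeholders.
CREATE OR REFRESH STREAMING TABLE raw_events AS
SELECT
  CAST(value AS STRING) AS payload,
  timestamp
FROM STREAM read_kafka(
  bootstrapServers => "kafka-broker:9092",
  subscribe => "events"
);
```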
Monitor and enforce data quality
You can use expectations to specify data quality controls on the contents of a dataset. Unlike a `CHECK` constraint in a traditional database, which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store data that you expect to be messy and data that must meet strict quality requirements. See Manage data quality with pipeline expectations.
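For example, an expectation that drops records with missing or non-positive readings might look like the following sketch; the dataset, constraint, and column names continue the earlier examples:

```sql
-- Rows that fail the expectation are dropped, and violation counts are
-- recorded in the pipeline event log.
CREATE OR REFRESH MATERIALIZED VIEW clean_readings (
  CONSTRAINT valid_reading EXPECT (sensor_reading IS NOT NULL AND sensor_reading > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM input_data;
```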
How are Lakeflow Declarative Pipelines and Delta Lake related?
Lakeflow Declarative Pipelines extends the functionality of Delta Lake. Because tables created and managed by Lakeflow Declarative Pipelines are Delta tables, they have the same guarantees and features provided by Delta Lake. See What is Delta Lake in Databricks?.
Lakeflow Declarative Pipelines adds several table properties in addition to the many table properties that can be set in Delta Lake. See Lakeflow Declarative Pipelines properties reference and Delta table properties reference.
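For instance, pipeline-specific and Delta Lake table properties can be set together in a `TBLPROPERTIES` clause. This is a minimal sketch; the property values shown are illustrative, so confirm each property against the references above before relying on it:

```sql
-- Sets a Lakeflow Declarative Pipelines property (pipelines.reset.allowed)
-- alongside a standard Delta Lake property (delta.appendOnly).
CREATE OR REFRESH STREAMING TABLE raw_readings
TBLPROPERTIES (
  "pipelines.reset.allowed" = "false",  -- prevent a full refresh from clearing this table
  "delta.appendOnly" = "true"           -- enforce append-only writes at the Delta layer
)
AS SELECT * FROM STREAM read_files("/production/data", format => "json");
```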
How tables are created and managed by Lakeflow Declarative Pipelines
Databricks automatically manages tables created with Lakeflow Declarative Pipelines, determining how updates need to be processed to correctly compute the current state of a table and performing a number of maintenance and optimization tasks.
For most operations, you should allow Lakeflow Declarative Pipelines to process all updates, inserts, and deletes to a target table. For details and limitations, see Retain manual deletes or updates.
Maintenance tasks performed by Lakeflow Declarative Pipelines
Lakeflow Declarative Pipelines performs maintenance tasks, including full OPTIMIZE and VACUUM operations, on a cadence determined by predictive optimization. Maintenance can improve query performance and reduce cost by removing old versions of tables. Maintenance runs only if a pipeline update has run since the previous maintenance cycle.
To understand how often predictive optimization runs, and to understand maintenance costs, see Predictive optimization system table reference.
Limitations
For a list of limitations, see Lakeflow Declarative Pipelines Limitations.
For a list of requirements and limitations that are specific to using Lakeflow Declarative Pipelines with Unity Catalog, see Use Unity Catalog with your Lakeflow Declarative Pipelines.
Additional resources
- Lakeflow Declarative Pipelines has full support in the Databricks REST API. See Lakeflow Declarative Pipelines API.
- For pipeline and table settings, see Lakeflow Declarative Pipelines properties reference.
- Lakeflow Declarative Pipelines SQL language reference.
- Lakeflow Declarative Pipelines Python language reference.