Lakeflow Declarative Pipelines Python language reference
This section has details for the Lakeflow Declarative Pipelines Python programming interface.
- For conceptual information and an overview of using Python for Lakeflow Declarative Pipelines, see Develop pipeline code with Python.
- For SQL reference, see the Lakeflow Declarative Pipelines SQL language reference.
- For details specific to configuring Auto Loader, see What is Auto Loader?.
dp module overview
Lakeflow Declarative Pipelines Python functions are defined in the `pyspark.pipelines` module (imported as `dp`). Your pipelines implemented with the Python API must import this module:
from pyspark import pipelines as dp
The public, open source version of `pyspark` also includes the `pipelines` module. Much of the code is compatible with the version that is used within Databricks, and code written in the open source version works in Databricks. However, a few features in the Databricks version of `pipelines` will not work with OSS `pyspark`. The following features are not compatible:
- `dp.create_auto_cdc_flow`
- `dp.create_auto_cdc_from_snapshot_flow`
- `@dp.expect(...)`
- `@dp.temporary_view`
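For example, the expectations decorator in this list is available only when running on Databricks. The following is a minimal sketch; it assumes the decorator takes an expectation name and a SQL boolean expression, as the earlier `dlt` expectations API did, and the `raw_events` source table is hypothetical:

```python
from pyspark import pipelines as dp

# `spark` is the SparkSession that the pipeline runtime provides.
@dp.materialized_view
# Databricks-only: records a data quality metric for rows where the
# expression evaluates to false.
@dp.expect("valid_timestamp", "event_timestamp IS NOT NULL")
def clean_events():
    return spark.read.table("raw_events")  # hypothetical source table
```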
What happened to @dlt?
Previously, Databricks used the `dlt` module to support Lakeflow Declarative Pipelines functionality. The `dlt` module has been replaced by the `pyspark.pipelines` module. You may still use `dlt`, but Databricks recommends using `pipelines`.
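The change is primarily to the import. A minimal before-and-after sketch, assuming the `table` decorator name carries over (it appears in the API reference below); `@dlt.table` is the legacy decorator and the `orders` source table is hypothetical:

```python
# Legacy style (still supported, but no longer recommended)
import dlt

@dlt.table
def orders_legacy():
    return spark.read.table("orders")  # hypothetical source table

# Recommended style
from pyspark import pipelines as dp

@dp.table
def orders():
    return spark.read.table("orders")  # hypothetical source table
```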
Functions for dataset definitions
Lakeflow Declarative Pipelines uses Python decorators for defining datasets such as materialized views and streaming tables. See Functions to define datasets.
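As a minimal sketch of the decorator pattern: the decorated function returns a DataFrame, and the function name typically becomes the dataset name. The `materialized_view` and `table` decorator names come from the API reference below; the source tables, and the use of a streaming read with `@dp.table` to define a streaming table, follow the earlier `dlt` convention and are assumptions here:

```python
from pyspark import pipelines as dp

# `spark` is the SparkSession that the pipeline runtime provides.

# A materialized view, recomputed from a batch read of its source.
@dp.materialized_view
def customer_counts():
    return (
        spark.read.table("samples.tpch.customer")  # hypothetical source table
        .groupBy("c_mktsegment")
        .count()
    )

# A streaming table, built incrementally from a streaming read of its source.
@dp.table
def orders_bronze():
    return spark.readStream.table("samples.tpch.orders")  # hypothetical source
```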
API reference
- append_flow
- create_auto_cdc_flow
- create_auto_cdc_from_snapshot_flow
- create_sink
- create_streaming_table
- Expectations
- materialized_view
- table
- temporary_view
Considerations for Python Lakeflow Declarative Pipelines
The following are important considerations when you implement pipelines with the Lakeflow Declarative Pipelines Python interface:
- Lakeflow Declarative Pipelines evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view. Arbitrary Python logic included in dataset definitions might lead to unexpected behavior.
- Do not try to implement custom monitoring logic in your dataset definitions. See Define custom monitoring of Lakeflow Declarative Pipelines with event hooks.
- The function used to define a dataset must return a Spark DataFrame. Do not include logic in your dataset definitions that does not relate to a returned DataFrame.
- Never use methods that save or write to files or tables as part of your Lakeflow Declarative Pipelines dataset code. Examples of Apache Spark operations that should never be used in Lakeflow Declarative Pipelines code (see the sketch after this list):
  - `collect()`
  - `count()`
  - `toPandas()`
  - `save()`
  - `saveAsTable()`
  - `start()`
  - `toTable()`
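As a minimal sketch of a definition that follows these rules, the function below only builds and returns a DataFrame and leaves execution and persistence to the pipeline runtime (the decorator name is from the API reference above; the `orders` source table and filter are hypothetical):

```python
from pyspark import pipelines as dp

# `spark` is the SparkSession that the pipeline runtime provides.
@dp.materialized_view
def high_value_orders():
    # Build and return the DataFrame only: no collect()/count() actions,
    # and no save()/saveAsTable() calls. The pipeline runtime materializes
    # the returned DataFrame for you.
    return (
        spark.read.table("orders")  # hypothetical source table
        .where("o_totalprice > 10000")
    )
```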