Lakeflow Declarative Pipelines Python language reference

This section describes the Lakeflow Declarative Pipelines Python programming interface.

pipelines module overview

Lakeflow Declarative Pipelines Python functions are defined in the pyspark.pipelines module (imported as dp). Pipelines implemented with the Python API must import this module:

```python
from pyspark import pipelines as dp
```
Note: The pipelines module is only available in the context of a pipeline. It is not available to Python code running outside of pipelines. For more information about editing pipeline code, see Develop and debug ETL pipelines with the Lakeflow Pipelines Editor.

Apache Spark pipelines

Apache Spark includes declarative pipelines beginning in Spark 4.1, available through the pyspark.pipelines module. The Databricks Runtime extends these open source capabilities with additional APIs and integrations for managed production use.

Code written with the open-source pipelines module runs without modification on Databricks. The following features are not part of Apache Spark:

  • dp.create_auto_cdc_flow
  • dp.create_auto_cdc_from_snapshot_flow
  • @dp.expect(...)
  • @dp.temporary_view

What happened to @dlt?

Previously, Databricks used the dlt module to support Lakeflow Declarative Pipelines functionality. The dlt module has been replaced by the pyspark.pipelines module. You can still use dlt, but Databricks recommends using pyspark.pipelines.

Functions for dataset definitions

Lakeflow Declarative Pipelines uses Python decorators for defining datasets such as materialized views and streaming tables. See Functions to define datasets.

API reference

Considerations for Python Lakeflow Declarative Pipelines

The following are important considerations when you implement pipelines with the Lakeflow Declarative Pipelines Python interface:

  • Lakeflow Declarative Pipelines evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view. Arbitrary Python logic included in dataset definitions might lead to unexpected behavior.
  • Do not try to implement custom monitoring logic in your dataset definitions. See Define custom monitoring of Lakeflow Declarative Pipelines with event hooks.
  • The function used to define a dataset must return a Spark DataFrame. Do not include logic in your dataset definitions that does not relate to a returned DataFrame.
  • Never use methods that save or write to files or tables as part of your Lakeflow Declarative Pipelines dataset code.
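The first consideration, that dataset-defining code is evaluated more than once, can be illustrated without Spark at all. The following is a minimal sketch in plain Python. The `dataset` decorator, `evaluate_all` helper, and function names are hypothetical stand-ins and are not part of pyspark.pipelines. They only mimic a runtime that registers functions and calls them during both planning and a run.

```python
# Hypothetical stand-in for a pipeline runtime. This is NOT the pyspark.pipelines
# API; it only mimics "register a function, then call it more than once".
_registry = {}


def dataset(fn):
    """Register a dataset-defining function under its own name."""
    _registry[fn.__name__] = fn
    return fn


def evaluate_all():
    """Call every registered function once, as a planner or a run might."""
    return {name: fn() for name, fn in _registry.items()}


audit_log = []  # stands in for an external side effect (a file, a table, an API call)


@dataset
def orders_with_side_effect():
    audit_log.append("wrote audit record")  # arbitrary logic: repeats on every evaluation
    return ["order-1", "order-2"]  # a plain list stands in for a Spark DataFrame


@dataset
def orders_pure():
    return ["order-1", "order-2"]  # only describes the data, so repeat calls are harmless


planning_pass = evaluate_all()  # first evaluation, for example during planning
run_pass = evaluate_all()       # second evaluation, for example during the run
print(len(audit_log))           # the side effect ran once per evaluation
```

In this sketch both functions return the same result on every pass, but only the first one leaves a growing trail of side effects. That is the kind of unexpected behavior the guidance above is meant to prevent.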

Examples of Apache Spark operations that should never be used in Lakeflow Declarative Pipelines code, because they trigger execution or write data instead of only defining a DataFrame:

  • collect()
  • count()
  • toPandas()
  • save()
  • saveAsTable()
  • start()
  • toTable()
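One way to catch these calls before a pipeline is deployed is a small static check over the pipeline source. The following is a rough sketch that uses only the Python standard library `ast` module. The `find_forbidden_calls` helper, the `SAMPLE` source, and the `raw_customers` table name are hypothetical and are not part of Lakeflow Declarative Pipelines. The sample is only parsed, never executed, so it does not need a pipeline context.

```python
import ast

# Method names listed above as operations to avoid in dataset code.
FORBIDDEN = {"collect", "count", "toPandas", "save", "saveAsTable", "start", "toTable"}


def find_forbidden_calls(source):
    """Return sorted (line, method) pairs for calls such as df.collect() in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in FORBIDDEN
        ):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)


# Hypothetical pipeline source used only as input text for the check.
SAMPLE = '''
from pyspark import pipelines as dp

@dp.materialized_view
def customers():
    df = spark.read.table("raw_customers")
    n = df.count()
    return df
'''

print(find_forbidden_calls(SAMPLE))  # → [(7, 'count')]
```

The check matches on method names only, so it can produce false positives. For example, an aggregate expression such as `F.count("id")` or a string's `count()` method would also be flagged. Treat the result as a prompt for review, not a definitive verdict.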