Skip to main content

Choose between SQL and Python

Lakeflow Spark Declarative Pipelines (SDP) supports both SQL and Python interfaces for defining batch and streaming pipelines. Both interfaces produce the same underlying dataflow graph, so they provide equivalent functionality for most data processing. They differ in flexibility, accessibility, and feature coverage.

Use this guidance to decide which interface to use:

  • If you can express your logic in SQL, use SQL.
  • If you need programmatic control or a Python-only feature, use Python.
  • If you're more comfortable with Python, use Python. It covers the full pipeline feature set, so familiarity is reason enough. The same isn't true in reverse: SQL doesn't cover every feature, so don't choose it on familiarity alone.

You can also combine both interfaces in the same pipeline. See Mix SQL and Python.

When to use SQL

SQL is a good fit when you want:

  • Readable, declarative definitions: Clear logic that data engineers and analysts can maintain.
  • Standard table types: Pipelines built mostly from streaming tables and materialized views.
  • Linear transformation chains: Straightforward ingestion and transformation, such as a bronze-to-silver-to-gold flow, without procedural logic.
  • Standalone tables: Standalone streaming tables or materialized views, which you author in SQL.

For an overview of developing pipelines in SQL, see Develop Lakeflow Spark Declarative Pipelines code with SQL.

When to use Python

Python is a good fit when you need:

  • Programmatic control: Loops, conditionals, and metaprogramming to generate pipeline definitions dynamically.
  • External libraries: Python packages such as faker or boto3. See Manage Python dependencies for pipelines.
  • User-defined functions (UDFs): You define UDFs in Python and can call them from both Python and SQL source files. See User-defined scalar functions - Python.
  • Python-only features:
    • create_auto_cdc_from_snapshot_flow() to apply change data capture from a database snapshot.
    • create_sink() and foreach_batch_sink() to write to external event streaming or Delta destinations.

For an overview of developing pipelines in Python, see Develop pipeline code with Python.

Mix SQL and Python

A single pipeline can combine SQL and Python definitions, but each language must be in a separate source file. For example, you can define your bronze and silver tables in Python and your gold tables in SQL.

Feature availability

The following table compares how each interface supports common pipeline features:

Feature

SQL

Python

streaming table

CREATE STREAMING TABLE

create_streaming_table(), table()

materialized view

CREATE MATERIALIZED VIEW

materialized_view()

Temporary view

CREATE TEMPORARY VIEW

temporary_view()

Private table

CREATE PRIVATE STREAMING TABLE, CREATE PRIVATE MATERIALIZED VIEW

table(private=True)

Auto CDC

AUTO CDC ... INTO

create_auto_cdc_flow()

Auto CDC from snapshot

Not supported

create_auto_cdc_from_snapshot_flow()

Flow

CREATE FLOW

append_flow()

Sink

Not supported

create_sink(), foreach_batch_sink()

Expectations

CONSTRAINT ... EXPECT

expect(), expect_or_drop(), expect_or_fail(), and the expect_all variants

Decision summary

If you need...

Goal

Recommended interface

Simplicity and readability

SQL

Quick declarative setup

SQL

Standalone streaming table or materialized view

SQL

Conditional or looping logic

Python

UDFs or external Python libraries

Python

Auto CDC from snapshot or sinks

Python

Full programmatic control and modularity

Python