Skip to main content

Use Python with standalone pipelines

You can create and refresh standalone materialized views and streaming tables from a notebook using Python. Author your pipeline in a Python notebook and run them with spark.sql(). This lets you manage standalone pipelines alongside your other Python-based notebook workflows.

Python source for standalone pipelines requires a notebook attached to serverless general compute. You can't use Python to create or refresh standalone pipelines from a Databricks SQL warehouse, because a warehouse runs SQL statements, not Python notebooks. To use a SQL warehouse instead, see Use standalone materialized views and Use standalone streaming tables.

Beta

Creating and refreshing standalone materialized views and streaming tables from a notebook on serverless general compute is in Beta and available in select regions. See Notebooks.

Requirements

To create and refresh standalone pipelines with Python, you need a notebook attached to serverless general compute on Databricks Runtime 18.1 or above. For the complete list of requirements, including regional availability and permissions, see Notebooks.

How it works

In a Python notebook, pass the same statements you would run from a Databricks SQL warehouse to spark.sql(). The standalone materialized view and streaming table syntax is identical; only the way you submit the statement differs. As with a warehouse, each CREATE or REFRESH statement runs a serverless pipeline to process the operation.

The spark session is available by default in Databricks notebooks, so no import is required.

Create a materialized view

The following example creates the materialized view mv1 from the base table base_table1:

Python
spark.sql("""
CREATE OR REPLACE MATERIALIZED VIEW mv1
AS SELECT
date,
sum(sales) AS sum_of_sales
FROM base_table1
GROUP BY date
""")

For full CREATE MATERIALIZED VIEW details, such as scheduled and triggered refreshes, see Create a materialized view.

Create a streaming table

The following example creates the streaming table sales from the raw_data table:

Python
spark.sql("""
CREATE OR REFRESH STREAMING TABLE sales
AS SELECT product, price FROM STREAM raw_data
""")

For full CREATE STREAMING TABLE details, including loading files with Auto Loader and scheduling, see Use standalone streaming tables.

Refresh a materialized view or streaming table

Use a REFRESH statement to update a standalone table with the latest data from its source:

Python
spark.sql("REFRESH MATERIALIZED VIEW mv1")
spark.sql("REFRESH STREAMING TABLE sales")

On serverless general compute, refreshes are synchronous. Asynchronous refreshes (the ASYNC keyword) are not supported. See Serverless general compute.

Parameterize statements

To pass values from your Python code into a statement instead of hardcoding them, use named parameter markers in the SQL and supply their values through the args argument of spark.sql(). Use a marker such as :min_sales directly for literal values. Wrap the marker in IDENTIFIER() only when the parameter is an object name, such as a table, view, or schema, because identifiers can't be substituted as plain string values.

The following example parameterizes both the materialized view name and a filter value:

Python
mv_name = "main.sales.regional_sales"
min_sales = 1000

spark.sql("""
CREATE OR REPLACE MATERIALIZED VIEW IDENTIFIER(:mv)
AS SELECT
region,
sum(sales) AS sum_of_sales
FROM base_table1
WHERE sales > :min_sales
GROUP BY region
""", args={
"mv": mv_name,
"min_sales": min_sales,
})

For more information, see Parameter markers and IDENTIFIER clause.

Run other statements

You can run any standalone materialized view or streaming table statement from a Python notebook by passing it to spark.sql(), including statements to schedule refreshes, alter a table, or drop a table. To understand how to use materialized views and streaming tables, including SQL syntax, see Use standalone materialized views and Use standalone streaming tables.

Limitations

Standalone materialized views and streaming tables created on serverless general compute have additional limitations, such as no support for asynchronous refreshes and no per-table cost attribution. For the full list, see Serverless general compute.

Additional resources