Use Python with standalone pipelines
You can create and refresh standalone materialized views and streaming tables from a notebook using Python. Author your pipeline in a Python notebook and run them with spark.sql(). This lets you manage standalone pipelines alongside your other Python-based notebook workflows.
Python source for standalone pipelines requires a notebook attached to serverless general compute. You can't use Python to create or refresh standalone pipelines from a Databricks SQL warehouse, because a warehouse runs SQL statements, not Python notebooks. To use a SQL warehouse instead, see Use standalone materialized views and Use standalone streaming tables.
Requirements
To create and refresh standalone pipelines with Python, you need a notebook attached to serverless general compute on Databricks Runtime 18.1 or above. For the complete list of requirements, including regional availability and permissions, see Notebooks.
How it works
In a Python notebook, pass the same statements you would run from a Databricks SQL warehouse to spark.sql(). The standalone materialized view and streaming table syntax is identical; only the way you submit the statement differs. As with a warehouse, each CREATE or REFRESH statement runs a serverless pipeline to process the operation.
The spark session is available by default in Databricks notebooks, so no import is required.
Create a materialized view
The following example creates the materialized view mv1 from the base table base_table1:
spark.sql("""
CREATE OR REPLACE MATERIALIZED VIEW mv1
AS SELECT
date,
sum(sales) AS sum_of_sales
FROM base_table1
GROUP BY date
""")
For full CREATE MATERIALIZED VIEW details, such as scheduled and triggered refreshes, see Create a materialized view.
Create a streaming table
The following example creates the streaming table sales from the raw_data table:
spark.sql("""
CREATE OR REFRESH STREAMING TABLE sales
AS SELECT product, price FROM STREAM raw_data
""")
For full CREATE STREAMING TABLE details, including loading files with Auto Loader and scheduling, see Use standalone streaming tables.
Refresh a materialized view or streaming table
Use a REFRESH statement to update a standalone table with the latest data from its source:
spark.sql("REFRESH MATERIALIZED VIEW mv1")
spark.sql("REFRESH STREAMING TABLE sales")
On serverless general compute, refreshes are synchronous. Asynchronous refreshes (the ASYNC keyword) are not supported. See Serverless general compute.
Parameterize statements
To pass values from your Python code into a statement instead of hardcoding them, use named parameter markers in the SQL and supply their values through the args argument of spark.sql(). Use a marker such as :min_sales directly for literal values. Wrap the marker in IDENTIFIER() only when the parameter is an object name, such as a table, view, or schema, because identifiers can't be substituted as plain string values.
The following example parameterizes both the materialized view name and a filter value:
mv_name = "main.sales.regional_sales"
min_sales = 1000
spark.sql("""
CREATE OR REPLACE MATERIALIZED VIEW IDENTIFIER(:mv)
AS SELECT
region,
sum(sales) AS sum_of_sales
FROM base_table1
WHERE sales > :min_sales
GROUP BY region
""", args={
"mv": mv_name,
"min_sales": min_sales,
})
For more information, see Parameter markers and IDENTIFIER clause.
Run other statements
You can run any standalone materialized view or streaming table statement from a Python notebook by passing it to spark.sql(), including statements to schedule refreshes, alter a table, or drop a table. To understand how to use materialized views and streaming tables, including SQL syntax, see Use standalone materialized views and Use standalone streaming tables.
Limitations
Standalone materialized views and streaming tables created on serverless general compute have additional limitations, such as no support for asynchronous refreshes and no per-table cost attribution. For the full list, see Serverless general compute.