Use materialized views in Databricks SQL
Preview
This feature is in Public Preview. To sign up for access, fill out this form.
This article describes how to create and use materialized views in Databricks SQL to improve performance and reduce the cost of your data processing and analysis workloads.
What are materialized views?
In Databricks SQL, materialized views are Unity Catalog managed tables that allow users to precompute results based on the latest version of data in source tables. Materialized views on Databricks differ from other implementations as the results returned reflect the state of data when the materialized view was last refreshed rather than always updating results when the materialized view is queried. You can manually refresh materialized views or schedule refreshes.
Materialized views are powerful for data processing workloads such as extract, transform, and load (ETL) processing. Materialized views provide a simple, declarative way to process data for compliance, corrections, aggregations, or general change data capture (CDC). Materialized views reduce cost and improve query latency by pre-computing slow queries and frequently used computations. Materialized views also enable easy-to-use transformations by cleaning, enriching, and denormalizing base tables. Materialized views can reduce costs while providing a simplified end-user experience because, in some cases, they can incrementally compute changes from the base tables.
Materialized views were first supported in the Databricks Lakehouse with the launch of Delta Live Tables. When you create a materialized view in a Databricks SQL warehouse, a Delta Live Tables pipeline is created to process refreshes to the materialized view. You can monitor the status of refresh operations in the Delta Live Tables UI, the Delta Live Tables API, or the Delta Live Tables CLI. See View the status of a materialized view refresh.
Requirements
You must use a Unity Catalog-enabled workspace to create and refresh materialized views.
To create Databricks SQL materialized views your account must be enabled to use serverless SQL warehouses.
To learn about restrictions when using materialized views with Databricks SQL, see Limitations.
Create a materialized view
To create a materialized view, use the CREATE MATERIALIZED VIEW
statement. See CREATE MATERIALIZED VIEW in the Databricks SQL reference. To submit a create statement, use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API.
Note
The user who creates a materialized view is the materialized view owner and needs to have the following permissions:
SELECT
privilege on the base tables referenced by the materialized view.USE CATALOG
andUSE SCHEMA
privileges on the catalog and schema containing the source tables for the materialized view.USE CATALOG
andUSE SCHEMA
privileges on the target catalog and schema for the materialized view.CREATE TABLE
andCREATE MATERIALIZED VIEW
privileges on the schema containing the materialized view.
The following example creates the materialized view mv1
from the base table base_table1
:
CREATE MATERIALIZED VIEW mv1
AS SELECT
date, sum(sales) AS sum_of_sales
FROM
table1
GROUP BY
date
How are materialized views created?
Databricks SQL materialized view CREATE
operations use a Databricks SQL warehouse to create and load data in the materialized view. Because creating a materialized view is a synchronous operation in the Databricks SQL warehouse, the CREATE MATERIALIZED VIEW
command blocks until the materialized view is created and the initial data load finishes. A Delta Live Tables pipeline is automatically created for every Databricks SQL materialized view. When the materialized view is refreshed, an update to the Delta Live Tables pipeline is started to process the refresh.
Load data from external systems
Databricks recommends loading external data using Lakehouse Federation for supported data sources. For information on loading data from sources not supported by Lakehouse Federation, see Interact with external data on Databricks.
Refresh a materialized view
The REFRESH
operation refreshes the materialized view to reflect the latest changes to the base table. To refresh a materialized view, use the REFRESH MATERIALIZED VIEW
statement. See REFRESH (MATERIALIZED VIEW and STREAMING TABLE) in the Databricks SQL reference. To submit a refresh statement, use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API.
Only the owner can REFRESH
the materialized view.
The following example refreshes the mv1
materialized view:
REFRESH MATERIALIZED VIEW mv1
How are Databricks SQL materialized views refreshed?
Databricks SQL materialized views use Delta Live Tables for refresh operations. When the materialized view is refreshed, an update to the Delta Live Tables pipeline managing the materialized view is started to process the refresh.
Databricks SQL materialized view refreshes are asynchronous. When a materialized view refresh is initiated, a background job starts on a Delta Live Tables cluster and the command returns immediately before the data load completes.
Because the refresh is managed by a Delta Live Tables pipeline, the Databricks SQL warehouse used to create the materialized view is not used and does not need to be running during the refresh operation.
Some queries can be incrementally refreshed. See Incremental refresh of materialized views. If an incremental refresh cannot be performed, a full refresh is performed instead.
Schedule materialized view refreshes
You can configure a Databricks SQL materialized view to refresh automatically based on a defined schedule. Configure this schedule with the SCHEDULE
clause when you create the materialized view or add a schedule with the ALTER VIEW statement. When a schedule is created, a new Databricks job is automatically configured to process the update. You can view the schedule any time with the DESCRIBE EXTENDED
statement.
Update the definition of a materialized view
To update the definition of a materialized view, you must first drop, then re-create the materialized view.
Drop a materialized view
Note
To submit the command to drop a materialized view, you must be the owner of that materialized view.
To drop a materialized view, use the DROP VIEW statement. To submit a DROP
statement, you can use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API. The following example drops the mv1
materialized view:
DROP MATERIALIZED VIEW mv1
Describe a materialized view
To retrieve the columns and data types for a materialized view, use the DESCRIBE
statement. To retrieve the columns, data types, and metadata such as owner, location, creation time, and refresh status for a materialized view, use DESCRIBE EXTENDED
. To submit a DESCRIBE
statement, use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API.
View the status of a materialized view refresh
Note
Because a Delta Live Tables pipeline manages materialized view refreshes, there is latency incurred by the startup time for the pipeline. This time might be in the seconds to minutes, in addition to the time required to perform the refresh.
You can view the status of a materialized view refresh by viewing the pipeline that manages the materialized view in the Delta Live Tables UI or by viewing the Refresh Information returned by the DESCRIBE EXTENDED
command for the materialized view.
View the refresh status in the Delta Live Tables UI
By default, the Delta Live Tables pipeline that manages a materialized view is not visible in the Delta Live Tables UI. To view the pipeline in the Delta Live Tables UI, you must directly access the link to the pipeline’s Pipeline details page. To access the link:
If you submit the
REFRESH
command in the SQL editor, follow the link in the Results panel.Follow the link returned by the
DESCRIBE EXTENDED
statement.On the lineage tab for the materialized view, click Pipelines and then click the pipeline link.
Stop an active refresh
To stop an active refresh in the Delta Live Tables UI, in the Pipeline details page click Stop to stop the pipeline update. You can also stop the refresh with the Databricks CLI or the POST /api/2.0/pipelines/{pipeline_id}/stop operation in the Pipelines API.
Control access to materialized views
Materialized views support rich access controls to support data-sharing while avoiding exposing potentially private data. A materialized view owner can grant SELECT
privileges to other users. Users with SELECT
access to the materialized view do not need SELECT
access to the tables referenced by the materialized view. This access control enables data sharing while controlling access to the underlying data.
Grant privileges to a materialized view
To grant access to a materialized view, use the GRANT
statement:
GRANT
privilege_type [, privilege_type ] ...
ON <mv_name> TO principal
The privilege_type can be:
SELECT
- the user can SELECT the materialized view.
The following example creates a materialized view and grants select privileges to a user:
CREATE MV <mv_name> AS “SELECT FROM <base_table>”
GRANT SELECT ON <mv_name> TO user
Revoke privileges from a materialized view
To revoke access from a materialized view, use the REVOKE
statement:
REVOKE
privilege_type [, privilege_type ]
ON <name> FROM principal
When SELECT
privileges on a base table are revoked from the materialized view owner or any other user who has been granted SELECT
privileges to the materialized view, or the base table is dropped, the materialized view owner or user granted access is still able to query the materialized view. However, the following behavior occurs:
The materialized view owner or others who have lost access to a materialized view can no longer
REFRESH
that materialized view, and the materialized view will become stale.If automated with a schedule, the next scheduled
REFRESH
fails or is not run.
The following example revokes the SELECT
privilege from mv1
:
REVOKE SELECT ON mv1 FROM user1;
Incremental refresh of materialized views
Materialized views always return the semantically correct results for the defining query based on the latest snapshot version of data available in base tables at the time of the last refresh. When possible, results are updated incrementally, but results are identical to those that would be delivered by full recomputation.
While any query can be defined in a materialized view, only a subset of queries are incrementally refreshed. If the materialized view cannot be incrementally refreshed, the refresh process uses a full refresh instead. To determine which mode is used, see Determining if an incremental or full refresh is used.
What type of materialized views are incrementally refreshed?
The following describes the expressions, keywords, and clauses supporting incremental refresh.
Note
Materialized views that query certain types of base tables have a greater scope of incremental refresh. See (Advanced) Defining materialized views that can be incrementally refreshed.
The materialized view can query only a single table or perform an
INNER JOIN
andUNION ALL
(or combinations ofINNER JOIN
andUNION ALL
) on multiple tables.The materialized view must not have a
GROUP BY
clause at the top-level. For example,table_a JOIN (table_b group by ...)
is not incrementally refreshed.The materialized view
SELECT
clause supports the following aggregate functions. Any aggregate function not in this list is not supported:SUM
COUNT
Incremental refresh is not supported for materialized views that include:
Window functions.
HAVING
clauses.Subqueries in
SELECT
orWHERE
clauses.
Change data feed is required on the base tables, except for certain advanced use cases. See Enabling change data feed.
Functions used in the query to create a materialized view must be deterministic. For example, using
CURRENT_TIMESTAMP
is not permitted.Materialized views cannot be created using the Delta Lake time travel feature.
LEFT JOIN
s andOUTER JOIN
s are not supported.
Enabling change data feed
Change data feed can be enabled on a base table by setting the delta.enableChangeDataFeed
table property using the following syntax:
ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
Change data feed is not enabled by default on materialized views. To enable the change data feed on a materialized view created in Databricks SQL, specify the appropriate table setting at creation time. If you have an existing materialized view, you must drop it and re-create it.
To enable change data feed on a materialized view created in Delta Live Tables, you (or the pipeline owner) must modify the Delta Live Tables source code to add the table property.
The following example enables change data feed on a materialized view:
CREATE MATERIALIZED VIEW <mv_name>
TBLPROPERTIES(delta.enableChangeDataFeed = true)
As <query>
(Advanced) Defining materialized views that can be incrementally refreshed
The following describes advanced use cases where materialized views can be incrementally refreshed. Change data feed is not required for incremental refresh in the following advanced use cases.
Note
The following limitations still apply to the use cases described here:
Functions used in the query to create a materialized view must be deterministic. For example, using
CURRENT_TIMESTAMP
is not permitted.Materialized views cannot be created using the Delta Lake time travel feature.
LEFT JOIN
s andOUTER JOIN
s are not supported.
Queries over tables with partitions
Materialized views can leverage partitions to avoid full refreshes. When a materialized view shares the same partition keys as its base tables, the materialized view can detect changed partitions and modify only the partitions in the materialized view that are necessary to materialize the result. For large tables, this might save significant time and resources.
The following are the expressions, keywords, and clauses that support incremental refresh when defining materialized views on tables with partitions. In the following scenarios, a materialized view is “co-partitioned” with a base table if the materialized view shares at least one partition key with the base tables.
A partitioned materialized view can query a single table.
The materialized view must be co-partitioned with the base table.
A partitioned materialized view can do a
UNION ALL
over multiple tables.The materialized view must be co-partitioned with all base tables.
A partitioned materialized view can use an
INNER JOIN
on a fact table with a dimension table.The materialized view must be co-partitioned with the fact table.
The dimension table is not required to be partitioned. If an update is detected on the dimension table, the materialized view is fully refreshed.
Queries over tables that are append-only
A common pattern for tables in the bronze layer is that new rows are only inserted into the table. Materialized views automatically detect append-only base tables and incrementally refresh by inserting only new rows into the materialized view. For large tables, this might save significant time and resources.
The following are required to support incremental refresh when defining materialized views on top of append-only tables:
A materialized view can query only a single table or
INNER JOIN
andUNION ALL
(or combinations ofINNER JOIN and UNION ALL
) on multiple tables.The base tables must be append-only.
If a row is updated or deleted into the base table of the materialized view by an external system, the materialized view is fully refreshed.
Determining if an incremental or full refresh is used
To optimize the performance of materialized view refreshes, Databricks uses a cost model to select the technique used for the refresh. The following table describes these techniques:
Technique |
Incremental refresh? |
Description |
---|---|---|
|
No |
The materialized view was fully recomputed |
|
Not applicable |
The materialized view was not updated because no changes to the base table were detected. |
|
Yes |
The materialized view was incrementally refreshed using the specified technique. |
To determine the technique used, query the Delta Live Tables event log where the event_type
is planning_information
:
SELECT
timestamp,
message
FROM
event_log(TABLE(<fully qualified MV name>))
WHERE
event_type = 'planning_information'
ORDER BY
timestamp desc
Limitations
There are restrictions on how MVs can be managed and where they can be queried:
Databricks SQL materialized views can only be created and refreshed in pro SQL warehouses and serverless SQL warehouses.
Databricks SQL materialized views can only be queried from Databricks SQL warehouses, Delta Live Tables, and shared clusters running Databricks Runtime 11.3 or greater. You cannot query materialized views from Single User access mode clusters.
The base tables for the materialized view must be registered in Unity Catalog as managed or external tables.
The owner of the Databricks SQL materialized view cannot be changed.
Materialized views do not support identity columns or surrogate keys.
You cannot run ad hoc
OPTIMIZE
orVACUUM
commands against materialized views.If a materialized view uses a sum aggregate over a
NULL
-able column and onlyNULL
values remain in that column, the materialized views resultant aggregate value is zero instead ofNULL
.Although column references do not require aliases, non-column reference expressions require an alias. For example, the following statement is not allowed:
SELECT col1, SUM(col2) FROM t GROUP BY col1
. Instead, use the following statement:SELECT col1, SUM(col2) AS sum_col2 FROM t GROUP BY col1
.You cannot grant the
REFRESH
permission on a materialized view in the Permissions tab for the materialized view in Catalog Explorer.The underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that do not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views. Because the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers. For example, suppose the definition of a materialized view includes a
COUNT(DISTINCT field_a)
clause. Even though the materialized view definition only includes the aggregateCOUNT DISTINCT
clause, the underlying files will contain a list of the actual values offield_a
.