Use materialized views in Databricks SQL
Note
If you need to use an AWS PrivateLink connection with your materialized view, contact your Databricks representative.
This article describes how to create and use materialized views in Databricks SQL to improve performance and reduce the cost of your data processing and analysis workloads.
What are materialized views?
In Databricks SQL, materialized views are Unity Catalog managed tables that allow users to precompute results based on the latest version of data in source tables. Materialized views on Databricks differ from other implementations as the results returned reflect the state of data when the materialized view was last refreshed rather than always updating results when the materialized view is queried. You can manually refresh materialized views or schedule refreshes.
Materialized views are powerful for data processing workloads such as extract, transform, and load (ETL) processing. Materialized views provide a simple, declarative way to process data for compliance, corrections, aggregations, or general change data capture (CDC). Materialized views reduce cost and improve query latency by pre-computing slow queries and frequently used computations. Materialized views also enable easy-to-use transformations by cleaning, enriching, and denormalizing base tables. Materialized views can reduce costs while providing a simplified end-user experience because, in some cases, they can incrementally compute changes from the base tables.
Materialized views were first supported in Databricks with the launch of Delta Live Tables. When you create a materialized view in a Databricks SQL warehouse, a Delta Live Tables pipeline is created to process refreshes to the materialized view. You can monitor the status of refresh operations in the Delta Live Tables UI, the Delta Live Tables API, or the Delta Live Tables CLI. See View the status of a materialized view refresh.
Requirements
To create or refresh materialized views:
You must use a Unity Catalog-enabled pro or serverless SQL warehouse.
To refresh a materialized view, you must be in the workspace that created it.
Your workspace must be in a region that supports serverless SQL warehouses.
You must have accepted the serverless terms of use.
To query materialized views:
You must be the owner of the materialized view, or have
SELECT
on the materialized view, along withUSE SCHEMA
andUSE CATALOG
on its parents.You must use one of the following compute resources:
SQL warehouse
Delta Live Tables interfaces
Shared access mode compute
Single-user access mode on Databricks Runtime 15.4 and above, as long as the workspace is enabled for serverless compute. See Fine-grained access control on single-user compute
Only if you are the materialized view owner: a single-user access mode compute resource that is running Databricks Runtime 15.3 and below.
To learn about other restrictions on using materialized views, see Limitations.
Create a materialized view
Databricks SQL materialized view CREATE
operations use a Databricks SQL warehouse to create and load data in the materialized view. Creating a materialized view is a synchronous operation, which means that the CREATE MATERIALIZED VIEW
command blocks until the materialized view is created and the initial data load finishes. A serverless Delta Live Tables pipeline is automatically created for every Databricks SQL materialized view. When the materialized view is refreshed the Delta Live Tables pipeline processes the refresh.
To create a materialized view, use the CREATE MATERIALIZED VIEW
statement. To submit a create statement, use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API.
For details on syntax and parameters, see CREATE MATERIALIZED VIEW.
Note
The user who creates a materialized view is the materialized view owner and needs to have the following permissions:
SELECT
privilege on the base tables referenced by the materialized view.USE CATALOG
andUSE SCHEMA
privileges on the catalog and schema containing the source tables for the materialized view.USE CATALOG
andUSE SCHEMA
privileges on the target catalog and schema for the materialized view.CREATE TABLE
andCREATE MATERIALIZED VIEW
privileges on the schema containing the materialized view.
The following example creates the materialized view mv1
from the base table base_table1
:
CREATE MATERIALIZED VIEW mv1
AS SELECT
date, sum(sales) AS sum_of_sales
FROM
table1
GROUP BY
date;
Load data from external systems
Databricks recommends loading external data using Lakehouse Federation for supported data sources. For information on loading data from sources not supported by Lakehouse Federation, see Data format options.
Refresh a materialized view
The REFRESH
operation refreshes the materialized view to reflect the latest changes to the base table. The operation is synchronous by default, meaning that the command blocks until the refresh operation is complete. To refresh a materialized view, use the REFRESH MATERIALIZED VIEW
statement. See REFRESH (MATERIALIZED VIEW or STREAMING TABLE) for details on the SQL syntax and parameters for this command. To learn more about the types of materialized views that can be incrementally refreshed, see Refresh operations for materialized views.
To submit a refresh statement, use the SQL editor in the Databricks UI, a notebook attached to a SQL warehouse, the Databricks SQL CLI, or the Databricks SQL API.
Only the owner can REFRESH
the materialized view.
The following example refreshes the mv1
materialized view:
REFRESH MATERIALIZED VIEW mv1;
How are Databricks SQL materialized views refreshed?
Materialized views automatically create and use serverless Delta Live Tables pipelines to process refresh operations. The refresh is managed by the Delta Live Tables pipeline and the update is monitored by the Databricks SQL warehouse used to create the materialized view.
Delta Live Tables pipelines use either a continuous or triggered execution mode. Materialized views can be updated in either execution mode. To avoid unnecessary processing when operating in continuous execution mode, pipelines automatically monitor dependent Delta tables and perform an update only when the contents of those dependent tables have changed. See What is a Delta Live Tables pipeline?.
Note
The Delta Live Tables runtime cannot detect changes in non-Delta data sources. The table is still updated regularly but with a higher default trigger interval to prevent excessive recomputation from slowing down any incremental processing happening on compute.
By default, refresh operations are performed synchronously. You can also set a refresh operation to occur asynchronously. The behavior associated with each approach is as follows:
Synchronous: A synchronous refresh blocks other operations until the refresh operation is complete. This allows you to sequence refresh operations in an orchestration tool, like Jobs. To orchestrate materialized views with Jobs, use the SQL task type. See Schedule and orchestrate Workflows.
Asynchronous: An asynchronous refresh starts a background job on Delta Live Tables compute when a materialized view refresh begins, and the command returns before the data load is complete.
Some queries can be incrementally refreshed. See Refresh operations for materialized views. If an incremental refresh cannot be performed, a full refresh is performed instead.
Schedule materialized view refreshes
You can configure a Databricks SQL materialized view to refresh automatically based on a defined schedule. To set a schedule, do one of the following:
Configure the schedule with the
SCHEDULE
clause when you create the materialized viewAdd a schedule with the ALTER MATERIALIZED VIEW statement.
When a schedule is created, a new Databricks job is automatically configured to process the update.
To view the schedule, do one of the following:
Run the
DESCRIBE EXTENDED
statement from the SQL editor in the Databricks UI.Use Catalog Explorer to view the materialized view. The schedule is listed on the Overview tab, under Refresh status. See What is Catalog Explorer?.
View the status of a materialized view refresh
Note
Because a Delta Live Tables pipeline manages materialized view refreshes, there is latency incurred by the startup time for the pipeline. This time might be in the seconds to minutes, in addition to the time required to perform the refresh.
You can view the status of a materialized view refresh by viewing the pipeline that manages the materialized view in the Delta Live Tables UI or by viewing the Refresh Information returned by the DESCRIBE EXTENDED
command for the materialized view.
You can also view the refresh history of a materialized view by querying the Delta Live Tables event log. See View the refresh history for a materialized view.
View the refresh status in the Delta Live Tables UI
By default, the Delta Live Tables pipeline that manages a materialized view is not visible in the Delta Live Tables UI. To view the pipeline in the Delta Live Tables UI, you must directly access the link to the pipeline’s Pipeline details page. To access the link:
Copy and paste the link shown in the Latest Refresh row of the table returned by the
DESCRIBE EXTENDED
statement.On the lineage tab for the materialized view, click Pipelines and then click the pipeline link.
For asynchronous REFRESH
commands submitted using the SQL editor in the Databricks UI, you can view the refresh status by following the link shown in the Results panel.
Stop an active refresh
To stop an active refresh in the Delta Live Tables UI, in the Pipeline details page click Stop to stop the pipeline update. You can also stop the refresh with the Databricks CLI or the POST /api/2.0/pipelines/{pipeline_id}/stop operation in the Pipelines API.
Update the definition of a materialized view
To update the definition of a materialized view, you must first drop, then re-create the materialized view.
Drop a materialized view
Note
To submit the command to drop a materialized view, you must be the owner of that materialized view.
To drop a materialized view, use the DROP VIEW statement. To submit a DROP
statement, you can use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API. The following example drops the mv1
materialized view:
DROP MATERIALIZED VIEW mv1;
Describe a materialized view
To retrieve the columns and data types for a materialized view, use the DESCRIBE
statement. To retrieve the columns, data types, and metadata such as owner, location, creation time, and refresh status for a materialized view, use DESCRIBE EXTENDED
. To submit a DESCRIBE
statement, use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API.
Change the owner of a materialized view
You can change the owner of an materialized view if you are a both a metastore admin and a workspace admin. Materialized views automatically create and use Delta Live Tables pipelines to process changes. Use the following steps to change an materialized views owner:
On the lineage tab for the materialized view, click Pipelines and then click the pipeline link.
Click the kebab menu to the right of the pipeline name and click Permissions. This opens the permissions dialog.
Click x to the right of the current owner’s name to remove the current owner.
Start typing to filter the list of available users. Click the user who should be the new pipeline owner.
Click Save to save your changes and close the dialog.
All pipeline assets, including materialized views defined in the pipeline, are owned by the new pipeline owner. All future updates are run using the new owner’s identity.
Control access to materialized views
Materialized views support rich access controls to support data-sharing while avoiding exposing potentially private data. A materialized view owner can grant SELECT
privileges to other users. Users with SELECT
access to the materialized view do not need SELECT
access to the tables referenced by the materialized view. This access control enables data sharing while controlling access to the underlying data.
Grant privileges to a materialized view
To grant access to a materialized view, use the GRANT
statement:
GRANT
privilege_type [, privilege_type ] ...
ON <mv_name> TO principal;
The privilege_type can be:
SELECT
- the user canSELECT
the materialized view.REFRESH
- the user canREFRESH
the materialized view. Refreshes are run using the owner’s permissions.
The following example creates a materialized view and grants select and refresh privileges to a user:
CREATE MATERIALIZED VIEW <mv_name> AS SELECT * FROM <base_table>;
GRANT SELECT ON <mv_name> TO user;
GRANT REFRESH ON <mv_name> TO user;
Revoke privileges from a materialized view
To revoke access from a materialized view, use the REVOKE
statement:
REVOKE
privilege_type [, privilege_type ]
ON <name> FROM principal;
When SELECT
privileges on a base table are revoked from the materialized view owner or any other user who has been granted SELECT
privileges to the materialized view, or the base table is dropped, the materialized view owner or user granted access is still able to query the materialized view. However, the following behavior occurs:
The materialized view owner or others who have lost access to a materialized view can no longer
REFRESH
that materialized view, and the materialized view will become stale.If automated with a schedule, the next scheduled
REFRESH
fails or is not run.
The following example revokes the SELECT
privilege from mv1
:
REVOKE SELECT ON mv1 FROM user1;
Enable change data feed
Change data feed is required on the materialized views base tables, except for certain advanced use cases. To enable change data feed on a base table, set the delta.enableChangeDataFeed
table property using the following syntax:
ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
View the refresh history for a materialized view
To view the status of REFRESH
operations on a materialized view, including current and past refreshes, query the Delta Live Tables event log:
SELECT
*
FROM
event_log(TABLE(<fully-qualified-table-name>))
WHERE
event_type = "update_progress"
ORDER BY
timestamp desc;
Replace <fully-qualified-table-name>
with the fully qualified name of the materialized view, including the catalog and schema.
Determining if an incremental or full refresh is used
To optimize the performance of materialized view refreshes, Databricks uses a cost model to select the technique used for the refresh. The following table describes these techniques:
Technique |
Incremental refresh? |
Description |
---|---|---|
|
No |
The materialized view was fully recomputed |
|
Not applicable |
The materialized view was not updated because no changes to the base table were detected. |
|
Yes |
The materialized view was incrementally refreshed using the specified technique. |
To determine the technique used, query the Delta Live Tables event log where the event_type
is planning_information
:
SELECT
timestamp,
message
FROM
event_log(TABLE(<fully-qualified-table-name>))
WHERE
event_type = 'planning_information'
ORDER BY
timestamp desc;
Replace <fully-qualified-table-name>
with the fully qualified name of the materialized view, including the catalog and schema.
Limitations
For compute and workspace requirements, see Requirements.
Materialized views do not support identity columns or surrogate keys.
If a materialized view uses a sum aggregate over a
NULL
-able column and onlyNULL
values remain in that column, the materialized views resultant aggregate value is zero instead ofNULL
.You cannot read a change data feed from a materialized view.
The underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that do not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views. Because the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers. For example, suppose the definition of a materialized view includes a
COUNT(DISTINCT field_a)
clause. Even though the materialized view definition only includes the aggregateCOUNT DISTINCT
clause, the underlying files will contain a list of the actual values offield_a
.