Use materialized views in Databricks SQL
This article describes how to create and refresh materialized views in Databricks SQL to improve performance and reduce the cost of your data processing and analysis workloads.
What are materialized views?
In Databricks SQL, materialized views are Unity Catalog managed tables that physically store the results of a query. Unlike standard views, which compute results on demand, materialized views cache the results and update them as the underlying source tables change—either on a schedule or automatically.
Materialized views are well-suited for data processing workloads such as extract, transform, and load (ETL) processing. Materialized views provide a simple, declarative way to process data for compliance, corrections, aggregations, or general change data capture (CDC). They also enable easy-to-use transformations by cleaning, enriching, and denormalizing base tables. By pre-computing expensive or frequently used queries, materialized views lower query latency and resource consumption. In many cases, they can incrementally compute changes from source tables, further improving efficiency and end-user experience.
The following are common use cases for materialized views:
- Keeping a BI dashboard up to date with minimal end-user query latency.
- Reducing complex ETL orchestration with simple SQL logic.
- Building complex, layered transformations.
- Any use cases that demand consistent performance with up-to-date insights.
When you create a materialized view in a Databricks SQL warehouse, a serverless pipeline is created to process the creation of, and refreshes to, the materialized view. You can monitor the status of refresh operations in Catalog Explorer. See View materialized view details with DESCRIBE EXTENDED.
Requirements
Materialized views created in Databricks SQL are backed by a serverless DLT pipeline. Your workspace must support serverless pipelines to use this functionality.
Requirements to create or refresh materialized views:
- You must use a Unity Catalog-enabled pro or serverless SQL warehouse.
- To refresh a materialized view, you must be in the workspace that created it.
- To incrementally refresh a materialized view from Delta tables, the source tables must have row tracking enabled.
- The owner (the user who creates the materialized view) must have the following permissions:
  - SELECT privilege on the base tables referenced by the materialized view.
  - USE CATALOG and USE SCHEMA privileges on the catalog and schema containing the source tables for the materialized view.
  - USE CATALOG and USE SCHEMA privileges on the target catalog and schema for the materialized view.
  - CREATE TABLE and CREATE MATERIALIZED VIEW privileges on the schema containing the materialized view.
- To refresh a materialized view, you must have the REFRESH privilege on the materialized view.
- Your workspace must be in a region that supports serverless SQL warehouses.
- You must have accepted the serverless terms of use.
Requirements to query materialized views:
- You must be the owner of the materialized view, or have SELECT on the materialized view, along with USE SCHEMA and USE CATALOG on its parents. (An example grant sequence appears after this list.)
- You must use one of the following compute resources:
  - SQL warehouse
  - DLT interfaces
  - Standard access mode compute (formerly shared access mode)
  - Dedicated access mode (formerly single user access mode) on Databricks Runtime 15.4 and above, as long as the workspace is enabled for serverless compute. See Fine-grained access control on dedicated compute (formerly single user compute). If you are the materialized view owner, you can use a dedicated access mode compute resource running Databricks Runtime 14.3 and above.
- To learn about other restrictions on using materialized views, see Limitations.
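For illustration, the following sketch shows grants that would satisfy these query requirements. The catalog main, schema reporting, materialized view mv1, and group analysts are hypothetical names; adapt them to your own objects.
-- Hypothetical names throughout; run as a user who can grant these privileges.
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.reporting TO `analysts`;
GRANT SELECT ON TABLE main.reporting.mv1 TO `analysts`;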
Create a materialized view
Databricks SQL materialized view CREATE operations use a Databricks SQL warehouse to create and load data in the materialized view. Creating a materialized view is a synchronous operation, which means that the CREATE MATERIALIZED VIEW command blocks until the materialized view is created and the initial data load finishes. A serverless DLT pipeline is automatically created for every Databricks SQL materialized view. When the materialized view is refreshed, the DLT pipeline processes the refresh.
To create a materialized view, use the CREATE MATERIALIZED VIEW statement. To submit a create statement, use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API.
The user who creates a materialized view is the materialized view owner.
The following example creates the materialized view mv1 from the base table base_table1:
-- This query defines the materialized view:
CREATE OR REPLACE MATERIALIZED VIEW mv1
AS SELECT
date,
sum(sales) AS sum_of_sales
FROM
base_table1
GROUP BY
date;
When you create a materialized view using the CREATE OR REPLACE MATERIALIZED VIEW statement, the initial data refresh and population begins immediately. This does not consume SQL warehouse compute. Instead, serverless DLT is used for creation and subsequent refreshes.
Column comments on a base table are automatically propagated to the new materialized view on create only. To add a schedule, table constraints, or other properties, modify the materialized view definition (the SQL query).
If you run the same SQL statement again, either manually or on a schedule, it refreshes the materialized view. A refresh done in this way behaves like any other refresh. For details, see Refresh a materialized view.
To learn more about configuring a materialized view, see Configure materialized views in Databricks SQL. To learn about the complete syntax for creating a materialized view, see CREATE MATERIALIZED VIEW. To learn about loading data in different formats and from different places, see Load data with DLT.
Load data from external systems
Databricks recommends loading external data using Lakehouse Federation for supported data sources. For information on loading data from sources not supported by Lakehouse Federation, see Data format options. For general information about loading data, including examples, see Load data with DLT.
Hide sensitive data
This feature is in Public Preview.
You can use materialized views to hide sensitive data from users accessing the table. One way to do this is to create the query so it doesn't include that data in the first place. But you can also mask columns or filter rows based on the permissions of the querying user. For example, you could hide the tax_id column for users that are not in the group HumanResourcesDept. To do this, use the ROW FILTER and MASK syntax during the creation of the materialized view. For more information, see Filter sensitive table data using row filters and column masks.
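The following sketch shows one way to apply a column mask when creating a materialized view. The function mask_tax_id, the table employees, and the view employees_mv are hypothetical names used only for illustration; is_account_group_member is the built-in Databricks SQL function for group membership checks.
-- Hypothetical masking function: returns the real value only to members
-- of the HumanResourcesDept group, and a redacted placeholder otherwise.
CREATE OR REPLACE FUNCTION mask_tax_id(tax_id STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('HumanResourcesDept') THEN tax_id
  ELSE '***-**-****'
END;

-- Hypothetical materialized view that applies the mask to the tax_id column.
CREATE OR REPLACE MATERIALIZED VIEW employees_mv (
  name STRING,
  tax_id STRING MASK mask_tax_id
)
AS SELECT name, tax_id FROM employees;
A row filter can be applied in the same statement with the WITH ROW FILTER clause.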
Refresh a materialized view
Refreshing a materialized view updates the view to reflect the latest changes to the base table at the time of the refresh.
When you define a materialized view, the CREATE OR REPLACE MATERIALIZED VIEW statement is used both to create the view and to refresh it for any scheduled refreshes. You can also use the REFRESH MATERIALIZED VIEW statement to refresh the materialized view without needing to supply the query again. See REFRESH (MATERIALIZED VIEW or STREAMING TABLE) for details on the SQL syntax and parameters for this command. To learn more about the types of materialized views that can be incrementally refreshed, see Incremental refresh for materialized views.
To submit a refresh statement, use the SQL editor in the Databricks UI, a notebook attached to a SQL warehouse, the Databricks SQL CLI, or the Databricks SQL API.
The owner, and any user who has been granted the REFRESH privilege on the table, can refresh the materialized view.
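For example, the owner could grant this privilege to a group. This is a hedged sketch; the group name data_engineers is hypothetical, and mv1 is the running example view.
-- Hypothetical: allow the data_engineers group to refresh mv1
GRANT REFRESH ON TABLE mv1 TO `data_engineers`;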
The following example refreshes the mv1 materialized view:
REFRESH MATERIALIZED VIEW mv1;
The operation is synchronous by default, meaning that the command blocks until the refresh operation is complete. To refresh asynchronously, you can add the ASYNC keyword:
REFRESH MATERIALIZED VIEW mv1 ASYNC;
How are Databricks SQL materialized views refreshed?
Materialized views automatically create and use serverless DLT pipelines to process refresh operations. The refresh is managed by the DLT pipeline, and the update is monitored by the Databricks SQL warehouse used to create the materialized view. Materialized views can be updated using a DLT pipeline that runs on a schedule. Materialized views created in Databricks SQL always run in triggered mode. See Triggered vs. continuous pipeline mode.
Materialized views are refreshed using one of two methods.
- Incremental refresh - The system evaluates the view's query to identify changes that happened after the last update and merges only the new or modified data.
- Full refresh - If an incremental refresh can't be performed, the system runs the entire query and replaces the existing data in the materialized view with the new results.
The structure of the query and the type of source data determine whether incremental refresh is supported. To support incremental refresh, source data should be stored in Delta tables, with row tracking and change data feed enabled. After you create a materialized view, you can monitor its refresh behavior to verify whether it is updated incrementally or through a full refresh.
For details on refresh types, and how to optimize for incremental refreshes, see Incremental refresh for materialized views.
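As a hedged sketch of preparing a source table for incremental refresh (the table name sales_raw is hypothetical), you can enable both Delta features on the source:
-- Enable the Delta features that incremental refresh relies on
ALTER TABLE sales_raw SET TBLPROPERTIES (
  delta.enableRowTracking = true,
  delta.enableChangeDataFeed = true
);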
Asynchronous refreshes
By default, refresh operations are performed synchronously. You can also set a refresh operation to occur asynchronously by using the refresh command with the ASYNC keyword. See REFRESH (MATERIALIZED VIEW or STREAMING TABLE). The behavior associated with each approach is as follows:
- Synchronous: A synchronous refresh prevents other operations from proceeding until the refresh is complete. If the result is needed for the next step, such as when sequencing refresh operations in orchestration tools like Databricks Jobs, use a synchronous refresh. To orchestrate materialized views with a job, use the SQL task type. See Orchestration using Databricks Jobs.
- Asynchronous: An asynchronous refresh starts a background job on DLT compute when a materialized view refresh begins, allowing the command to return before the data load completes. This refresh type can save on cost because the operation doesn’t necessarily hold compute capacity in the warehouse where the command is initiated. If the refresh becomes idle and no other tasks are running, the warehouse can shut down while the refresh uses other available compute. Additionally, asynchronous refreshes support starting multiple operations in parallel.
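Because an asynchronous refresh returns before the data load completes, you can start several refreshes in parallel. For example (the view names here are hypothetical):
-- Start two refreshes without waiting for either to finish
REFRESH MATERIALIZED VIEW mv_sales ASYNC;
REFRESH MATERIALIZED VIEW mv_customers ASYNC;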
Schedule materialized view refreshes
You can configure a Databricks SQL materialized view to refresh automatically based on a defined schedule. To set a schedule, do one of the following:
- Configure the schedule with the SCHEDULE clause when you create the materialized view.
- Add a schedule with the ALTER MATERIALIZED VIEW statement.
Alternatively, you can create a task in a job that includes either the CREATE OR REPLACE MATERIALIZED VIEW or a REFRESH statement, and orchestrate it as you would any other job. See Orchestration using Databricks Jobs.
The following example creates the materialized view mv1 from the base table base_table1, and a schedule to refresh the materialized view once an hour:
CREATE OR REPLACE MATERIALIZED VIEW mv1
SCHEDULE EVERY 1 hour
AS SELECT
date,
sum(sales) AS sum_of_sales
FROM
base_table1
GROUP BY
date;
To set or change the schedule after creation, use the ALTER MATERIALIZED VIEW statement:
ALTER MATERIALIZED VIEW sales ALTER SCHEDULE EVERY 1 day;
When a schedule is created, a new Databricks job is automatically configured to process the update.
To view the schedule, do one of the following:
- Run the DESCRIBE EXTENDED statement from the SQL editor in the Databricks UI. See DESCRIBE TABLE. (An example appears after this list.)
- Use Catalog Explorer to view the materialized view. The schedule is listed on the Overview tab, under Refresh status. See What is Catalog Explorer?.
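For example, to view the schedule and other details for the mv1 example view:
DESCRIBE EXTENDED mv1;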
Even when a refresh schedule is set, you can run a manual refresh at any time if you need updated data.
Stop an active refresh
To stop an active refresh in the DLT UI, click Stop on the Pipeline details page to stop the pipeline update. You can also stop the refresh with the Databricks CLI or the POST /api/2.0/pipelines/{pipeline_id}/stop operation in the Pipelines API.
Permanently delete records from a materialized view with deletion vectors enabled
Support for the REORG statement with materialized views is in Public Preview.
- Using a REORG statement with a materialized view requires Databricks Runtime 15.4 and above.
- Although you can use the REORG statement with any materialized view, it's only required when deleting records from a materialized view with deletion vectors enabled. The command has no effect when used with a materialized view without deletion vectors enabled.
To physically delete records from the underlying storage for a materialized view with deletion vectors enabled, such as for GDPR compliance, additional steps must be taken to ensure that a VACUUM operation runs on the materialized view's data.
To physically delete records:
- Run a REORG statement against the materialized view, specifying the APPLY (PURGE) parameter. For example, REORG TABLE <materialized-view-name> APPLY (PURGE);. See REORG TABLE.
- Wait for the materialized view's data retention period to pass. The default data retention period is seven days, but it can be configured with the delta.deletedFileRetentionDuration table property. See Configure data retention for time travel queries.
- REFRESH the materialized view. See Refresh a materialized view. Within 24 hours of the REFRESH operation, DLT maintenance tasks, including the VACUUM operation that is required to ensure records are permanently deleted, are run automatically. (A sketch of the full sequence follows this list.)
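Putting these steps together, a hedged sketch using the mv1 example view (the wait in the middle is elapsed retention time, not a SQL command):
-- Step 1: rewrite the underlying files so soft-deleted records are purged
REORG TABLE mv1 APPLY (PURGE);

-- Step 2: wait for the data retention period (seven days by default) to pass

-- Step 3: refresh; the required VACUUM runs automatically within 24 hours
REFRESH MATERIALIZED VIEW mv1;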
Drop a materialized view
To submit the command to drop a materialized view, you must be the owner of that materialized view or have the MANAGE privilege on the materialized view.
To drop a materialized view, use the DROP MATERIALIZED VIEW statement. To submit a DROP statement, you can use the SQL editor in the Databricks UI, the Databricks SQL CLI, or the Databricks SQL API. The following example drops the mv1 materialized view:
DROP MATERIALIZED VIEW mv1;
You can also use the Catalog Explorer to drop a materialized view.
- Click Catalog in the sidebar.
- In the Catalog Explorer tree at the left, open the catalog and select the schema where your materialized view is located.
- Open the Tables item under the schema you selected, and click the materialized view.
- On the kebab menu, select Delete.
Understand the costs of a materialized view
Because a materialized view runs on serverless compute, outside of the compute that you set up for a notebook or job, you may wonder how to understand the costs associated with it. Materialized view usage is tracked by DBU consumption. To learn more, see What is the DBU consumption of a materialized view or streaming table?
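One way to inspect this usage is through the billing system tables. The following is a hedged sketch only; it assumes the system.billing.usage table and its usage_metadata.dlt_pipeline_id field are available in your workspace, and the pipeline ID placeholder must be replaced with the ID of the pipeline backing your materialized view.
-- Hedged sketch: daily DBUs attributed to the pipeline backing a materialized view
SELECT usage_date, sum(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_metadata.dlt_pipeline_id = '<pipeline-id>'
GROUP BY usage_date
ORDER BY usage_date;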
Enable row tracking
In order to support incremental refreshes from Delta tables, row tracking must be enabled for those source tables. If you recreate a source table, you must re-enable row tracking.
The following example shows how to enable row tracking on a table:
ALTER TABLE source_table SET TBLPROPERTIES (delta.enableRowTracking = true);
For more details, see Use row tracking for Delta tables.
Limitations
- For compute and workspace requirements, see Requirements.
- For incremental refresh requirements, see Incremental refresh for materialized views.
- Materialized views do not support identity columns or surrogate keys.
- If a materialized view uses a sum aggregate over a NULL-able column and only NULL values remain in that column, the materialized view's resulting aggregate value is zero instead of NULL.
- You cannot read a change data feed from a materialized view.
- Time travel queries are not supported on materialized views.
- The underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that does not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views. Because the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers. For example, suppose the definition of a materialized view includes a COUNT(DISTINCT field_a) clause. Even though the materialized view definition only includes the aggregate COUNT DISTINCT clause, the underlying files will contain a list of the actual values of field_a.
- You may incur some serverless compute charges, even when using these features on dedicated compute.
- If you need to use an AWS PrivateLink connection with your materialized view, contact your Databricks representative.