CREATE STREAMING TABLE

Applies to: Databricks SQL

Creates a streaming table, a Delta table with extra support for streaming or incremental data processing.

Streaming tables are only supported in Lakeflow pipelines and on Databricks SQL with Unity Catalog. Running this command on supported Databricks Runtime compute only parses the syntax. See Develop Lakeflow pipelines code with SQL.

Syntax

{ CREATE OR REFRESH STREAMING TABLE | CREATE STREAMING TABLE [ IF NOT EXISTS ] }
  table_name
  [ table_specification ]
  [ table_clauses ]
  [ {flow_clause | AS query} ]

table_specification
  ( { column_identifier column_type [column_properties] } [, ...]
    [ CONSTRAINT expectation_name EXPECT (expectation_expr)
      [ ON VIOLATION { FAIL UPDATE | DROP ROW } ] ] [, ...]
    [ , table_constraint ] [...] )

column_properties
  { NOT NULL |
    GENERATED ALWAYS AS ( expr ) |
    GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start | INCREMENT BY step ] [ ...] ) ] |
    DEFAULT default_expression |
    COMMENT column_comment |
    column_constraint |
    MASK clause } [ ... ]

table_clauses
  { PARTITIONED BY (col [, ...]) |
    CLUSTER BY clause |
    COMMENT table_comment |
    DEFAULT COLLATION UTF8_BINARY |
    TBLPROPERTIES clause |
    schedule |
    WITH { ROW FILTER clause } } [...]

flow_clause
  FLOW { { INSERT BY NAME query } |
  { AUTO CDC auto_cdc_flow_spec } |
  { REPLACE WHERE predicate BY NAME query } }

schedule
  { SCHEDULE [ REFRESH ] schedule_clause |
    TRIGGER ON UPDATE [ AT MOST EVERY trigger_interval ] }

schedule_clause
  { EVERY number { HOUR | HOURS | DAY | DAYS | WEEK | WEEKS } |
  CRON cron_string [ AT TIME ZONE timezone_id ]}

Parameters

REFRESH

If specified, refreshes the table with the latest data available from the sources defined in the query. Only new data that arrives before the query starts is processed. New data that gets added to the sources during the execution of the command is ignored until the next refresh. The refresh operation from CREATE OR REFRESH is fully declarative. If a refresh command does not specify all metadata from the original table creation statement, the unspecified metadata is deleted.
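
As an illustration of the declarative behavior described above, the following sketch re-runs CREATE OR REFRESH without the original COMMENT. The table name events_bronze and the source path are hypothetical. Based on the rule above, the comment would be expected to be removed by the second statement.

```sql
-- Initial definition: the table carries a comment.
CREATE OR REFRESH STREAMING TABLE events_bronze
  COMMENT 'Raw events ingested from cloud storage'
  AS SELECT * FROM STREAM read_files('s3://bucket/path');

-- Later run that omits COMMENT. Because the statement is declarative,
-- metadata that is not restated (here, the comment) is dropped.
CREATE OR REFRESH STREAMING TABLE events_bronze
  AS SELECT * FROM STREAM read_files('s3://bucket/path');
```

To keep metadata across refreshes, restate it in every CREATE OR REFRESH statement.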
IF NOT EXISTS

Creates the streaming table if it does not exist. If a table by this name already exists, the CREATE STREAMING TABLE statement is ignored.

You may specify at most one of IF NOT EXISTS or OR REFRESH.
table_name

The name of the table to be created. The name must not include a temporal specification or options specification. If the name is not qualified the table is created in the current schema.
table_specification

This optional clause defines the list of columns, their types, properties, descriptions, and column constraints.

If you do not define columns in the table schema you must specify AS query.
- column_identifier
  
  A unique name for the column.
  - column_type
    
    Specifies the data type of the column.
  - NOT NULL
    
    If specified the column does not accept NULL values.
  - GENERATED ALWAYS AS ( expr )
    
    When you specify this clause the value of this column is determined by the specified expr.
    
    The DEFAULT COLLATION of the table must be UTF8_BINARY.
    
    expr may be composed of literals, column identifiers within the table, and deterministic, built-in SQL functions or operators except:
    - Aggregate functions
    - Analytic window functions
    - Ranking window functions
    - Table valued generator functions
    - Columns with a collation other than UTF8_BINARY
    Also expr must not contain any subquery.
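    
    A minimal sketch of a generated column, assuming a hypothetical upstream streaming table events_bronze with id and ts columns. The expression uses only a column of the same table and a deterministic built-in cast, which is the kind of expr the rules above allow.

    ```sql
    -- event_date is derived from ts; the query does not supply it.
    CREATE OR REFRESH STREAMING TABLE events_by_day (
        id BIGINT,
        ts TIMESTAMP,
        event_date DATE GENERATED ALWAYS AS (CAST(ts AS DATE))
      )
      AS SELECT id, ts FROM STREAM(events_bronze);
    ```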
  - GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ]
    
    Applies to: Databricks SQL Databricks Runtime 10.4 LTS and above
    
    Defines an identity column. When you write to the table, and do not provide values for the identity column, it will be automatically assigned a unique and statistically increasing (or decreasing if step is negative) value. This clause is only supported for Delta tables. This clause can only be used for columns with BIGINT data type.
    
    The automatically assigned values start with start and increment by step. Assigned values are unique but are not guaranteed to be contiguous. Both parameters are optional, and the default value is 1. step cannot be 0.
    
    If the automatically assigned values are beyond the range of the identity column type, the query will fail.
    
    When ALWAYS is used, you cannot provide your own values for the identity column.
    
    The following operations are not supported:
    - PARTITIONED BY an identity column
    - UPDATE an identity column
    Note
    Declaring an identity column on a table disables concurrent transactions. Only use identity columns in use cases where concurrent writes to the target table are not required.
  - DEFAULT default_expression
    
    Applies to: Databricks SQL Databricks Runtime 11.3 LTS and above
    
    Defines a DEFAULT value for the column which is used on INSERT, UPDATE, and MERGE ... INSERT when the column is not specified.
    
    If no default is specified DEFAULT NULL is applied for nullable columns.
    
    default_expression may be composed of literals, and built-in SQL functions or operators except:
    - Aggregate functions
    - Analytic window functions
    - Ranking window functions
    - Table valued generator functions
    Also default_expression must not contain any subquery.
    
    DEFAULT is supported for CSV, JSON, PARQUET, and ORC sources.
  - COMMENT column_comment
    
    A string literal to describe the column.
  - column_constraint
    
    Adds a primary key or foreign key constraint to the column in a streaming table. Constraints are not supported for tables in the hive_metastore catalog.
  - MASK clause
    
    Adds a column mask function to anonymize sensitive data. All subsequent queries from that column receive the result of evaluating that function over the column in place of the column's original value. This can be useful for fine-grained access control purposes where the function can inspect the identity or group memberships of the invoking user to decide whether to redact the value.
  - CONSTRAINT expectation_name EXPECT (expectation_expr) [ ON VIOLATION { FAIL UPDATE | DROP ROW } ]
    
    Adds data quality expectations to the table. These data quality expectations can be tracked over time and accessed through the streaming table's event log. A FAIL UPDATE expectation causes processing to fail both when creating the table and when refreshing it. A DROP ROW expectation causes the entire row to be dropped if the expectation is not met.
    
    If you omit ON VIOLATION, the expectation uses the default warn action. Violating rows are retained and the number of violations is recorded in the event log's ExpectationMetrics object.
    
    expectation_expr may be composed of literals, column identifiers within the table, and deterministic, built-in SQL functions or operators except:
    - Aggregate functions
    - Analytic window functions
    - Ranking window functions
    - Table valued generator functions
    Also expectation_expr must not contain any subquery.
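    
    The violation actions can be combined in one table definition. The sketch below is illustrative: the table name orders_clean and the expectation names are hypothetical, and the source is the samples.tpch.orders table used elsewhere in this article. The first expectation drops failing rows. The second omits ON VIOLATION, so it falls back to the warn action, which keeps the rows and records the violations in the event log.

    ```sql
    CREATE OR REFRESH STREAMING TABLE orders_clean (
        -- Rows with a NULL order_id are dropped.
        CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
        -- No ON VIOLATION: violating rows are kept and counted.
        CONSTRAINT positive_price EXPECT (total_price > 0)
      )
      AS SELECT
        o_orderkey   AS order_id,
        o_totalprice AS total_price
      FROM STREAM(samples.tpch.orders);
    ```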
  - table_constraint
    
    Adds an informational primary key or informational foreign key constraints to a streaming table. Key constraints are not supported for tables in the hive_metastore catalog.
table_clauses

Optionally specify partitioning, comments, user defined properties, and a refresh schedule for the new table. Each sub clause may only be specified once.
- PARTITIONED BY
  
  An optional list of columns of the table to partition the table by.
  
  Note
  Liquid clustering provides a flexible, optimized solution for clustering. Consider using CLUSTER BY instead of PARTITIONED BY for streaming tables.
- CLUSTER BY
  
  An optional clause to cluster by a subset of columns. Use automatic liquid clustering with CLUSTER BY AUTO, and Databricks intelligently chooses clustering keys to optimize query performance. See Use liquid clustering for tables.
  
  Liquid clustering cannot be combined with PARTITIONED BY.
- COMMENT table_comment
  
  A STRING literal to describe the table.
- DEFAULT COLLATION UTF8_BINARY
  
  Applies to: Databricks SQL Databricks Runtime 17.1 and above
  
  Forces the default collation of the streaming table to UTF8_BINARY. This clause is mandatory if the schema in which the table is created has a default collation other than UTF8_BINARY. The default collation of the streaming table is used as the default collation within the query and for column types.
- TBLPROPERTIES
  
  Optionally sets one or more user defined properties.
  
  Use this setting to specify the Lakeflow pipelines runtime channel used to run this statement. Set the value of the pipelines.channel property to "PREVIEW" or "CURRENT". The default value is "CURRENT". For more information about Lakeflow pipelines channels, see Lakeflow pipelines runtime channels.
- schedule
  
  The schedule can either be a SCHEDULE statement or a TRIGGER statement.
  - SCHEDULE [ REFRESH ] schedule_clause
    - EVERY number { HOUR | HOURS | DAY | DAYS | WEEK | WEEKS }
      
      To schedule a refresh that occurs periodically, use EVERY syntax. If EVERY syntax is specified, the streaming table or materialized view is refreshed periodically at the specified interval based on the provided value, such as HOUR, HOURS, DAY, DAYS, WEEK, or WEEKS. The following table lists accepted integer values for number.
      
      Time unit	Integer value
      HOUR or HOURS	1 <= H <= 72
      DAY or DAYS	1 <= D <= 31
      WEEK or WEEKS	1 <= W <= 8
      
      Note
      The singular and plural forms of the included time unit are semantically equivalent.
    - CRON cron_string [ AT TIME ZONE timezone_id ]
      
      Schedules a refresh using a Quartz cron expression. Valid time_zone_values are accepted. AT TIME ZONE LOCAL is not supported.
      
      The cron expression uses six space-separated fields in the order: seconds minutes hours day-of-month month day-of-week. Use ? for either day-of-month or day-of-week to leave it unspecified.
      
      For example, SCHEDULE CRON '0 0 0 * * ?' AT TIME ZONE 'UTC' refreshes daily at midnight UTC.
      
      If AT TIME ZONE is absent, the session time zone is used. If AT TIME ZONE is absent and the session time zone is not set, an error is thrown. SCHEDULE is semantically equivalent to SCHEDULE REFRESH.
    The schedule can be provided as part of the CREATE command. Use ALTER STREAMING TABLE or run CREATE OR REFRESH command with SCHEDULE clause to alter the schedule of a streaming table after creation.
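    
    As a sketch of the CRON form with an explicit time zone, the statement below would refresh a table every day at 02:30:00 in the named zone. The table name nightly_orders and the chosen time zone are hypothetical. The six fields follow the order given above: seconds, minutes, hours, day-of-month, month, day-of-week.

    ```sql
    -- '0 30 2 * * ?' = second 0, minute 30, hour 2, every day of every month.
    CREATE OR REFRESH STREAMING TABLE nightly_orders
      SCHEDULE REFRESH CRON '0 30 2 * * ?' AT TIME ZONE 'America/Los_Angeles'
      AS SELECT * FROM STREAM(samples.tpch.orders);
    ```

    Stating AT TIME ZONE explicitly avoids depending on the session time zone.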
  - TRIGGER ON UPDATE [ AT MOST EVERY trigger_interval ]
    
    Optionally set the table to refresh when an upstream data source is updated, at most once every minute. Set a value for AT MOST EVERY to require at least a minimum time between refreshes.
    
    The upstream data sources must be either external or managed Delta tables (including materialized views or streaming tables), or managed views whose dependencies are limited to supported table types.
    
    Enabling file events can make triggers more performant, and increases some of the limits on trigger updates.
    
    The trigger_interval is an INTERVAL statement that is at least 1 minute.
    
    TRIGGER ON UPDATE has the following limitations:
    - No more than 10 upstream data sources per streaming table when using TRIGGER ON UPDATE.
    - Maximum of 1000 streaming tables or materialized views can be specified with TRIGGER ON UPDATE.
    - The AT MOST EVERY clause defaults to 1 minute, and cannot be less than 1 minute.
WITH ROW FILTER clause

Adds a row filter function to the table. All subsequent queries from that table receive a subset of the rows where the function evaluates to boolean TRUE. This can be useful for fine-grained access control purposes where the function can inspect the identity or group memberships of the invoking user to decide whether to filter certain rows.
FLOW

Beta
This feature is in Beta. Requires Databricks Runtime 17.3 and above.

Optionally defines a flow inline with the table creation. A flow is a stateful query that refreshes the contents of the table. If FLOW is not specified, you can use AS query instead. A separate REFRESH STREAMING TABLE statement lets you execute the flow. You can specify one of the following flow types:
- INSERT BY NAME
  
  Inserts data into the table by column name. The query must be a streaming query. Use the STREAM keyword to use streaming semantics to read from the source. If the read encounters a change or deletion to an existing record, an error is thrown. It is safest to read from static or append-only sources.
  Note
  FLOW INSERT BY NAME is equivalent to using AS query. The following two statements have identical behavior:
  SQL
  CREATE OR REFRESH STREAMING TABLE raw_data
    AS SELECT * FROM STREAM read_files('abfss://my_path');

  CREATE OR REFRESH STREAMING TABLE raw_data
    FLOW INSERT BY NAME SELECT * FROM STREAM read_files('abfss://my_path');
- AUTO CDC
  
  Defines an AUTO CDC flow that processes change data capture (CDC) records from a source into the table. Use AUTO CDC when the source data includes CDC semantics. See CREATE STREAMING TABLE ... FLOW AUTO CDC.
- REPLACE WHERE predicate BY NAME query
  
  Beta
  FLOW REPLACE WHERE is in Beta.
  
  Defines a REPLACE WHERE flow that recomputes and overwrites only the rows matching predicate, leaving all other rows untouched. Use REPLACE WHERE for incremental batch processing of joins and aggregations, late-arriving data, schema evolution, and backfills. BY NAME is required. See REPLACE WHERE flows for standalone streaming tables.
AS query

This clause populates the table using the data from query. This query must be a streaming query. This can be achieved by adding the STREAM keyword to any relation you want to process incrementally. When you specify a query and a table_specification together, the table schema specified in table_specification must contain all the columns returned by the query, otherwise you get an error. Any columns specified in table_specification but not returned by query return null values when queried.
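
The interaction between table_specification and AS query can be sketched as follows. The table name csv_data_with_source and the extra column source_system are hypothetical. The schema lists every column the query returns plus one that it does not, so under the rule above source_system would return NULL when queried.

```sql
CREATE OR REFRESH STREAMING TABLE csv_data_with_source (
    id int,
    ts timestamp,
    event string,
    source_system string  -- not returned by the query, so it reads as NULL
  )
  AS SELECT *
  FROM STREAM read_files(
      's3://bucket/path',
      format => 'csv',
      schema => 'id int, ts timestamp, event string');
```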

Differences between streaming tables and other tables

Streaming tables are stateful tables, designed to handle each row only once as you process a growing dataset. Because most datasets grow continuously over time, streaming tables are good for most ingestion workloads. Streaming tables are optimal for pipelines that require data freshness and low latency. Streaming tables can also be useful for massive scale transformations, as results can be incrementally calculated as new data arrives, keeping results up to date without needing to fully recompute all source data with each update. Streaming tables are designed for data sources that are append-only.

Streaming tables accept additional commands such as REFRESH, which processes the latest data available in the sources provided in the query. Changes to the provided query only get reflected on new data by calling a REFRESH, not previously processed data. To apply the changes on existing data as well, you need to execute REFRESH TABLE <table_name> FULL to perform a FULL REFRESH. Full refreshes re-process all data available in the source with the latest definition. It is not recommended to call full refreshes on sources that don't keep the entire history of the data or have short retention periods, such as Kafka, as the full refresh truncates the existing data. You may not be able to recover old data if the data is no longer available in the source.
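
The two refresh modes can be sketched like this, using the raw_data table from the examples in this article. The second statement is destructive for sources with short retention, as explained above, so treat it as an illustration rather than a routine operation.

```sql
-- Incremental refresh: processes only data that has arrived since the last refresh.
REFRESH STREAMING TABLE raw_data;

-- Full refresh: truncates the table and re-processes everything still
-- available in the source using the latest definition.
REFRESH STREAMING TABLE raw_data FULL;
```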

Row filters and column masks

Row filters let you specify a function that applies as a filter whenever a table scan fetches rows. These filters ensure that subsequent queries only return rows for which the filter predicate evaluates to true.

Column masks let you mask a column's values whenever a table scan fetches rows. All future queries involving that column will receive the result of evaluating the function over the column, replacing the column's original value.

For more information on how to use row filters and column masks, see Row filters and column masks.

Managing Row Filters and Column Masks

Row filters and column masks on streaming tables should be added, updated, or dropped through the CREATE OR REFRESH statement.

Behavior

Refresh as Definer: When the CREATE OR REFRESH or REFRESH statements refresh a streaming table, row filter functions run with the definer's rights (as the table owner). This means the table refresh uses the security context of the user who created the streaming table.
Query: While most filters run with the definer's rights, functions that check user context (such as CURRENT_USER and IS_MEMBER) are exceptions. These functions run as the invoker. This approach enforces user-specific data security and access controls based on the current user's context.

Observability

Use DESCRIBE EXTENDED, INFORMATION_SCHEMA, or the Catalog Explorer to examine the existing row filters and column masks that apply to a given streaming table. This functionality allows users to audit and review data access and protection measures on streaming tables.

Limitations

Only table owners can refresh streaming tables to get the latest data.
ALTER TABLE commands are disallowed on streaming tables. The definition and properties of the table should be altered through the CREATE OR REFRESH or ALTER STREAMING TABLE statement.
Evolving the table schema through DML commands such as INSERT INTO and MERGE is not supported.
The following commands are not supported on streaming tables:
- CREATE TABLE ... CLONE <streaming_table> (you cannot use a streaming table as the source or target of a deep or shallow clone). See Clone a table on Databricks.
- COPY INTO
- ANALYZE TABLE
- RESTORE
- TRUNCATE
- GENERATE MANIFEST
- [CREATE OR] REPLACE TABLE
OpenSharing is not supported.
Renaming the table or changing the owner is not supported.
Table constraints such as PRIMARY KEY and FOREIGN KEY are not supported for streaming tables in the hive_metastore catalog.

Examples

SQL
-- Creates a streaming table that processes files stored in the given external location with
-- schema inference and evolution.
> CREATE OR REFRESH STREAMING TABLE raw_data
  AS SELECT * FROM STREAM read_files('abfss://container@storageAccount.dfs.core.windows.net/base/path');

-- Creates a streaming table that processes files with a known schema.
> CREATE OR REFRESH STREAMING TABLE csv_data (
    id int,
    ts timestamp,
    event string
  )
  AS SELECT *
  FROM STREAM read_files(
      's3://bucket/path',
      format => 'csv',
      schema => 'id int, ts timestamp, event string');

-- Creates a streaming table with an auto-incrementing identity column.
> CREATE OR REFRESH STREAMING TABLE customers_with_id (
    id BIGINT GENERATED ALWAYS AS IDENTITY,
    name string,
    region string
  )
  AS SELECT name, region FROM STREAM(customers_bronze);

-- Creates a streaming table with liquid clustering on order_date and customer_id.
> CREATE OR REFRESH STREAMING TABLE orders_with_cluster_by
  CLUSTER BY (order_date, customer_id)
  AS SELECT
    o_orderkey   AS order_id,
    o_custkey    AS customer_id,
    o_orderdate  AS order_date,
    o_totalprice AS total_price
  FROM STREAM(samples.tpch.orders);

-- Stores the data from Kafka in an append-only streaming table.
> CREATE OR REFRESH STREAMING TABLE firehose_raw
  COMMENT 'Stores the raw data from Kafka'
  TBLPROPERTIES ('delta.appendOnly' = 'true')
  AS SELECT
    value raw_data,
    offset,
    timestamp,
    timestampType
  FROM STREAM read_kafka(bootstrapServers => 'ips', subscribe => 'topic_name');

-- Creates a streaming table that is scheduled to refresh when upstream data is updated.
-- The refresh frequency of triggered_data is at most once an hour.
> CREATE STREAMING TABLE triggered_data
  TRIGGER ON UPDATE AT MOST EVERY INTERVAL 1 hour
  AS SELECT *
  FROM STREAM source_stream_data;

-- Read data from another streaming table scheduled to run every hour.
> CREATE STREAMING TABLE firehose_bronze
  SCHEDULE EVERY 1 HOUR
  AS SELECT
    from_json(raw_data, 'schema_string') data,
    * EXCEPT (raw_data)
  FROM STREAM firehose_raw;

-- Creates a streaming table with schema evolution and data quality expectations.
-- The table creation or refresh fails if the data doesn't satisfy the expectation.
> CREATE OR REFRESH STREAMING TABLE avro_data (
    CONSTRAINT date_parsing EXPECT (to_date(dt) >= '2000-01-01') ON VIOLATION FAIL UPDATE
  )
  AS SELECT *
  FROM STREAM read_files('gs://my-bucket/avroData');

-- Sets the runtime channel to "PREVIEW"
> CREATE STREAMING TABLE st_preview
  TBLPROPERTIES(pipelines.channel = "PREVIEW")
  AS SELECT * FROM STREAM sales;

-- Creates a streaming table with a column constraint
> CREATE OR REFRESH STREAMING TABLE csv_data (
    id int PRIMARY KEY,
    ts timestamp,
    event string
  )
  AS SELECT *
  FROM STREAM read_files(
      's3://bucket/path',
      format => 'csv',
      schema => 'id int, ts timestamp, event string');

-- Creates a streaming table with a table constraint
> CREATE OR REFRESH STREAMING TABLE csv_data (
    id int,
    ts timestamp,
    event string,
    CONSTRAINT pk_id PRIMARY KEY (id)
  )
  AS SELECT *
  FROM STREAM read_files(
      's3://bucket/path',
      format => 'csv',
      schema => 'id int, ts timestamp, event string');

-- Creates a streaming table with a row filter and a column mask
> CREATE OR REFRESH STREAMING TABLE masked_csv_data (
    id int,
    name string,
    region string,
    ssn string MASK catalog.schema.ssn_mask_fn
  )
  WITH ROW FILTER catalog.schema.us_filter_fn ON (region)
  AS SELECT *
  FROM STREAM read_files('s3://bucket/path/sensitive_data');

-- Creates a streaming table using a FLOW to append data from files
> CREATE OR REFRESH STREAMING TABLE raw_data
  FLOW INSERT BY NAME SELECT * FROM STREAM read_files('abfss://my_path');

-- Creates a streaming table using an AUTO CDC flow to apply changes from a change feed
> CREATE OR REFRESH STREAMING TABLE target
  FLOW AUTO CDC
  FROM stream(cdc_data.users)
  KEYS (userId)
  SEQUENCE BY sequenceNum
  STORED AS SCD TYPE 1;
