CREATE TABLE [USING]
Applies to: Databricks SQL Databricks Runtime
Defines a managed or external table, optionally using a data source.
Syntax
{ { [CREATE OR] REPLACE TABLE | CREATE [EXTERNAL] TABLE [ IF NOT EXISTS ] }
table_name
[ table_specification ]
[ USING data_source ]
[ table_clauses ]
[ AS query ] }
table_specification
( { column_identifier column_type [ column_properties ] } [, ...]
[ , table_constraint ] [...] )
column_properties
{ NOT NULL |
GENERATED ALWAYS AS ( expr ) |
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] |
DEFAULT default_expression |
COMMENT column_comment |
column_constraint |
MASK clause } [ ... ]
table_clauses
{ OPTIONS clause |
PARTITIONED BY clause |
CLUSTER BY clause |
clustered_by_clause |
LOCATION path [ WITH ( CREDENTIAL credential_name ) ] |
COMMENT table_comment |
TBLPROPERTIES clause |
WITH { ROW FILTER clause } } [...]
clustered_by_clause
{ CLUSTERED BY ( cluster_column [, ...] )
[ SORTED BY ( { sort_column [ ASC | DESC ] } [, ...] ) ]
INTO num_buckets BUCKETS }
Parameters
REPLACE
If specified, replaces the table and its content if it already exists. This clause is only supported for Delta Lake tables.
REPLACE
preserves the table history.Note
Databricks strongly recommends using
REPLACE
instead of dropping and re-creating Delta Lake tables.EXTERNAL
If specified, creates an external table. When creating an external table you must also provide a
LOCATION
clause. When an external table is dropped the files at theLOCATION
will not be dropped.IF NOT EXISTS
If specified and a table with the same name already exists, the statement is ignored.
IF NOT EXISTS
cannot coexist withREPLACE
, which meansCREATE OR REPLACE TABLE IF NOT EXISTS
is not allowed.-
The name of the table to be created. The name must not include a temporal specification. If the name is not qualified the table is created in the current schema.
Tables created in
hive_metastore
can only contain alphanumeric ASCII characters and underscores (INVALID_SCHEMA_OR_RELATION_NAME). table_specification
This optional clause defines the list of columns, their types, properties, descriptions, and column constraints.
If you do not define columns the table schema you must specify either
AS query
orLOCATION
.-
A unique name for the column.
Column identifiers of Delta Lake tables without column mapping property (
'delta.columnMapping.mode' = 'name'
) must not contain spaces or the following characters:, ; { } ( ) \n \t =
.Column identifiers of
AVRO
table must start with an underscore (_
) or a Unicode letter (including non-ASCII letters) and be followed by a combination of Unicode letters, digits, and underscores. -
Specifies the data type of the column. Not all data types supported by Databricks are supported by all data sources.
NOT NULL
If specified the column will not accept
NULL
values. This clause is only supported for Delta Lake tables.GENERATED ALWAYS AS ( expr )
When you specify this clause the value of this column is determined by the specified
expr
.expr
may be composed of literals, column identifiers within the table, and deterministic, built-in SQL functions or operators except:Table valued generator functions
Also
expr
must not contain any subquery.GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ]
Applies to: Databricks SQL Databricks Runtime 10.4 LTS and above
Defines an identity column. When you write to the table, and do not provide values for the identity column, it will be automatically assigned a unique and statistically increasing (or decreasing if
step
is negative) value. This clause is only supported for Delta Lake tables. This clause can only be used for columns with BIGINT data type.The automatically assigned values start with
start
and increment bystep
. Assigned values are unique but are not guaranteed to be contiguous. Both parameters are optional, and the default value is 1.step
cannot be0
.If the automatically assigned values are beyond the range of the identity column type, the query will fail.
When
ALWAYS
is used, you cannot provide your own values for the identity column.The following operations are not supported:
PARTITIONED BY
an identity columnUPDATE
an identity column
Note
Declaring an identity column on a Delta table disables concurrent transactions. Only use identity columns in use cases where concurrent writes to the target table are not required.
DEFAULT default_expression
Applies to: Databricks SQL Databricks Runtime 11.3 LTS and above
Defines a
DEFAULT
value for the column which is used onINSERT
,UPDATE
, andMERGE ... INSERT
when the column is not specified.If no default is specified
DEFAULT NULL
is applied for nullable columns.default_expression
may be composed of literals, and built-in SQL functions or operators except:Table valued generator functions
Also
default_expression
must not contain any subquery.DEFAULT
is supported forCSV
,JSON
,PARQUET
, andORC
sources.COMMENT column_comment
A string literal to describe the column.
-
Adds a primary key or foreign key constraint to the column in a Delta Lake table.
Constraints are not supported for tables in the
hive_metastore
catalog.To add a check constraint to a Delta Lake table use ALTER TABLE.
-
Applies to: Databricks SQL Databricks Runtime 12.2 LTS and above Unity Catalog only
Preview
This feature is in Public Preview.
Adds a column mask function to anonymize sensitive data. All future queries from that column will receive the result of evaluating that function over the column in place of the column’s original value. This can be useful for fine-grained access control purposes wherein the function can inspect the identity and/or group memberships of the invoking user in order to decide whether to redact the value.
-
Adds an informational primary key or informational foreign key constraints to the Delta Lake table.
Key constraints are not supported for tables in the
hive_metastore
catalog.To add a check constraint to a Delta Lake table use ALTER TABLE.
-
USING data_source
The file format to use for the table.
data_source
must be one of:AVRO
BINARYFILE
CSV
DELTA
JSON
ORC
PARQUET
TEXT
The following additional file formats to use for the table are supported in Databricks Runtime:
JDBC
LIBSVM
a fully-qualified class name of a custom implementation of
org.apache.spark.sql.sources.DataSourceRegister
.
If
USING
is omitted, the default isDELTA
.For any
data_source
other thanDELTA
you must also specify aLOCATION
unless the table catalog ishive_metastore
.The following applies to: Databricks Runtime
HIVE
is supported to create a Hive SerDe table in Databricks Runtime. You can specify the Hive-specificfile_format
androw_format
using theOPTIONS
clause, which is a case-insensitive string map. Theoption_keys
are:FILEFORMAT
INPUTFORMAT
OUTPUTFORMAT
SERDE
FIELDDELIM
ESCAPEDELIM
MAPKEYDELIM
LINEDELIM
table_clauses
Optionally specify location, partitioning, clustering, options, comments, and user defined properties for the new table. Each sub clause may only be specified once.
-
An optional clause to partition the table by a subset of columns.
Note
If you don’t define a Delta table, partitioning columns are placed at the end of the table, even if they are defined earlier in the column specification. Consider using
CLUSTER BY
instead ofPARTITIONED BY
for Delta tables. -
Applies to: Databricks SQL Databricks Runtime 13.3 and later
An optional clause to cluster a Delta table by a subset of columns. To cluster other tables use
clustered_by_clause
.Delta Lake liquid clustering cannot be combined with
PARTITIONED BY
. clustered_by_clause
Optionally cluster the table or each partition into a fixed number of hash buckets using a subset of the columns.
Clustering is not supported for Delta Lake tables.
CLUSTERED BY
Specifies the set of columns by which to cluster each partition, or the table if no partitioning is specified.
-
An identifier referencing a
column_identifier
in the table. If you specify more than one column there must be no duplicates. Since a clustering operates on the partition level you must not name a partition column also as a cluster column.
-
SORTED BY
Optionally maintains a sort order for rows in a bucket.
sort_column
A column to sort the bucket by. The column must not be partition column. Sort columns must be unique.
ASC or DESC
Optionally specifies whether
sort_column
is sorted in ascending (ASC
) or descending (DESC
) order. The default values isASC
.
INTO num_buckets BUCKETS
An INTEGER literal specifying the number of buckets into which each partition (or the table if no partitioning is specified) is divided.
LOCATION path [ WITH ( CREDENTIAL credential_name ) ]
An optional path to the directory where table data is stored, which could be a path on distributed storage.
path
must be a STRING literal. If you specify no location the table is considered amanaged table
and Databricks creates a default table location.Specifying a location makes the table an external table.
For tables that do not reside in the
hive_metastore
catalog, the tablepath
must be protected by an external location unless a valid storage credential is specified.You cannot create external tables in locations that overlap with the location of managed tables.
For a Delta Lake table the table configuration is inherited from the
LOCATION
if data is present. Therefore, if anyTBLPROPERTIES
,table_specification
, orPARTITIONED BY
clauses are specified for Delta Lake tables they must exactly match the Delta Lake location data.-
Sets or resets one or more user defined table options.
COMMENT table_comment
A string literal to describe the table.
-
Optionally sets one or more user defined properties.
WITH ROW FILTER clause
Applies to: Databricks SQL Databricks Runtime 12.2 LTS and above Unity Catalog only
Adds a row filter function to the table. All future queries from that table will receive subset of its rows for which the function evaluates to boolean TRUE. This can be useful for fine-grained access control purposes wherein the function can inspect the identity and/or group memberships of the invoking user in order to decide whether to filter certain rows.
-
AS query
This optional clause populates the table using the data from
query
. When you specify aquery
you must not also specify atable_specification
. The table schema is derived from the query.Note that Databricks overwrites the underlying data source with the data of the input query, to make sure the table gets created contains exactly the same data as the input query.
Examples
-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);
-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;
-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
COMMENT 'this is a comment'
TBLPROPERTIES ('foo'='bar');
-- Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
TBLPROPERTIES ('foo'='bar')
COMMENT 'this is a comment';
-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
PARTITIONED BY (age);
-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
area INT GENERATED ALWAYS AS (a * b));