Create Table
Important
This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported. See CREATE TABLE.
Create Table Using
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1 [ = ] val1, key2 [ = ] val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ...)]
[AS select_statement]
Create a table using a data source. If a table with the same name already exists in the database, an exception is thrown.
IF NOT EXISTS
If a table with the same name already exists in the database, nothing will happen.
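For example, the following statement is a no-op rather than an error if a table named boxes already exists (the boxes table also appears in the examples below):
CREATE TABLE IF NOT EXISTS boxes (width INT, length INT, height INT) USING CSV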
USING data_source
The file format to use for the table. data_source must be one of TEXT, AVRO, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, or LIBSVM, or a fully-qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister.
HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and row_format using the OPTIONS clause, which is a case-insensitive string map. The option keys are FILEFORMAT, INPUTFORMAT, OUTPUTFORMAT, SERDE, FIELDDELIM, ESCAPEDELIM, MAPKEYDELIM, and LINEDELIM.
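For example, a minimal sketch of a Hive SerDe table that sets the file format and field delimiter through these option keys (the table and column names are illustrative):
CREATE TABLE hive_delimited_table (id INT, value STRING)
USING HIVE
OPTIONS (FILEFORMAT 'textfile', FIELDDELIM ',')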
OPTIONS
Table options used to optimize the behavior of the table or configure HIVE tables.
Note
This clause is not supported by Delta Lake.
PARTITIONED BY (col_name1, col_name2, ...)
Partition the created table by the specified columns. A directory is created for each partition.
CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS
Each partition in the created table will be split into a fixed number of buckets by the specified columns. This is typically used with partitioning to read and shuffle less data.
LOCATION path
The directory to store the table data. This clause automatically implies EXTERNAL.
Warning
If a schema (database) is registered in your workspace-level Hive metastore, dropping that schema using the CASCADE option causes all files in that schema location to be deleted recursively, regardless of the table type (managed or external).
If the schema is registered to a Unity Catalog metastore, the files for Unity Catalog managed tables are deleted recursively. However, the files for external tables are not deleted. You must manage those files using the cloud storage provider directly.
Therefore, to avoid accidental data loss, you should never register a schema in a Hive metastore to a location with existing data. Nor should you create new external tables in a location managed by Hive metastore schemas or containing Unity Catalog managed tables.
AS select_statement
Populate the table with input data from the SELECT statement. A table created this way cannot also specify a column list.
Create Table Using Delta (Delta Lake on Databricks)
CREATE [OR REPLACE] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [NOT NULL] [COMMENT col_comment1], ...)]
USING DELTA
[LOCATION <path-to-delta-files>]
NOT NULL
Indicate that a column value cannot be NULL. If specified, and an INSERT or UPDATE statement sets a column value to NULL, a SparkException is thrown. The default is to allow a NULL value.
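For example, a minimal sketch (table and column names are illustrative) of a Delta table whose first column rejects NULL values:
CREATE TABLE events (
  eventId BIGINT NOT NULL,
  eventType STRING
)
USING DELTA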
LOCATION <path-to-delta-files>
If you specify a LOCATION that already contains data stored in Delta Lake, Delta Lake does the following:
If you specify only the table name and location, for example:
CREATE TABLE events USING DELTA LOCATION '/mnt/delta/events'
the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the existing data. This functionality can be used to “import” data into the metastore.
If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the specification exactly matches the configuration of the existing data.
Warning
If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception that describes the discrepancy.
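For example, the following sketch registers existing Delta data while also declaring a schema; it assumes the declared columns exactly match the data already stored at the path (the column names are illustrative):
CREATE TABLE events (
  eventId STRING,
  eventType STRING,
  eventTime TIMESTAMP
)
USING DELTA
LOCATION '/mnt/delta/events'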
Examples
CREATE TABLE boxes (width INT, length INT, height INT) USING CSV
CREATE TABLE boxes
(width INT, length INT, height INT)
USING PARQUET
OPTIONS ('compression'='snappy')
CREATE TABLE rectangles
USING PARQUET
PARTITIONED BY (width)
CLUSTERED BY (length) INTO 8 BUCKETS
AS SELECT * FROM boxes
-- Create a Hive SerDe table using the CREATE TABLE USING syntax.
CREATE TABLE my_table (name STRING, age INT, hair_color STRING)
USING HIVE
OPTIONS(
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat',
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat',
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe')
PARTITIONED BY (hair_color)
TBLPROPERTIES ('status'='staging', 'owner'='andrew')
Create Table with Hive format
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION path]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
row_format:
: SERDE serde_cls [WITH SERDEPROPERTIES (key1=val1, key2=val2, ...)]
| DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]
file_format:
: TEXTFILE | SEQUENCEFILE | RCFILE | ORC | PARQUET | AVRO
| INPUTFORMAT input_fmt OUTPUTFORMAT output_fmt
Create a table using the Hive format. If a table with the same name already exists in the database, an exception will be thrown. When the table is dropped later, its data will be deleted from the file system.
Note
This command is supported only when Hive support is enabled.
EXTERNAL
The table uses the custom directory specified with LOCATION. Queries on the table access existing data previously stored in the directory. When an EXTERNAL table is dropped, its data is not deleted from the file system. This flag is implied if LOCATION is specified.
IF NOT EXISTS
If a table with the same name already exists in the database, nothing will happen.
PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)
Partition the table by the specified columns. This set of columns must be distinct from the set of non-partitioned columns. You cannot specify partitioned columns with AS select_statement.
ROW FORMAT
Use the SERDE clause to specify a custom SerDe for this table. Otherwise, use the DELIMITED clause to use the native SerDe and specify the delimiter, escape character, null character, and so on.
STORED AS file_format
Specify the file format for this table. Available formats include TEXTFILE, SEQUENCEFILE, RCFILE, ORC, PARQUET, and AVRO. Alternatively, you can specify your own input and output formats through INPUTFORMAT and OUTPUTFORMAT. Only the TEXTFILE, SEQUENCEFILE, and RCFILE formats can be used with ROW FORMAT SERDE, and only TEXTFILE can be used with ROW FORMAT DELIMITED.
LOCATION path
The directory to store the table data. This clause automatically implies EXTERNAL.
AS select_statement
Populate the table with input data from the SELECT statement. You cannot specify this with PARTITIONED BY.
Data types
Spark SQL supports the following data types:
Numeric types
ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
FloatType: Represents 4-byte single-precision floating point numbers.
DoubleType: Represents 8-byte double-precision floating point numbers.
DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
String type
StringType: Represents character string values.
Binary type
BinaryType: Represents byte sequence values.
Boolean type
BooleanType: Represents boolean values.
Datetime types
TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local time zone. The timestamp value represents an absolute point in time.
DateType: Represents values comprising values of fields year, month, and day, without a time zone.
Complex types
ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. containsNull is used to indicate if elements in an ArrayType value can have null values.
MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
StructType(fields): Represents values with the structure described by a sequence of StructFields (fields).
StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of these fields can have null values.
The following table shows the type names and aliases for each data type.
Data type | SQL name
---|---
BooleanType | BOOLEAN
ByteType | BYTE, TINYINT
ShortType | SHORT, SMALLINT
IntegerType | INT, INTEGER
LongType | LONG, BIGINT
FloatType | FLOAT
DoubleType | DOUBLE
DateType | DATE
TimestampType | TIMESTAMP
StringType | STRING
BinaryType | BINARY
DecimalType | DECIMAL
ArrayType | ARRAY<element_type>
MapType | MAP<key_type, value_type>
StructType | STRUCT<field1_name: field1_type [COMMENT field1_comment], ...>
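For illustration, the following sketch (table and column names are hypothetical) uses several of these SQL type names in a column list:
CREATE TABLE type_examples (
  id BIGINT,
  price DECIMAL(10, 2),
  is_active BOOLEAN,
  tags ARRAY<STRING>,
  attributes MAP<STRING, STRING>,
  address STRUCT<street: STRING, city: STRING, zip_code: INT>,
  created_at TIMESTAMP,
  event_date DATE
)
USING PARQUET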
Examples
CREATE TABLE my_table (name STRING, age INT)
CREATE TABLE my_table (name STRING, age INT)
COMMENT 'This table is partitioned'
PARTITIONED BY (hair_color STRING COMMENT 'This is a column comment')
TBLPROPERTIES ('status'='staging', 'owner'='andrew')
CREATE TABLE my_table (name STRING, age INT)
COMMENT 'This table specifies a custom SerDe'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
CREATE TABLE my_table (name STRING, age INT)
COMMENT 'This table uses the CSV format'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
CREATE TABLE your_table
COMMENT 'This table is created with existing data'
AS SELECT * FROM my_table
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'spark-warehouse/tables/my_existing_table'
Create Table Like
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name1 LIKE [db_name.]table_name2 [LOCATION path]
Create a managed table using the definition/metadata of an existing table or view. The created table always uses its own directory in the default warehouse location.
Note
Delta Lake does not support CREATE TABLE LIKE. Instead use CREATE TABLE AS. See AS.
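For non-Delta tables, a minimal sketch that copies the definition of the boxes table (created with the CSV data source in the examples above) into a new empty managed table:
CREATE TABLE IF NOT EXISTS boxes_copy LIKE boxes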