Create Table

Create Table Using

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING data_source
  [OPTIONS (key1 [ = ] val1, key2 [ = ] val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ...)]
  [AS select_statement]

Create a table using a data source. If a table with the same name already exists in the database, an exception is thrown.

IF NOT EXISTS

If a table with the same name already exists in the database, nothing will happen.

USING data_source

The file format to use for the table. data_source must be one of TEXT, AVRO, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, or LIBSVM, or a fully-qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister.

Use HIVE to create a Hive SerDe table. You can specify the Hive-specific file_format and row_format using the OPTIONS clause, which is a case-insensitive string map. The option keys are FILEFORMAT, INPUTFORMAT, OUTPUTFORMAT, SERDE, FIELDDELIM, ESCAPEDELIM, MAPKEYDELIM, and LINEDELIM.

OPTIONS

Table options used to optimize the behavior of the table or configure Hive tables.

Note

This clause is not supported by Delta Lake.

PARTITIONED BY (col_name1, col_name2, ...)

Partition the created table by the specified columns. A directory is created for each partition.

CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS

Each partition in the created table will be split into a fixed number of buckets by the specified columns. This is typically used with partitioning to read and shuffle less data.

LOCATION path

The directory to store the table data. This clause automatically implies EXTERNAL.

AS select_statement

Populate the table with input data from the SELECT statement. You cannot specify a column list together with AS select_statement.

Create Table Using Delta (Delta Lake on Databricks)

CREATE [OR REPLACE] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [NOT NULL] [COMMENT col_comment1], ...)]
  USING DELTA
  [LOCATION <path-to-delta-files>]

NOT NULL

Indicates that a column value cannot be NULL. If specified and an INSERT or UPDATE (Delta Lake on Databricks) statement sets a column value to NULL, a SparkException is thrown. The default is to allow NULL values.
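
A minimal sketch of a Delta table declaring a NOT NULL column (the table and column names here are illustrative, not from the examples below):

CREATE TABLE events
  (event_id BIGINT NOT NULL COMMENT 'Required key',
   event_time TIMESTAMP)
  USING DELTA

An INSERT or UPDATE that sets event_id to NULL would then fail with a SparkException, as described above.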

LOCATION <path-to-delta-files>

If you specify a LOCATION that already contains data stored in Delta Lake, Delta Lake does the following:

  • If you specify only the table name and location, for example:

    CREATE TABLE events
      USING DELTA
      LOCATION '/mnt/delta/events'
    

    the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the existing data. This functionality can be used to “import” data into the metastore.

  • If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the specification exactly matches the configuration of the existing data.

Warning

If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception that describes the discrepancy.

Examples

CREATE TABLE boxes (width INT, length INT, height INT) USING CSV

CREATE TABLE boxes
  (width INT, length INT, height INT)
  USING PARQUET
  OPTIONS ('compression'='snappy')

CREATE TABLE rectangles
  USING PARQUET
  PARTITIONED BY (width)
  CLUSTERED BY (length) INTO 8 BUCKETS
  AS SELECT * FROM boxes

-- CREATE a HIVE SerDe table using the CREATE TABLE USING syntax.
CREATE TABLE my_table (name STRING, age INT, hair_color STRING)
  USING HIVE
  OPTIONS(
      INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat',
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat',
      SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe')
  PARTITIONED BY (hair_color)
  TBLPROPERTIES ('status'='staging', 'owner'='andrew')

Create Table with Hive format

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION path]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

row_format:
  : SERDE serde_cls [WITH SERDEPROPERTIES (key1=val1, key2=val2, ...)]
  | DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
      [COLLECTION ITEMS TERMINATED BY char]
      [MAP KEYS TERMINATED BY char]
      [LINES TERMINATED BY char]
      [NULL DEFINED AS char]

file_format:
  : TEXTFILE | SEQUENCEFILE | RCFILE | ORC | PARQUET | AVRO
  | INPUTFORMAT input_fmt OUTPUTFORMAT output_fmt

Create a table using the Hive format. If a table with the same name already exists in the database, an exception will be thrown. When the table is dropped later, its data will be deleted from the file system.

Note

This command is supported only when Hive support is enabled.

EXTERNAL

The table uses the custom directory specified with LOCATION. Queries on the table access existing data previously stored in the directory. When an EXTERNAL table is dropped, its data is not deleted from the file system. This flag is implied if LOCATION is specified.

IF NOT EXISTS

If a table with the same name already exists in the database, nothing will happen.

PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)

Partition the table by the specified columns. This set of columns must be distinct from the set of non-partitioned columns. You cannot specify partitioned columns with AS select_statement.

ROW FORMAT

Use the SERDE clause to specify a custom SerDe for this table. Otherwise, use the DELIMITED clause to use the native SerDe and specify the delimiter, escape character, null character, and so on.

STORED AS file_format

Specify the file format for this table. Available formats include TEXTFILE, SEQUENCEFILE, RCFILE, ORC, PARQUET, and AVRO. Alternatively, you can specify your own input and output formats through INPUTFORMAT and OUTPUTFORMAT. Only formats TEXTFILE, SEQUENCEFILE, and RCFILE can be used with ROW FORMAT SERDE and only TEXTFILE can be used with ROW FORMAT DELIMITED.

LOCATION path

The directory to store the table data. This clause automatically implies EXTERNAL.

AS select_statement

Populate the table with input data from the select statement. You cannot specify this with PARTITIONED BY.

Data types

Spark SQL supports the following data types:

  • Numeric types
    • ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
    • ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
    • IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
    • LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
    • FloatType: Represents 4-byte single-precision floating point numbers.
    • DoubleType: Represents 8-byte double-precision floating point numbers.
    • DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
  • String type: StringType: Represents character string values.
  • Binary type: BinaryType: Represents byte sequence values.
  • Boolean type: BooleanType: Represents boolean values.
  • Datetime types
    • TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local time zone. The timestamp value represents an absolute point in time.
    • DateType: Represents values comprising values of fields year, month and day, without a time-zone.
  • Complex types
    • ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. containsNull is used to indicate if elements in an ArrayType value can have null values.
    • MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
    • StructType(fields): Represents values with the structure described by a sequence of StructField (fields).
    • StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of these fields can have null values.

The following table shows the type names and aliases for each data type.

Data type              SQL name
BooleanType            BOOLEAN
ByteType               BYTE, TINYINT
ShortType              SHORT, SMALLINT
IntegerType            INT, INTEGER
LongType               LONG, BIGINT
FloatType              FLOAT, REAL
DoubleType             DOUBLE
DateType               DATE
TimestampType          TIMESTAMP
StringType             STRING
BinaryType             BINARY
DecimalType            DECIMAL, DEC, NUMERIC
CalendarIntervalType   INTERVAL
ArrayType              ARRAY<element_type>
StructType             STRUCT<field1_name: field1_type, field2_name: field2_type, ...>
MapType                MAP<key_type, value_type>
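
The SQL names above can be used directly in column definitions. A hypothetical example combining simple and complex types (the table and column names are illustrative):

CREATE TABLE orders
  (order_id BIGINT,
   price DECIMAL(10, 2),
   tags ARRAY<STRING>,
   attributes MAP<STRING, STRING>,
   customer STRUCT<name: STRING, age: INT>)
  USING PARQUET

DECIMAL accepts optional precision and scale arguments; the complex types nest, so an ARRAY<STRUCT<...>>, for example, is also valid.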

Examples

CREATE TABLE my_table (name STRING, age INT)

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table is partitioned'
  PARTITIONED BY (hair_color STRING COMMENT 'This is a column comment')
  TBLPROPERTIES ('status'='staging', 'owner'='andrew')

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table specifies a custom SerDe'
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table uses the CSV format'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE

CREATE TABLE your_table
  COMMENT 'This table is created with existing data'
  AS SELECT * FROM my_table

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
  COMMENT 'This table is created with existing data'
  LOCATION 'spark-warehouse/tables/my_existing_table'

Create Table Like

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name1 LIKE [db_name.]table_name2 [LOCATION path]

Create a managed table using the definition/metadata of an existing table or view. The created table always uses its own directory in the default warehouse location.

Note

Delta Lake does not support CREATE TABLE LIKE. Instead, use CREATE TABLE with AS select_statement, described above.
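
For non-Delta tables, a minimal illustration using my_table from the earlier examples:

CREATE TABLE my_table_copy LIKE my_table

This creates an empty managed table my_table_copy with the same schema and format as my_table; it copies only the table definition, not the data.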