CREATE TABLE USING

Defines a table using a data source.

Syntax

CREATE TABLE [ IF NOT EXISTS ] table_identifier
  [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
  USING data_source
  [ OPTIONS ( key1 [ = ] val1, key2 [ = ] val2, ... ) ]
  [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
  [ CLUSTERED BY ( col_name3, col_name4, ... )
      [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
      INTO num_buckets BUCKETS ]
  [ LOCATION path ]
  [ COMMENT table_comment ]
  [ TBLPROPERTIES ( key1 [ = ] val1, key2 [ = ] val2, ... ) ]
  [ AS select_statement ]

The clauses between the USING clause and the AS SELECT clause can appear in any order. For example, you can write COMMENT table_comment after TBLPROPERTIES.

Parameters

  • table_identifier

    A table name, optionally qualified with a database name.

    Syntax: [database_name.] table_name

  • USING data_source

    The file format to use for the table. data_source must be one of TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, or LIBSVM, or a fully-qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister. When data_source is DELTA, see the additional options in Create Delta table.

    HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and row_format using the OPTIONS clause, which is a case-insensitive string map. The option keys are FILEFORMAT, INPUTFORMAT, OUTPUTFORMAT, SERDE, FIELDDELIM, ESCAPEDELIM, MAPKEYDELIM, and LINEDELIM; see the sketch after this list.

    Note

    In Databricks Runtime 8.0 and above, the USING clause is optional. If you don’t specify the USING clause, DELTA is the default format. In Databricks Runtime 7.x, when you don’t specify the USING clause, the SQL parser parses the statement using the CREATE TABLE with Hive format syntax. See the Databricks Runtime 8.0 migration guide for details.

  • PARTITIONED BY

    Partitions the table by the specified columns.

  • CLUSTERED BY

    Buckets the data in the table (and in each partition, if any) into a fixed number of buckets based on the specified columns. This clause is not supported by Delta Lake.

    Note

    Bucketing is an optimization technique that uses buckets (and bucketing columns) to determine data partitioning and avoid a data shuffle.

  • SORTED BY

    The order in which rows are stored within each bucket. The default is ascending. This clause is not supported by Delta Lake.

  • LOCATION

    Path to the directory where table data is stored, which could be a path on distributed storage.

  • OPTIONS

    Table options used to optimize the behavior of the table or configure HIVE tables; see the sketch after this list. This clause is not supported by Delta Lake.

  • COMMENT

    A string literal to describe the table.

  • TBLPROPERTIES

    A list of key-value pairs used to tag the table definition.

  • AS select_statement

    Populates the table using the data from select_statement.
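
The option keys accepted by OPTIONS depend on the data source. As a sketch (the Hive option values and the CSV keys header and sep are assumptions based on the built-in Hive and CSV sources; adjust them for your data source):

--Hive SerDe table configured through the Hive-specific option keys listed above
CREATE TABLE student_hive (id INT, name STRING)
    USING HIVE
    OPTIONS (fileFormat 'textfile', fieldDelim ',');

--CSV table configured through CSV data source options
CREATE TABLE student_csv (id INT, name STRING, age INT)
    USING CSV
    OPTIONS (header 'true', sep '|');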

Create Delta table

In addition to the standard CREATE TABLE options, Delta tables support the options described in this section.

Note

  • Available in Databricks Runtime 7.0 and above.
  • In Databricks Runtime 8.0 and above, USING DELTA is optional.

CREATE [OR REPLACE] TABLE table_identifier
  [(col_name1 col_type1 [NOT NULL] [COMMENT col_comment1], ...)]
  USING DELTA
  [LOCATION <path-to-delta-files>]

Parameters

  • table_identifier

    • [database_name.] table_name: A table name, optionally qualified with a database name.
    • delta.`<path-to-table>`: Create a table at the specified path without creating an entry in the metastore.
  • OR REPLACE

    If a table with the same name already exists, the table is replaced with the new configuration; see the sketch after this list.

    Note

    Databricks strongly recommends using REPLACE instead of dropping and re-creating tables.

  • NOT NULL

    Indicates that the column cannot contain NULL values. The default is to allow NULL values. If specified, any change to the Delta table checks these NOT NULL constraints.

    For examples, see NOT NULL constraint.

  • LOCATION <path-to-delta-files>

    If you specify a LOCATION that already contains data stored in Delta Lake, Delta Lake does the following:

    • If you specify only the table name and location, for example:

      CREATE TABLE events
        USING DELTA
        LOCATION '/mnt/delta/events'
      

      the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the existing data. This functionality can be used to “import” data into the metastore.

    • If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the specification exactly matches the configuration of the existing data.

    Warning

    If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception that describes the discrepancy.
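
A minimal sketch combining these parameters (the table name, columns, and path are placeholders):

--Replace an existing Delta table, enforcing that id can never be NULL
CREATE OR REPLACE TABLE events (
    id BIGINT NOT NULL,
    date STRING,
    eventType STRING)
USING DELTA;

--Create a Delta table at a path without creating a metastore entry
CREATE TABLE delta.`/mnt/delta/raw_events` (id BIGINT, date STRING)
USING DELTA;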

Data source interaction

A data source table acts like a pointer to the underlying data source. For example, you can create a table foo in Databricks that points to a table bar in MySQL using the JDBC data source. When you read and write table foo, you actually read and write table bar.
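
For instance, a minimal sketch of such a JDBC-backed table (the connection URL, credentials, and table names are placeholders):

CREATE TABLE foo
USING JDBC
OPTIONS (
    url 'jdbc:mysql://<host>:3306/<database>',
    dbtable 'bar',
    user '<username>',
    password '<password>'
);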

In general, CREATE TABLE creates a “pointer”, and you must make sure it points to something that exists. An exception is file-based sources such as Parquet and JSON: if you don’t specify a LOCATION, Databricks creates a default table location.

For CREATE TABLE AS SELECT, Databricks overwrites the underlying data source with the data of the input query, to make sure the created table contains exactly the same data as the input query.

Examples

--Use data source
CREATE TABLE student (id INT, name STRING, age INT) USING CSV;

--Use data from another table
CREATE TABLE student_copy USING CSV
    AS SELECT * FROM student;

--Omit the USING clause to use the default data source (DELTA in Databricks Runtime 8.0 and above; see the note above for Databricks Runtime 7.x)
CREATE TABLE student (id INT, name STRING, age INT);

--Specify table comment and properties
CREATE TABLE student (id INT, name STRING, age INT) USING CSV
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

--Specify table comment and properties with different clauses order
CREATE TABLE student (id INT, name STRING, age INT) USING CSV
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

--Create partitioned and bucketed table
CREATE TABLE student (id INT, name STRING, age INT)
    USING CSV
    PARTITIONED BY (age)
    CLUSTERED BY (id) INTO 4 BUCKETS;
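
--Variant of the previous example that also sorts rows within each bucket
--(a sketch; the PARQUET source and DESC sort order are arbitrary choices)
CREATE TABLE student_sorted (id INT, name STRING, age INT)
    USING PARQUET
    CLUSTERED BY (id) SORTED BY (age DESC) INTO 4 BUCKETS;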