Insert

Inserting From Select Queries

INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] select_statement

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

part_spec:
    : (part_col_name1=val1 [, part_col_name2=val2, ...])

Insert data into a table or a partition from the result set of a select statement.

OVERWRITE
Whether to overwrite existing data in the table or the partition. If this flag is not provided, the new data is appended.

Examples

-- Creates a partitioned native parquet table
CREATE TABLE data_source_tab1 (col1 INT, p1 INT, p2 INT)
  USING PARQUET PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)
INSERT INTO data_source_tab1 PARTITION (p1 = 3, p2 = 4)
  SELECT id FROM RANGE(1, 3)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows
INSERT OVERWRITE TABLE default.data_source_tab1 PARTITION (p1 = 3, p2 = 4)
  SELECT id FROM RANGE(3, 5)

Inserting Values Into Tables

INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]

values_row:
    : (val1 [, val2, ...])

Insert data into a table or a partition from a row value list.

OVERWRITE
Whether to overwrite existing data in the table or the partition. If this flag is not provided, the new data is appended.

Examples

-- Creates a partitioned hive serde table (using the HiveQL syntax)
CREATE TABLE hive_serde_tab1 (col1 INT, p1 INT, p2 INT)
  USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)

-- Appends two rows into the partition (p1 = 3, p2 = 4)
INSERT INTO hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
  VALUES (1), (2)

-- Overwrites the partition (p1 = 3, p2 = 4) using two new rows
INSERT OVERWRITE TABLE hive_serde_tab1 PARTITION (p1 = 3, p2 = 4)
  VALUES (3), (4)

Dynamic Partition Inserts

When the partition specification part_spec is not completely provided, such an insert is called a dynamic partition insert (also known as a multi-partition insert). In part_spec, the partition column values are optional. When a value is not given, the column is called a dynamic partition column; otherwise, it is a static partition column. For example, the partition spec (p1 = 3, p2, p3) has one static partition column (p1) and two dynamic partition columns (p2 and p3).

In part_spec, the static partition keys must come before the dynamic partition keys; that is, all partition columns with assigned constant values must appear before the partition columns without one.

The partition values of dynamic partition columns are determined during execution. The dynamic partition columns must be specified last in both part_spec and the input result set (of the row value lists or the select query). They are resolved by position rather than by name, so their order must match exactly.
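For instance, reusing data_source_tab1 from the earlier example, a partition spec that mixes one static and one dynamic key must list the static key first (a sketch, not part of the original examples):

```sql
-- Valid: the static key p1 precedes the dynamic key p2; the dynamic
-- column p2 is fed by the last column of the select result, by position.
INSERT INTO data_source_tab1 PARTITION (p1 = 3, p2)
  SELECT id, id + 1 FROM RANGE(1, 3)

-- Invalid: a dynamic key may not precede a static one.
-- INSERT INTO data_source_tab1 PARTITION (p1, p2 = 4) ...
```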

Currently, the DataFrameWriter APIs do not have an interface for specifying partition values; therefore, the insertInto() API always uses the dynamic partition mode.
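As an illustration, a minimal PySpark sketch of this behavior, assuming an existing SparkSession named spark and the data_source_tab1 table created above:

```python
# Sketch only: assumes a running SparkSession `spark` and the partitioned
# table data_source_tab1 (col1 INT, p1 INT, p2 INT) from the example above.
df = spark.createDataFrame([(1, 3, 4), (2, 3, 4)], ["col1", "p1", "p2"])

# insertInto() resolves all partition columns dynamically, by position:
# the trailing columns of the DataFrame (p1, p2 here) feed the partition columns.
df.write.insertInto("data_source_tab1")

# overwrite=True issues the INSERT OVERWRITE form instead of appending.
df.write.insertInto("data_source_tab1", overwrite=True)
```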

Attention

In dynamic partition mode, the input result set can produce a large number of dynamic partitions and thus generate a large number of partition directories.

OVERWRITE

The semantics are different based on the types of target tables.

  • Hive serde tables: INSERT OVERWRITE does not delete partitions ahead of time; it overwrites only those partitions that have data written into them at runtime. This matches Apache Hive semantics.
  • Native data source tables: INSERT OVERWRITE first deletes all the partitions that match the partition specification (e.g., PARTITION(a=1, b)) and then inserts the new data. Note: since Databricks Runtime 3.2, the behavior of native data source tables can be made consistent with Hive serde tables by setting the session configuration spark.sql.sources.partitionOverwriteMode to DYNAMIC. The default mode is STATIC.
Hive serde tables
For Hive serde tables, Spark SQL also respects Hive-related configuration, including hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode.
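For example, the relevant session configurations can be set as follows (a sketch; the Hive settings apply only to Hive serde tables):

```sql
-- Make INSERT OVERWRITE on native data source tables behave like Hive serde
-- tables (overwrite only the partitions written at runtime).
SET spark.sql.sources.partitionOverwriteMode = DYNAMIC;

-- Hive serde tables additionally honor these Hive settings.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
```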

Examples

-- Creates a partitioned native parquet table
CREATE TABLE data_source_tab2 (col1 INT, p1 STRING, p2 STRING)
  USING PARQUET PARTITIONED BY (p1, p2)

-- Two partitions ('part1', 'part1') and ('part1', 'part2') are created by this dynamic insert.
-- The dynamic partition column p2 is resolved by the last column `'part' || id`
INSERT INTO data_source_tab2 PARTITION (p1 = 'part1', p2)
  SELECT id, 'part' || id FROM RANGE(1, 3)

-- A new partition ('partNew1', 'partNew2') is added by this INSERT OVERWRITE.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'partNew1', p2)
  VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, the two partitions ('part1', 'part1') and ('part1', 'part2') are dropped,
-- because both partitions are matched by (p1 = 'part1', p2).
-- Afterwards, only the two partitions ('partNew1', 'partNew2') and ('part1', 'part1') exist.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'part1', p2)
  VALUES (5, 'part1')


-- Creates and fills a partitioned hive serde table with three partitions:
-- ('part1', 'part1'), ('part1', 'part2') and ('partNew1', 'partNew2')
CREATE TABLE hive_serde_tab2 (col1 INT, p1 STRING, p2 STRING)
  USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)
INSERT INTO hive_serde_tab2 PARTITION (p1 = 'part1', p2)
  SELECT id, 'part' || id FROM RANGE(1, 3)
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'partNew1', p2)
  VALUES (3, 'partNew2')

-- After this INSERT OVERWRITE, only the partition ('part1', 'part1') is overwritten with the new value.
-- All three partitions still exist.
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'part1', p2)
  VALUES (5, 'part1')

Inserting Values Into Directory

New in version runtime-3.3.

INSERT OVERWRITE [LOCAL] DIRECTORY [directory_path]
  USING datasource [OPTIONS (key1=val1, key2=val2, ...)]
  [AS] SELECT ... FROM ...

Insert the query results of the select statement into a directory directory_path using a Spark native data source format. If the specified path exists, it is replaced with the output of the select statement.

directory_path
The path of the destination directory of the insert. The directory can also be specified in OPTIONS using the key path. If the specified path exists, it will be replaced with the output of the select_statement. If LOCAL is used, the directory is on the local file system.
USING <data source>
Specify the file format to use for this insert. The data source may be one of TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE and LIBSVM, or a fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister.
AS <select_statement>
Populate the destination directory with input data from the select statement.

Examples

INSERT OVERWRITE DIRECTORY
USING parquet
OPTIONS ('path' '/tmp/destination/path')
SELECT key, col1, col2 FROM source_table

INSERT OVERWRITE DIRECTORY '/tmp/destination/path'
USING json
SELECT 1 as a, 'c' as b

Inserting Values Into Directory with Hive Format

New in version runtime-3.3.

INSERT OVERWRITE [LOCAL] DIRECTORY directory_path
  [ROW FORMAT row_format] [STORED AS file_format]
  [AS] select_statement

Insert the query results of select_statement into a directory directory_path using Hive serde. If the specified path exists, it will be replaced with the output of the select_statement. Note: This command is supported only when Hive support is enabled.

directory_path
The path of the destination directory of the insert. If the specified path exists, it will be replaced with the output of the select_statement. If LOCAL is used, the directory is on the local file system.
ROW FORMAT
Use the SERDE clause to specify a custom SerDe for this insert. Otherwise, use the DELIMITED clause to use the native SerDe and specify the delimiter, escape character, null character, and so on.
STORED AS
Specify the file format for this insert. Available formats include TEXTFILE, SEQUENCEFILE, RCFILE, ORC, PARQUET, and AVRO. Alternatively, you can specify your own input and output formats through INPUTFORMAT and OUTPUTFORMAT. Note that only the formats TEXTFILE, SEQUENCEFILE, and RCFILE may be used with ROW FORMAT SERDE, and only TEXTFILE may be used with ROW FORMAT DELIMITED.
AS <select_statement>
Populate the destination directory with input data from the select statement.

Examples

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination/path'
STORED AS orc
SELECT * FROM source_table WHERE key < 10
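A ROW FORMAT DELIMITED variant can be sketched as follows; per the note above, DELIMITED may be combined only with TEXTFILE:

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination/path'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM source_table WHERE key < 10
```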