COPY INTO
Applies to: Databricks SQL and Databricks Runtime
Loads data from a file location into a Delta table. This is a retriable and idempotent operation—files in the source location that have already been loaded are skipped. For examples, see Common data loading patterns with COPY INTO.
Syntax
```sql
COPY INTO target_table
  FROM { source_clause |
         ( SELECT expression_list FROM source_clause ) }
  FILEFORMAT = data_source
  [ VALIDATE [ ALL | num_rows ROWS ] ]
  [ FILES = ( file_name [, ...] ) | PATTERN = glob_pattern ]
  [ FORMAT_OPTIONS ( { data_source_reader_option = value } [, ...] ) ]
  [ COPY_OPTIONS ( { copy_option = value } [, ...] ) ]

source_clause
  source [ WITH ( [ CREDENTIAL { credential_name |
                                 (temporary_credential_options) } ]
                  [ ENCRYPTION (encryption_options) ] ) ]
```
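For example, a minimal sketch of a load might look like the following; the table name and source path are placeholders, and the path is assumed to be covered by an external location you can read from:

```sql
-- Minimal sketch: load all Parquet files under the source path into an existing Delta table.
-- `my_catalog.my_schema.sales` and the S3 path are placeholder names.
COPY INTO my_catalog.my_schema.sales
FROM 's3://my-bucket/raw/sales/'
FILEFORMAT = PARQUET
```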
Parameters
target_table

Identifies an existing Delta table. The target_table must not include a temporal specification.

If the table name is provided in the form of a location, such as delta.`/path/to/table`, Unity Catalog can govern access to the locations that are being written to. You can write to an external location by:

- Defining the location as an external location and having WRITE FILES permissions on that external location.
- Having WRITE FILES permissions on a named storage credential that provides authorization to write to a location using: COPY INTO delta.`/some/location` WITH (CREDENTIAL <named_credential>)

See Manage external locations and storage credentials for more details.
source

The file location to load the data from. Files in this location must have the format specified in FILEFORMAT. The location is provided in the form of a URI.

Access to the source location can be provided through:

- credential_name: Optional name of the credential used to access or write to the storage location. You use this credential only if the file location is not included in an external location.
- Inline temporary credentials.
- Defining the source location as an external location and having READ FILES permissions on the external location through Unity Catalog.
- Using a named storage credential with READ FILES permissions that provide authorization to read from a location through Unity Catalog.

You don’t need to provide inline or named credentials if the path is already defined as an external location that you have permissions to use. See Manage external locations and storage credentials for more details.

Note

If the source file path is a root path, add a slash (/) at the end of the file path, for example, s3://my-bucket/.

Accepted credential options are:

- AWS_ACCESS_KEY, AWS_SECRET_KEY, and AWS_SESSION_TOKEN for AWS S3
- AZURE_SAS_TOKEN for ADLS Gen2 and Azure Blob Storage

Accepted encryption options are:

- TYPE = 'AWS_SSE_C', and MASTER_KEY for AWS S3

See Use temporary credentials to load data with COPY INTO.
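A sketch of loading from S3 with inline temporary credentials and encryption options might look like the following; all key and token values, the table name, and the path are placeholders:

```sql
-- Sketch: inline temporary credentials plus encryption options for an S3 source.
-- All key and token values below are placeholders.
COPY INTO my_json_table
FROM 's3://my-bucket/jsonData' WITH (
  CREDENTIAL (
    AWS_ACCESS_KEY = '...',
    AWS_SECRET_KEY = '...',
    AWS_SESSION_TOKEN = '...'
  )
  ENCRYPTION (TYPE = 'AWS_SSE_C', MASTER_KEY = '...')
)
FILEFORMAT = JSON
```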
SELECT expression_list

Selects the specified columns or expressions from the source data before copying into the Delta table. The expressions can be anything you use with SELECT statements, including window operations. You can use aggregation expressions only for global aggregates; you cannot GROUP BY on columns with this syntax.
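As an illustrative sketch, the following statement casts and renames source columns during the load; the target table and path are placeholders, and _c0 and _c1 are assumed to be the default column names of headerless CSV files:

```sql
-- Sketch: apply expressions to source columns before loading.
-- The table name, path, and column names are placeholders.
COPY INTO my_catalog.my_schema.events
FROM (
  SELECT CAST(_c0 AS BIGINT) AS event_id,
         CAST(_c1 AS TIMESTAMP) AS event_time
  FROM 's3://my-bucket/raw/events/'
)
FILEFORMAT = CSV
```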
FILEFORMAT = data_source

The format of the source files to load. One of CSV, JSON, AVRO, ORC, PARQUET, TEXT, or BINARYFILE.

VALIDATE
Applies to: Databricks SQL and Databricks Runtime 10.3 and above

The data that is to be loaded into a table is validated but not written to the table. These validations include:

- Whether the data can be parsed.
- Whether the schema matches that of the table or if the schema needs to be evolved.
- Whether all nullability and check constraints are met.

The default is to validate all of the data that is to be loaded. You can provide a number of rows to be validated with the ROWS keyword, such as VALIDATE 15 ROWS. The COPY INTO statement returns a preview of 50 rows of data or fewer when a number less than 50 is used with the ROWS keyword.
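For instance, a dry-run sketch that validates the first rows of the source without writing anything to the table (the table name and path are placeholders):

```sql
-- Sketch: validate the first 15 rows of the source; no data is written to the table.
COPY INTO my_catalog.my_schema.sales
FROM 's3://my-bucket/raw/sales/'
FILEFORMAT = PARQUET
VALIDATE 15 ROWS
```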
FILES

A list of file names to load, with length up to 1000. Cannot be specified with PATTERN.

PATTERN

A glob pattern that identifies the files to load from the source directory. Cannot be specified with FILES.

| Pattern | Description |
| --- | --- |
| `?` | Matches any single character |
| `*` | Matches zero or more characters |
| `[abc]` | Matches a single character from character set {a,b,c}. |
| `[a-z]` | Matches a single character from the character range {a…z}. |
| `[^a]` | Matches a single character that is not from character set or range {a}. Note that the `^` character must occur immediately to the right of the opening bracket. |
| `{ab,cd}` | Matches a string from the string set {ab, cd}. |
| `{ab,c{de, fh}}` | Matches a string from the string set {ab, cde, cfh}. |
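As a sketch, the pattern below restricts the load to CSV files in matching subdirectories; the table name, path, and directory layout are placeholders:

```sql
-- Sketch: load only CSV files that match a glob pattern under the source directory.
COPY INTO my_catalog.my_schema.clicks
FROM 's3://my-bucket/raw/clicks/'
FILEFORMAT = CSV
PATTERN = '2021/*/*.csv'
```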
FORMAT_OPTIONS
Options to be passed to the Apache Spark data source reader for the specified format. See Format options for each file format.
COPY_OPTIONS

Options to control the operation of the COPY INTO command.

- force: boolean, default false. If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
- mergeSchema: boolean, default false. If set to true, the schema can be evolved according to the incoming data.
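The following sketch turns on schema evolution for the load; option values are passed as string literals, and the table name and path are placeholders:

```sql
-- Sketch: allow the target table schema to evolve to match the incoming data.
COPY INTO my_catalog.my_schema.sales
FROM 's3://my-bucket/raw/sales/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true')
```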
Access file metadata
To learn how to access metadata for file-based data sources, see File metadata column.
Format options
Generic options
The following options apply to all file formats.
| Option | Description |
| --- | --- |
| `ignoreCorruptFiles` | Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. |
| `ignoreMissingFiles` | Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned. Available in Databricks Runtime 11.0 and above. |
| `modifiedAfter` | An optional timestamp to ingest files that have a modification timestamp after the provided timestamp. Default value: None |
| `modifiedBefore` | An optional timestamp to ingest files that have a modification timestamp before the provided timestamp. Default value: None |
| `pathGlobFilter` | A potential glob pattern to provide for choosing files. Default value: None |
| `recursiveFileLookup` | Whether to load data recursively within the base directory and skip partition inference. |
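As a sketch, generic reader options such as those above are passed as string key-value pairs in FORMAT_OPTIONS; the table name and path are placeholders:

```sql
-- Sketch: pass generic reader options as string key-value pairs in FORMAT_OPTIONS.
COPY INTO my_catalog.my_schema.sales
FROM 's3://my-bucket/raw/sales/'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('recursiveFileLookup' = 'true', 'ignoreCorruptFiles' = 'true')
```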
JSON options

| Option | Description |
| --- | --- |
| `allowBackslashEscapingAnyCharacter` | Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped. |
| `allowComments` | Whether to allow the use of Java, C, and C++ style comments. |
| `allowNonNumericNumbers` | Whether to allow the set of not-a-number (`NaN`) tokens as legal floating number values. |
| `allowNumericLeadingZeros` | Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001). |
| `allowSingleQuotes` | Whether to allow use of single quotes (apostrophe, character `'`) for quoting strings. |
| `allowUnquotedControlChars` | Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. |
| `allowUnquotedFieldNames` | Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification). |
| `badRecordsPath` | The path to store files for recording the information about bad JSON records. Default value: None |
| `columnNameOfCorruptRecord` | The column for storing records that are malformed and cannot be parsed. |
| `dateFormat` | The format for parsing date strings. |
| `dropFieldIfAllNull` | Whether to ignore columns of all null values or empty arrays and structs during schema inference. |
| `encoding` or `charset` | The name of the encoding of the JSON files. |
| `inferTimestamp` | Whether to try and infer timestamp strings as a `TimestampType`. |
| `lineSep` | A string between two consecutive JSON records. Default value: None, which covers `\r`, `\r\n`, and `\n` |
| `locale` | A `java.util.Locale` identifier. |
| `mode` | Parser mode around handling malformed records. One of `PERMISSIVE`, `DROPMALFORMED`, or `FAILFAST`. |
| `multiLine` | Whether the JSON records span multiple lines. |
| `prefersDecimal` | Whether to infer floats and doubles as `DecimalType`. |
| `primitivesAsString` | Whether to infer primitive types like numbers and booleans as `StringType`. |
| `rescuedDataColumn` | Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
| `timestampFormat` | The format for parsing timestamp strings. |
| `timeZone` | The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None |
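For example, a sketch that loads multi-line JSON and keeps unparseable values in a rescued data column; the table name, path, and rescued column name are placeholders:

```sql
-- Sketch: load multi-line JSON and collect unparseable data into a rescued data column.
COPY INTO my_catalog.my_schema.events
FROM 's3://my-bucket/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('multiLine' = 'true', 'rescuedDataColumn' = '_rescued_data')
```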
CSV options

| Option | Description |
| --- | --- |
| `badRecordsPath` | The path to store files for recording the information about bad CSV records. Default value: None |
| `charToEscapeQuoteEscaping` | The character used to escape the character used for escaping quotes. |
| `columnNameOfCorruptRecord` | A column for storing records that are malformed and cannot be parsed. |
| `comment` | Defines the character that represents a line comment when found in the beginning of a line of text. |
| `dateFormat` | The format for parsing date strings. |
| `emptyValue` | String representation of an empty value. |
| `encoding` or `charset` | The name of the encoding of the CSV files. |
| `enforceSchema` | Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution. |
| `escape` | The escape character to use when parsing the data. |
| `header` | Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema. |
| `ignoreLeadingWhiteSpace` | Whether to ignore leading whitespaces for each parsed value. |
| `ignoreTrailingWhiteSpace` | Whether to ignore trailing whitespaces for each parsed value. |
| `inferSchema` | Whether to infer the data types of the parsed CSV records or to assume all columns are of `StringType`. |
| `lineSep` | A string between two consecutive CSV records. Default value: None, which covers `\r`, `\r\n`, and `\n` |
| `locale` | A `java.util.Locale` identifier. |
| `maxCharsPerColumn` | Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to `-1`, which means unlimited. |
| `maxColumns` | The hard limit of how many columns a record can have. |
| `mergeSchema` | Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. |
| `mode` | Parser mode around handling malformed records. One of `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. |
| `multiLine` | Whether the CSV records span multiple lines. |
| `nanValue` | The string representation of a not-a-number value when parsing `FloatType` and `DoubleType` columns. |
| `negativeInf` | The string representation of negative infinity when parsing `FloatType` or `DoubleType` columns. |
| `nullValue` | String representation of a null value. |
| `parserCaseSensitive` (deprecated) | While reading files, whether to align columns declared in the header with the schema case sensitively. This option has been deprecated in favor of `readerCaseSensitive`. |
| `positiveInf` | The string representation of positive infinity when parsing `FloatType` or `DoubleType` columns. |
| `quote` | The character used for escaping values where the field delimiter is part of the value. |
| `readerCaseSensitive` | Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. |
| `rescuedDataColumn` | Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
| `sep` or `delimiter` | The separator string between columns. |
| `skipRows` | The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If `header` is true, the header will be the first unskipped and uncommented row. |
| `timestampFormat` | The format for parsing timestamp strings. |
| `timeZone` | The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None |
| `unescapedQuoteHandling` | The strategy for handling unescaped quotes. |
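As a sketch, the following loads headered, pipe-delimited CSV files and infers column types; the table name and path are placeholders:

```sql
-- Sketch: load headered, pipe-delimited CSV files and infer column types.
COPY INTO my_catalog.my_schema.customers
FROM 's3://my-bucket/raw/customers/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true', 'sep' = '|')
COPY_OPTIONS ('mergeSchema' = 'true')
```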
PARQUET options

| Option | Description |
| --- | --- |
| `datetimeRebaseMode` | Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. |
| `int96RebaseMode` | Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. |
| `mergeSchema` | Whether to infer the schema across multiple files and to merge the schema of each file. |
| `readerCaseSensitive` | Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. |
| `rescuedDataColumn` | Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
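A sketch that merges differing Parquet file schemas during the read and evolves the table schema accordingly; names and paths are placeholders:

```sql
-- Sketch: merge differing Parquet file schemas and evolve the target table schema.
COPY INTO my_catalog.my_schema.sales
FROM 's3://my-bucket/raw/sales/'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
```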
AVRO options

| Option | Description |
| --- | --- |
| `avroSchema` | Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible but different from the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too. Default value: None |
| `datetimeRebaseMode` | Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. |
| `mergeSchema` | Whether to infer the schema across multiple files and to merge the schema of each file. |
| `readerCaseSensitive` | Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. |
| `rescuedDataColumn` | Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. Default value: None |
BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options

| Option | Description |
| --- | --- |
| `encoding` | The name of the encoding of the TEXT files. |
| `lineSep` | A string between two consecutive TEXT records. Default value: None, which covers `\r`, `\r\n`, and `\n` |
| `wholeText` | Whether to read a file as a single record. |
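As a sketch, the following reads each text file as a single record; the table name and path are placeholders, and the target table is assumed to have a single string column matching the TEXT reader output:

```sql
-- Sketch: load each text file as one record; the target table is assumed to have
-- a single string column matching the TEXT reader output.
COPY INTO my_catalog.my_schema.raw_docs
FROM 's3://my-bucket/raw/docs/'
FILEFORMAT = TEXT
FORMAT_OPTIONS ('wholeText' = 'true')
```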
ORC options

| Option | Description |
| --- | --- |
| `mergeSchema` | Whether to infer the schema across multiple files and to merge the schema of each file. |