COPY INTO
Loads data from a file location into a Delta table. This is a retriable and idempotent operation: files in the source location that have already been loaded are skipped.
Syntax
COPY INTO target_table
FROM { source |
( SELECT expression_list FROM source ) }
FILEFORMAT = data_source
[ VALIDATE [ ALL | num_rows ROWS ] ]
[ FILES = ( file_name [, ...] ) | PATTERN = regex_pattern ]
[ FORMAT_OPTIONS ( { data_source_reader_option = value } [, ...] ) ]
[ COPY_OPTIONS ( { copy_option = value } [, ...] ) ]
source
<path_as_string> [ WITH (
[ CREDENTIAL { <named_credential> | (temporary_credential_options) } ]
[ ENCRYPTION (encryption_options) ]
) ]
Parameters
target_table
Identifies an existing Delta table. The target_table must not include a temporal specification.
If the table name is provided in the form of a location, for example delta.`/path/to/table`, Unity Catalog can govern access to the locations that are being written to. You can write to an external location by:
Defining the location as an external location and having WRITE FILES permissions on that external location.
Having WRITE FILES permissions on a named storage credential that provides authorization to write to a location using: COPY INTO delta.`/some/location` WITH (CREDENTIAL <named_credential>)
See Manage external locations and storage credentials for more details.
Preview
Unity Catalog is in Public Preview. The account console UI for Unity Catalog is in Private Preview. To participate in the preview, contact your Databricks representative.
source
The file location to load the data from. Files in this location must have the format specified in FILEFORMAT. The location is provided in the form of a URI.
Access to the source location can be provided through:
A cluster instance profile.
Inline temporary credentials.
Defining the source location as an external location and having READ FILES permissions on the external location through Unity Catalog.
Using a named storage credential with READ FILES permissions that provides authorization to read from a location through Unity Catalog.
Preview
Unity Catalog is in Public Preview. The account console UI for Unity Catalog is in Private Preview. To participate in the preview, contact your Databricks representative.
You do not need to provide inline or named credentials if the path is already defined as an external location that you have permission to use. See Manage external locations and storage credentials for more details.
Note
If the source file path is a root path, add a slash (/) at the end of the file path, for example, s3://my-bucket/.
Accepted credential options are:
AWS_ACCESS_KEY, AWS_SECRET_KEY, and AWS_SESSION_TOKEN for AWS S3
AZURE_SAS_TOKEN for ADLS Gen2 and Azure Blob Storage
Accepted encryption options are:
TYPE = 'AWS_SSE_C', and MASTER_KEY for AWS S3
SELECT expression_list
Selects the specified columns or expressions from the source data before copying into the Delta table. The expressions can be anything you use with SELECT statements, including window operations. You can use aggregation expressions only for global aggregates; you cannot GROUP BY on columns with this syntax.
FILEFORMAT = data_source
The format of the source files to load. One of CSV, JSON, AVRO, ORC, PARQUET, TEXT, and BINARYFILE.
VALIDATE
The data that is to be loaded into a table is validated but not written to the table. These validations include:
Whether the data can be parsed.
Whether the schema matches that of the table or if the schema needs to be evolved.
Whether all nullability and check constraints are met.
The default is to validate all of the data that is to be loaded. You can provide a number of rows to be validated with the ROWS keyword, such as VALIDATE 15 ROWS. The COPY INTO statement returns a preview of the data of 50 rows or less when a number of less than 50 is used with the ROWS keyword.
Note
VALIDATE mode is available in Databricks Runtime 10.3 and above.
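For example, the following sketch (the table name and source path are placeholders) validates the first 10 rows of the source data without writing anything to the target table:
> COPY INTO my_table
  FROM 's3://my-bucket/csvData'
  FILEFORMAT = CSV
  VALIDATE 10 ROWS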
FILES
A list of file names to load, with length up to 1000. Cannot be specified with PATTERN.
PATTERN
A regex pattern that identifies the files to load from the source directory. Cannot be specified with FILES.
FORMAT_OPTIONS
Options to be passed to the Apache Spark data source reader for the specified format. See Format options for each file format.
COPY_OPTIONS
Options to control the operation of the COPY INTO command.
force: boolean, default false. If set to true, idempotency is disabled and files are loaded regardless of whether they have been loaded before.
mergeSchema: boolean, default false. If set to true, the schema can be evolved according to the incoming data. To evolve the schema of a table, you must have OWN permissions on the table.
Note
The mergeSchema option is available in Databricks Runtime 10.3 and above.
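As an illustration (the table and path names are placeholders), the following sketch reloads files even if they were ingested before and lets the table schema evolve with the incoming data:
> COPY INTO my_delta_table
  FROM 's3://my-bucket/csvData'
  FILEFORMAT = CSV
  FORMAT_OPTIONS('header' = 'true')
  COPY_OPTIONS('force' = 'true', 'mergeSchema' = 'true')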
Specifying temporary credentials or encryption options to access data
Note
Credential and encryption options are available in Databricks Runtime 10.2 and above.
If your cluster does not have permissions to read your source files, you can use temporary credentials to access the data. COPY INTO
supports:
Azure SAS tokens to read data from ADLS Gen2 and Azure Blob Storage. Azure Blob Storage temporary tokens are at the container level, whereas ADLS Gen2 tokens can be at the directory level in addition to container level. Databricks recommends using directory level SAS tokens as much as possible. You will need the SAS token to have “Read”, “List”, and “Permissions” permissions.
AWS STS tokens to read data from AWS S3. Your tokens should have the “s3:GetObject*”, “s3:ListBucket”, and “s3:GetBucketLocation” permissions.
Warning
Databricks recommends that you set expiration horizons for temporary credentials that are long enough to complete the load but short enough to limit misuse if the credentials are inadvertently exposed.
COPY INTO
supports loading encrypted data from AWS S3. To load encrypted data, provide the type of encryption and the key to decrypt the data.
Load data using temporary credentials
The following example loads data from S3 and ADLS Gen2 and leverages temporary credentials to provide access to the source data.
COPY INTO my_json_data
FROM 's3://my-bucket/jsonData' WITH (
CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = JSON
COPY INTO my_json_data
FROM 'abfss://container@storageAccount.dfs.core.windows.net/jsonData' WITH (
CREDENTIAL (AZURE_SAS_TOKEN = '...')
)
FILEFORMAT = JSON
Load encrypted data
The following example loads data from S3 by providing customer-provided encryption keys.
COPY INTO my_json_data
FROM 's3://my-bucket/jsonData' WITH (
ENCRYPTION (TYPE = 'AWS_SSE_C', MASTER_KEY = '...')
)
FILEFORMAT = JSON
named_credential
Optional name of the credential used to access or write to the storage location. You use this credential only if the file location is not included in an external location.
Examples
Load CSV files
The following example loads CSV files from Azure Data Lake Storage Gen2 under abfss://container@storageAccount.dfs.core.windows.net/base/path/folder1 into a Delta table at abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target.
> COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (SELECT key, index, textData, 'constant_value'
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
FORMAT_OPTIONS('header' = 'true')
-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
> COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (SELECT _c0::bigint key, _c1::int index, _c2 textData
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
Load JSON data
The following example loads JSON data from five files on AWS S3 into the Delta table called my_json_data. This table must be created before COPY INTO can be executed. If any data was already loaded from one of the files, the data is not reloaded for that file.
> COPY INTO my_json_data
FROM 's3://my-bucket/jsonData'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')
-- The second execution will not copy any data since the first command already loaded the data
> COPY INTO my_json_data
FROM 's3://my-bucket/jsonData'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')
Load Avro data
The following example loads Avro data on Google Cloud Storage using additional SQL expressions as part of the SELECT
statement.
> COPY INTO my_delta_table
FROM (SELECT to_date(dt) dt, event as measurement, quantity::double
FROM 'gs://my-bucket/avroData')
FILEFORMAT = AVRO
Load JSON data using credentials for source and target
The following example loads JSON data from a file on AWS S3 into the external Delta table called my_json_data. This table must be created before COPY INTO can be executed. The command uses one existing credential to write to the external Delta table and another to read from the S3 location.
> COPY INTO my_json_data WITH (CREDENTIAL target_credential)
FROM 's3://my-bucket/jsonData' WITH (CREDENTIAL source_credential)
FILEFORMAT = JSON
FILES = ('f.json')
Access file metadata
To learn how to access metadata for file-based data sources, see File metadata column.
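For instance, assuming your Databricks Runtime version exposes the _metadata column for file-based sources, a sketch like the following could persist the source file name alongside the data. The table and path are placeholders, and the target table needs a matching column or schema evolution enabled:
> COPY INTO my_json_data
  FROM (SELECT *, _metadata.file_name AS source_file
    FROM 's3://my-bucket/jsonData')
  FILEFORMAT = JSON
  COPY_OPTIONS('mergeSchema' = 'true')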
Format options
Generic options
The following options apply to all file formats.
Option | Description
---|---
modifiedAfter | Type: Timestamp String. An optional timestamp to ingest files that have a modification timestamp after the provided timestamp. Default value: None
modifiedBefore | Type: Timestamp String. An optional timestamp to ingest files that have a modification timestamp before the provided timestamp. Default value: None
pathGlobFilter | Type: String. A potential glob pattern to provide for choosing files. Equivalent to PATTERN. Default value: None
recursiveFileLookup | Type: Boolean. Whether to load data recursively within the base directory and skip partition inference. Default value: false
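These options are passed through FORMAT_OPTIONS like any other reader option. As a sketch (the table name, path, and timestamp are placeholders), the following restricts ingestion to files modified after a given time that match a glob pattern:
> COPY INTO my_delta_table
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FORMAT_OPTIONS('modifiedAfter' = '2022-01-01 00:00:00.000000 UTC+0', 'pathGlobFilter' = '*.json')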
JSON options
Option | Description
---|---
allowBackslashEscapingAnyCharacter | Type: Boolean. Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped. Default value: false
allowComments | Type: Boolean. Whether to allow the use of Java, C, and C++ style comments within parsed content or not. Default value: false
allowNonNumericNumbers | Type: Boolean. Whether to allow the set of not-a-number (NaN) tokens as legal floating number values. Default value: true
allowNumericLeadingZeros | Type: Boolean. Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001). Default value: false
allowSingleQuotes | Type: Boolean. Whether to allow use of single quotes (apostrophe, character ') instead of double quotes. Default value: true
allowUnquotedControlChars | Type: Boolean. Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. Default value: false
allowUnquotedFieldNames | Type: Boolean. Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification). Default value: false
badRecordsPath | Type: String. The path to store files for recording the information about bad JSON records. Default value: None
columnNameOfCorruptRecord | Type: String. The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty. Default value: _corrupt_record
dateFormat | Type: String. The format for parsing date strings. Default value: yyyy-MM-dd
dropFieldIfAllNull | Type: Boolean. Whether to ignore columns of all null values or empty arrays and structs during schema inference. Default value: false
encoding or charset | Type: String. The name of the encoding of the JSON files. See java.nio.charset.Charset for the list of options. Default value: UTF-8
inferTimestamp | Type: Boolean. Whether to try and infer timestamp strings as a TimestampType. Default value: false
lineSep | Type: String. A string between two consecutive JSON records. Default value: None, which covers \r, \r\n, and \n
locale | Type: String. A java.util.Locale identifier. Default value: US
mode | Type: String. Parser mode around handling malformed records. One of PERMISSIVE, DROPMALFORMED, or FAILFAST. Default value: PERMISSIVE
multiLine | Type: Boolean. Whether the JSON records span multiple lines. Default value: false
prefersDecimal | Type: Boolean. Whether to infer floats and doubles as DecimalType during schema inference. Default value: false
primitivesAsString | Type: Boolean. Whether to infer primitive types like numbers and booleans as StringType. Default value: false
rescuedDataColumn | Type: String. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column. Default value: None
timestampFormat | Type: String. The format for parsing timestamp strings. Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
timeZone | Type: String. The java.time.ZoneId to use when parsing timestamps and dates. Default value: None
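For example (placeholder table and path), multi-line JSON documents can be ingested by enabling multiLine, and a non-default timestamp pattern can be supplied with timestampFormat:
> COPY INTO my_json_data
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FORMAT_OPTIONS('multiLine' = 'true', 'timestampFormat' = 'yyyy-MM-dd HH:mm:ss')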
CSV options
Option | Description
---|---
badRecordsPath | Type: String. The path to store files for recording the information about bad CSV records. Default value: None
charToEscapeQuoteEscaping | Type: Char. The character used to escape the character used for escaping quotes. Default value: '\0'
columnNameOfCorruptRecord | Type: String. A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty. Default value: _corrupt_record
comment | Type: Char. Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable comment skipping. Default value: '\u0000'
dateFormat | Type: String. The format for parsing date strings. Default value: yyyy-MM-dd
emptyValue | Type: String. String representation of an empty value. Default value: ""
encoding or charset | Type: String. The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. Default value: UTF-8
enforceSchema | Type: Boolean. Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution. Default value: true
escape | Type: Char. The escape character to use when parsing the data. Default value: '\'
header | Type: Boolean. Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema. Default value: false
ignoreLeadingWhiteSpace | Type: Boolean. Whether to ignore leading whitespaces for each parsed value. Default value: false
ignoreTrailingWhiteSpace | Type: Boolean. Whether to ignore trailing whitespaces for each parsed value. Default value: false
inferSchema | Type: Boolean. Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType. Requires an additional pass over the data if set to true. Default value: false
lineSep | Type: String. A string between two consecutive CSV records. Default value: None, which covers \r, \r\n, and \n
locale | Type: String. A java.util.Locale identifier. Default value: US
maxCharsPerColumn | Type: Int. Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1, meaning unlimited. Default value: -1
maxColumns | Type: Int. The hard limit of how many columns a record can have. Default value: 20480
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. Default value: false
mode | Type: String. Parser mode around handling malformed records. One of PERMISSIVE, DROPMALFORMED, and FAILFAST. Default value: PERMISSIVE
multiLine | Type: Boolean. Whether the CSV records span multiple lines. Default value: false
nanValue | Type: String. The string representation of a not-a-number value when parsing FloatType and DoubleType columns. Default value: "NaN"
negativeInf | Type: String. The string representation of negative infinity when parsing FloatType or DoubleType columns. Default value: "-Inf"
nullValue | Type: String. String representation of a null value. Default value: ""
parserCaseSensitive (deprecated) | Type: Boolean. While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. Default value: false
positiveInf | Type: String. The string representation of positive infinity when parsing FloatType or DoubleType columns. Default value: "Inf"
quote | Type: Char. The character used for escaping values where the field delimiter is part of the value. Default value: '"'
rescuedDataColumn | Type: String. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column. Default value: None
sep or delimiter | Type: String. The separator string between columns. Default value: ","
timestampFormat | Type: String. The format for parsing timestamp strings. Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
timeZone | Type: String. The java.time.ZoneId to use when parsing timestamps and dates. Default value: None
unescapedQuoteHandling | Type: String. The strategy for handling unescaped quotes. Allowed options: STOP_AT_CLOSING_QUOTE, BACK_TO_DELIMITER, STOP_AT_DELIMITER, SKIP_VALUE, RAISE_ERROR. Default value: STOP_AT_DELIMITER
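For example (placeholder table and path), pipe-delimited files with a header row and NA used as the null marker could be loaded as follows:
> COPY INTO my_pipe_data
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/csvData'
  FILEFORMAT = CSV
  FORMAT_OPTIONS('header' = 'true', 'sep' = '|', 'nullValue' = 'NA')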
PARQUET options
Option | Description
---|---
datetimeRebaseMode | Type: String. Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED. Default value: LEGACY
int96RebaseMode | Type: String. Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED. Default value: LEGACY
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: false
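For example (placeholder table and path), the reader-side mergeSchema option reconciles differing schemas across the Parquet files, while the mergeSchema copy option described earlier lets the target table evolve to match:
> COPY INTO my_delta_table
  FROM 's3://my-bucket/parquetData'
  FILEFORMAT = PARQUET
  FORMAT_OPTIONS('mergeSchema' = 'true')
  COPY_OPTIONS('mergeSchema' = 'true')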
AVRO options
Option | Description
---|---
avroSchema | Type: String. Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible but different with the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too. Default value: None
datetimeRebaseMode | Type: String. Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED. Default value: LEGACY
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: false
BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options
Option | Description
---|---
encoding | Type: String. The name of the encoding of the TEXT files. See java.nio.charset.Charset for the list of options. Default value: UTF-8
lineSep | Type: String. A string between two consecutive TEXT records. Default value: None, which covers \r, \r\n, and \n
wholeText | Type: Boolean. Whether to read a file as a single record. Default value: false
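For example (placeholder table and path), each text file can be loaded as a single record, which is useful for ingesting whole documents. The sketch assumes the target table has a single string column named value, matching the column produced by the TEXT source:
> COPY INTO my_raw_text
  FROM 's3://my-bucket/textData'
  FILEFORMAT = TEXT
  FORMAT_OPTIONS('wholeText' = 'true')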
ORC options
Option | Description
---|---
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: false