COPY INTO

Loads data from a file location into a Delta table. This is a retriable and idempotent operation: files in the source location that have already been loaded are skipped.

Syntax

COPY INTO target_table
  FROM { source |
         ( SELECT expression_list FROM source ) }
  FILEFORMAT = data_source
  [ VALIDATE [ ALL | num_rows ROWS ] ]
  [ FILES = ( file_name [, ...] ) | PATTERN = regex_pattern ]
  [ FORMAT_OPTIONS ( { data_source_reader_option = value } [, ...] ) ]
  [ COPY_OPTIONS ( { copy_option = value } [, ...] ) ]

source
  <path_as_string> [ WITH (
    [ CREDENTIAL { <named_credential> | (temporary_credential_options) } ]
    [ ENCRYPTION (encryption_options) ]
  ) ]

Parameters

  • target_table

    Identifies an existing Delta table. The target_table must not include a temporal specification.

    If the table name is provided in the form of a location, such as delta.`/path/to/table`, Unity Catalog can govern access to the locations that are being written to. You can write to an external location by:

    • Defining the location as an external location and having WRITE FILES permissions on that external location.

    • Having WRITE FILES permissions on a named storage credential that provides authorization to write to a location using: COPY INTO delta.`/some/location` WITH (CREDENTIAL <named_credential>)

    See Manage external locations and storage credentials for more details.

    Preview

    Unity Catalog is in Public Preview. The account console UI for Unity Catalog is in Private Preview. To participate in the preview, contact your Databricks representative.

  • source

    The file location to load the data from. Files in this location must have the format specified in FILEFORMAT. The location is provided in the form of a URI.

    Access to the source location can be provided through:

    • Inline temporary credentials.

    • Defining the source location as an external location and having READ FILES permissions on the external location through Unity Catalog.

    • Using a named storage credential with READ FILES permissions that provides authorization to read from a location through Unity Catalog.

    You do not need to provide inline or named credentials if the path is already defined as an external location, which you have permissions to use. See Manage external locations and storage credentials for more details.

    Note

    If the source file path is a root path, add a trailing slash (/) to the file path, for example, s3://my-bucket/.

    Accepted credential options are:

    • AWS_ACCESS_KEY, AWS_SECRET_KEY, and AWS_SESSION_TOKEN for AWS S3

    • AZURE_SAS_TOKEN for ADLS Gen2 and Azure Blob Storage

    Accepted encryption options are:

    • TYPE = 'AWS_SSE_C', and MASTER_KEY for AWS S3

  • SELECT expression_list

    Selects the specified columns or expressions from the source data before copying into the Delta table. The expressions can be anything you use with SELECT statements, including window operations. You can use aggregation expressions only for global aggregates; you cannot GROUP BY on columns with this syntax.
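
    For example, a global aggregate over hypothetical source columns (a price column in CSV files at an assumed path) can be computed during the load, with no GROUP BY clause:

```sql
COPY INTO target_table
  FROM (SELECT max(price) AS max_price, count(*) AS num_rows
          FROM 's3://my-bucket/csvData')
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true')
```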

  • FILEFORMAT = data_source

    The format of the source files to load. One of CSV, JSON, AVRO, ORC, PARQUET, TEXT, BINARYFILE.

  • VALIDATE

    The data that is to be loaded into a table is validated but not written to the table. These validations include:

    • Whether the data can be parsed.

    • Whether the schema matches that of the table or if the schema needs to be evolved.

    • Whether all nullability and check constraints are met.

    The default is to validate all of the data that is to be loaded. You can provide a number of rows to be validated with the ROWS keyword, such as VALIDATE 15 ROWS. The COPY INTO statement returns a preview of up to 50 rows of data when a number less than 50 is used with the ROWS keyword.
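
    As a sketch, assuming an existing table my_table and a hypothetical source path, the following statements validate the load without writing to the table:

```sql
-- Validate every row that would be loaded
COPY INTO my_table
  FROM 's3://my-bucket/csvData'
  FILEFORMAT = CSV
  VALIDATE ALL

-- Validate only the first 15 rows and return a preview of them
COPY INTO my_table
  FROM 's3://my-bucket/csvData'
  FILEFORMAT = CSV
  VALIDATE 15 ROWS
```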

    Note

    VALIDATE mode is available in Databricks Runtime 10.3 and above.

  • FILES

    A list of file names to load, with length up to 1000. Cannot be specified with PATTERN.

  • PATTERN

    A regex pattern that identifies the files to load from the source directory. Cannot be specified with FILES.

  • FORMAT_OPTIONS

    Options to be passed to the Apache Spark data source reader for the specified format. See Format options for each file format.

  • COPY_OPTIONS

    Options to control the operation of the COPY INTO command.

    • force: boolean, default false. If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.

    • mergeSchema: boolean, default false. If set to true, the schema can be evolved according to the incoming data. To evolve the schema of a table, you must have OWN permissions on the table.

    Note

    mergeSchema option is available in Databricks Runtime 10.3 and above.
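
    As a sketch combining both options (the table name and path are hypothetical), the following forces a reload of files that were already loaded and allows the table schema to evolve with the incoming data:

```sql
COPY INTO my_table
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  COPY_OPTIONS ('force' = 'true', 'mergeSchema' = 'true')
```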

Specifying temporary credentials or encryption options to access data

Note

Credential and encryption options are available in Databricks Runtime 10.2 and above.

If your cluster does not have permissions to read your source files, you can use temporary credentials to access the data. COPY INTO supports:

  • Azure SAS tokens to read data from ADLS Gen2 and Azure Blob Storage. Azure Blob Storage temporary tokens are at the container level, whereas ADLS Gen2 tokens can be at the directory level in addition to container level. Databricks recommends using directory level SAS tokens as much as possible. You will need the SAS token to have “Read”, “List”, and “Permissions” permissions.

  • AWS STS tokens to read data from AWS S3. Your tokens should have the “s3:GetObject*”, “s3:ListBucket”, and “s3:GetBucketLocation” permissions.

Warning

Databricks recommends that you set expiration horizons for temporary credentials that are long enough to complete the operation yet short enough to prevent misuse in case they are inadvertently exposed.

COPY INTO supports loading encrypted data from AWS S3. To load encrypted data, provide the type of encryption and the key to decrypt the data.

Load data using temporary credentials

The following example loads data from S3 and ADLS Gen2 and leverages temporary credentials to provide access to the source data.

COPY INTO my_json_data
FROM 's3://my-bucket/jsonData' WITH (
  CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = JSON

COPY INTO my_json_data
FROM 'abfss://container@storageAccount.dfs.core.windows.net/jsonData' WITH (
  CREDENTIAL (AZURE_SAS_TOKEN = '...')
)
FILEFORMAT = JSON

Load encrypted data

The following example loads data from S3 using customer-provided encryption keys.

COPY INTO my_json_data
FROM 's3://my-bucket/jsonData' WITH (
  ENCRYPTION (TYPE = 'AWS_SSE_C', MASTER_KEY = '...')
)
FILEFORMAT = JSON

  • credential_name

    Optional name of the credential used to access or write to the storage location. You use this credential only if the file location is not included in an external location.

Examples

Load CSV files

The following example loads CSV files from Azure Data Lake Storage Gen2 under abfss://container@storageAccount.dfs.core.windows.net/base/path/folder1 into a Delta table at abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target.

> COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
    FROM (SELECT key, index, textData, 'constant_value'
            FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
    FILEFORMAT = CSV
    PATTERN = 'folder1/file_[a-g].csv'
    FORMAT_OPTIONS('header' = 'true')

-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
> COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
    FROM (SELECT _c0::bigint key, _c1::int index, _c2 textData
          FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
    FILEFORMAT = CSV
    PATTERN = 'folder1/file_[a-g].csv'

Load JSON data

The following example loads JSON data from 5 files on AWS S3 into the Delta table called my_json_data. This table must be created before COPY INTO can be executed. If any data had already been loaded from one of the files, the data will not be reloaded for that file.

> COPY INTO my_json_data
    FROM 's3://my-bucket/jsonData'
    FILEFORMAT = JSON
    FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')

-- The second execution will not copy any data since the first command already loaded the data
> COPY INTO my_json_data
    FROM 's3://my-bucket/jsonData'
    FILEFORMAT = JSON
    FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')

Load Avro data

The following example loads Avro data on Google Cloud Storage using additional SQL expressions as part of the SELECT statement.

> COPY INTO my_delta_table
    FROM (SELECT to_date(dt) dt, event as measurement, quantity::double
            FROM 'gs://my-bucket/avroData')
    FILEFORMAT = AVRO

Load JSON data using credentials for source and target

The following example loads JSON data from a file on AWS S3 into the external Delta table called my_json_data. This table must be created before COPY INTO can be executed. The command uses one existing credential to write to the external Delta table and another to read from the S3 location.

> COPY INTO my_json_data WITH (CREDENTIAL target_credential)
    FROM 's3://my-bucket/jsonData' WITH (CREDENTIAL source_credential)
    FILEFORMAT = JSON
    FILES = ('f.json')

Access file metadata

To learn how to access metadata for file-based data sources, see File metadata column.

Format options

Generic options

The following options apply to all file formats.

Option

modifiedAfter

Type: Timestamp String, for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.

Default value: None

modifiedBefore

Type: Timestamp String, for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp before the provided timestamp.

Default value: None

pathGlobFilter

Type: String

A potential glob pattern to provide for choosing files. Equivalent to PATTERN in COPY INTO.

Default value: None

recursiveFileLookup

Type: Boolean

Whether to load data recursively within the base directory and skip partition inference.

Default value: false
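
As a sketch, the generic options can be combined with format-specific options inside FORMAT_OPTIONS; the table name and path below are hypothetical:

```sql
COPY INTO my_table
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FORMAT_OPTIONS (
    'modifiedAfter' = '2021-01-01 00:00:00.000000 UTC+0',
    'recursiveFileLookup' = 'true'
  )
```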

JSON options

Option

allowBackslashEscapingAnyCharacter

Type: Boolean

Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped.

Default value: false

allowComments

Type: Boolean

Whether to allow the use of Java, C, and C++ style comments ('/', '*', and '//' varieties) within parsed content or not.

Default value: false

allowNonNumericNumbers

Type: Boolean

Whether to allow the set of not-a-number (NaN) tokens as legal floating number values.

Default value: true

allowNumericLeadingZeros

Type: Boolean

Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001).

Default value: false

allowSingleQuotes

Type: Boolean

Whether to allow use of single quotes (apostrophe, character ') for quoting strings (names and String values).

Default value: true

allowUnquotedControlChars

Type: Boolean

Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not.

Default value: false

allowUnquotedFieldNames

Type: Boolean

Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification).

Default value: false

badRecordsPath

Type: String

The path to store files for recording the information about bad JSON records.

Default value: None

columnNameOfCorruptRecord

Type: String

The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty.

Default value: _corrupt_record

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd

dropFieldIfAllNull

Type: Boolean

Whether to ignore columns of all null values or empty arrays and structs during schema inference.

Default value: false

encoding or charset

Type: String

The name of the encoding of the JSON files. See java.nio.charset.Charset for the list of options. You cannot use UTF-16 and UTF-32 when multiLine is true.

Default value: UTF-8

inferTimestamp

Type: Boolean

Whether to try and infer timestamp strings as a TimestampType. When set to true, schema inference may take noticeably longer.

Default value: false

lineSep

Type: String

A string between two consecutive JSON records.

Default value: None, which covers \r, \r\n, and \n

locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the JSON.

Default value: US

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE', 'DROPMALFORMED', or 'FAILFAST'.

Default value: PERMISSIVE

multiLine

Type: Boolean

Whether the JSON records span multiple lines.

Default value: false

prefersDecimal

Type: Boolean

Whether to infer floats and doubles as DecimalType during schema inference.

Default value: false

primitivesAsString

Type: Boolean

Whether to infer primitive types like numbers and booleans as StringType.

Default value: false

rescuedDataColumn

Type: String

The name of the column to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing). This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]

timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None
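
For example (the table name and path are hypothetical), multi-line JSON records with timestamp inference could be loaded as:

```sql
COPY INTO my_json_data
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FORMAT_OPTIONS ('multiLine' = 'true', 'inferTimestamp' = 'true')
```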

CSV options

Option

badRecordsPath

Type: String

The path to store files for recording the information about bad CSV records.

Default value: None

charToEscapeQuoteEscaping

Type: Char

The character used to escape the character used for escaping quotes. For example, for the following record: [ " a\\", b ]:

  • If the character to escape the '\' is undefined, the record won’t be parsed. The parser will read characters: [a],[\],["],[,],[ ],[b] and throw an error because it cannot find a closing quote.

  • If the character to escape the '\' is defined as '\', the record will be read with 2 values: [a\] and [b].

Default value: '\0'

columnNameOfCorruptRecord

Type: String

A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty.

Default value: _corrupt_record

comment

Type: Char

Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable comment skipping.

Default value: '#'

dateFormat

Type: String

The format for parsing date strings.

Default value: yyyy-MM-dd

emptyValue

Type: String

String representation of an empty value.

Default value: ""

encoding or charset

Type: String

The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. UTF-16 and UTF-32 cannot be used when multiLine is true.

Default value: UTF-8

enforceSchema

Type: Boolean

Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution.

Default value: true

escape

Type: Char

The escape character to use when parsing the data.

Default value: '\'

header

Type: Boolean

Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema.

Default value: false

ignoreLeadingWhiteSpace

Type: Boolean

Whether to ignore leading whitespaces for each parsed value.

Default value: false

ignoreTrailingWhiteSpace

Type: Boolean

Whether to ignore trailing whitespaces for each parsed value.

Default value: false

inferSchema

Type: Boolean

Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType. Requires an additional pass over the data if set to true.

Default value: false

lineSep

Type: String

A string between two consecutive CSV records.

Default value: None, which covers \r, \r\n, and \n

locale

Type: String

A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the CSV.

Default value: US

maxCharsPerColumn

Type: Int

Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1, which means unlimited.

Default value: -1

maxColumns

Type: Int

The hard limit of how many columns a record can have.

Default value: 20480

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema.

Default value: false

mode

Type: String

Parser mode around handling malformed records. One of 'PERMISSIVE', 'DROPMALFORMED', and 'FAILFAST'.

Default value: PERMISSIVE

multiLine

Type: Boolean

Whether the CSV records span multiple lines.

Default value: false

nanValue

Type: String

The string representation of a not-a-number value when parsing FloatType and DoubleType columns.

Default value: "NaN"

negativeInf

Type: String

The string representation of negative infinity when parsing FloatType or DoubleType columns.

Default value: "-Inf"

nullValue

Type: String

String representation of a null value.

Default value: ""

parserCaseSensitive (deprecated)

Type: Boolean

While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. This option has been deprecated in favor of readerCaseSensitive.

Default value: false

positiveInf

Type: String

The string representation of positive infinity when parsing FloatType or DoubleType columns.

Default value: "Inf"

quote

Type: Char

The character used for escaping values where the field delimiter is part of the value.

Default value: '"'

rescuedDataColumn

Type: String

The name of the column to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing). This column is included by default when using Auto Loader. For more details, refer to Rescued data column.

Default value: None

sep or delimiter

Type: String

The separator string between columns.

Default value: ","

timestampFormat

Type: String

The format for parsing timestamp strings.

Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]

timeZone

Type: String

The java.time.ZoneId to use when parsing timestamps and dates.

Default value: None

unescapedQuoteHandling

Type: String

The strategy for handling unescaped quotes. Allowed options:

  • STOP_AT_CLOSING_QUOTE: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.

  • BACK_TO_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter defined by sep is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.

  • STOP_AT_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter defined by sep, or a line ending is found in the input.

  • SKIP_VALUE: If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the next delimiter is found) and the value set in nullValue will be produced instead.

  • RAISE_ERROR: If unescaped quotes are found in the input, a TextParsingException will be thrown.

Default value: STOP_AT_DELIMITER
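
For example (the table name and path are hypothetical), pipe-delimited CSV files with a header row could be loaded as:

```sql
COPY INTO my_pipe_data
  FROM 's3://my-bucket/pipeData'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('sep' = '|', 'header' = 'true')
```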

PARQUET options

Option

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED.

Default value: LEGACY

int96RebaseMode

Type: String

Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED.

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false
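
As a sketch (the table name and path are hypothetical), Parquet files with differing schemas can be loaded by merging the file schemas during inference and, separately, allowing the target table schema to evolve via COPY_OPTIONS:

```sql
COPY INTO my_parquet_data
  FROM 's3://my-bucket/parquetData'
  FILEFORMAT = PARQUET
  FORMAT_OPTIONS ('mergeSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
```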

AVRO options

Option

avroSchema

Type: String

Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible with but different from the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too.

Default value: None

datetimeRebaseMode

Type: String

Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED.

Default value: LEGACY

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file. mergeSchema for Avro does not relax data types.

Default value: false

BINARYFILE options

Binary files do not have any additional configuration options.

TEXT options

Option

encoding

Type: String

The name of the encoding of the TEXT files. See java.nio.charset.Charset for the list of options.

Default value: UTF-8

lineSep

Type: String

A string between two consecutive TEXT records.

Default value: None, which covers \r, \r\n, and \n

wholeText

Type: Boolean

Whether to read a file as a single record.

Default value: false

ORC options

Option

mergeSchema

Type: Boolean

Whether to infer the schema across multiple files and to merge the schema of each file.

Default value: false