COPY INTO
Loads data from a file location into a Delta table. This is a retriable and idempotent operation: files in the source location that have already been loaded are skipped.
Syntax
COPY INTO target_table
FROM { source |
( SELECT expression_list FROM source ) }
FILEFORMAT = data_source
[ VALIDATE [ ALL | num_rows ROWS ] ]
[ FILES = ( file_name [, ...] ) | PATTERN = regex_pattern ]
[ FORMAT_OPTIONS ( { data_source_reader_option = value } [, ...] ) ]
[ COPY_OPTIONS ( { copy_option = value } [, ...] ) ]
source
<path_as_string> [ WITH (
[ CREDENTIAL { <named_credential> | (temporary_credential_options) } ]
[ ENCRYPTION (encryption_options) ]
) ]
Parameters
target_table
Identifies an existing Delta table. The target_table must not include a temporal specification.
If the table name is provided in the form of a location, for example delta.`/path/to/table`, Unity Catalog can govern access to the locations that are being written to. You can write to an external location by:
Defining the location as an external location and having WRITE FILES permissions on that external location.
Having WRITE FILES permissions on a named storage credential that provides authorization to write to a location using: COPY INTO delta.`/some/location` WITH (CREDENTIAL <named_credential>)
See Manage external locations and storage credentials for more details.
Preview
Unity Catalog is in Public Preview. The account console UI for Unity Catalog is in Private Preview. To participate in the preview, contact your Databricks representative.
source
The file location to load the data from. Files in this location must have the format specified in FILEFORMAT. The location is provided in the form of a URI.
Access to the source location can be provided through:
A cluster instance profile.
Inline temporary credentials.
Defining the source location as an external location and having READ FILES permissions on the external location through Unity Catalog.
Using a named storage credential with READ FILES permissions that provides authorization to read from a location through Unity Catalog.
Preview
Unity Catalog is in Public Preview. The account console UI for Unity Catalog is in Private Preview. To participate in the preview, contact your Databricks representative.
You do not need to provide inline or named credentials if the path is already defined as an external location that you have permission to use. See Manage external locations and storage credentials for more details.
Note
If the source file path is a root path, add a slash (/) at the end of the file path, for example, s3://my-bucket/.
Accepted credential options are:
AWS_ACCESS_KEY, AWS_SECRET_KEY, and AWS_SESSION_TOKEN for AWS S3
AZURE_SAS_TOKEN for ADLS Gen2 and Azure Blob Storage
Accepted encryption options are:
TYPE = 'AWS_SSE_C', and MASTER_KEY for AWS S3
SELECT expression_list
Selects the specified columns or expressions from the source data before copying into the Delta table. The expressions can be anything you use with SELECT statements, including window operations. You can use aggregation expressions only for global aggregates; you cannot GROUP BY on columns with this syntax.
FILEFORMAT = data_source
The format of the source files to load. One of CSV, JSON, AVRO, ORC, PARQUET, TEXT, and BINARYFILE.
VALIDATE
The data that is to be loaded into a table is validated but not written to the table. These validations include:
Whether the data can be parsed.
Whether the schema matches that of the table or if the schema needs to be evolved.
Whether all nullability and check constraints are met.
The default is to validate all of the data that is to be loaded. You can provide a number of rows to be validated with the ROWS keyword, such as VALIDATE 15 ROWS. The COPY INTO statement returns a preview of the data of 50 rows or less when a number of less than 50 is used with the ROWS keyword.
Note
VALIDATE mode is available in Databricks Runtime 10.3 and above.
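For example, the following sketch (the table name and source path are placeholders) validates the first 10 rows of the source data without writing anything to the target table:
> COPY INTO my_table
  FROM 's3://my-bucket/csvData'
  FILEFORMAT = CSV
  VALIDATE 10 ROWS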
FILES
A list of file names to load, with length up to 1000. Cannot be specified with PATTERN.
PATTERN
A regex pattern that identifies the files to load from the source directory. Cannot be specified with FILES.
FORMAT_OPTIONS
Options to be passed to the Apache Spark data source reader for the specified format. See Format options for each file format.
COPY_OPTIONS
Options to control the operation of the COPY INTO command.
force: boolean, default false. If set to true, idempotency is disabled and files are loaded regardless of whether they have been loaded before.
mergeSchema: boolean, default false. If set to true, the schema can be evolved according to the incoming data. To evolve the schema of a table, you must have OWN permissions on the table.
Note
The mergeSchema option is available in Databricks Runtime 10.3 and above.
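As an illustration (the table and path names are placeholders), the following sketch reloads files even if they were ingested before and lets the table schema evolve with the incoming data:
> COPY INTO my_delta_table
  FROM 's3://my-bucket/csvData'
  FILEFORMAT = CSV
  FORMAT_OPTIONS('header' = 'true')
  COPY_OPTIONS('force' = 'true', 'mergeSchema' = 'true')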
Specifying temporary credentials or encryption options to access data
Note
Credential and encryption options are available in Databricks Runtime 10.2 and above.
If your cluster does not have permissions to read your source files, you can use temporary credentials to access the data. COPY INTO
supports:
Azure SAS tokens to read data from ADLS Gen2 and Azure Blob Storage. Azure Blob Storage temporary tokens are at the container level, whereas ADLS Gen2 tokens can be at the directory level in addition to container level. Databricks recommends using directory level SAS tokens as much as possible. You will need the SAS token to have “Read”, “List”, and “Permissions” permissions.
AWS STS tokens to read data from AWS S3. Your tokens should have the “s3:GetObject*”, “s3:ListBucket”, and “s3:GetBucketLocation” permissions.
Warning
Databricks recommends that you set expiration horizons for temporary credentials that are long enough to complete the load but short enough to limit misuse if the credentials are inadvertently exposed.
COPY INTO
supports loading encrypted data from AWS S3. To load encrypted data, provide the type of encryption and the key to decrypt the data.
Load data using temporary credentials
The following example loads data from S3 and ADLS Gen2 and leverages temporary credentials to provide access to the source data.
COPY INTO my_json_data
FROM 's3://my-bucket/jsonData' WITH (
CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = JSON
COPY INTO my_json_data
FROM 'abfss://container@storageAccount.dfs.core.windows.net/jsonData' WITH (
CREDENTIAL (AZURE_SAS_TOKEN = '...')
)
FILEFORMAT = JSON
Load encrypted data
The following example loads data from S3 by providing customer-provided encryption keys.
COPY INTO my_json_data
FROM 's3://my-bucket/jsonData' WITH (
ENCRYPTION (TYPE = 'AWS_SSE_C', MASTER_KEY = '...')
)
FILEFORMAT = JSON
named_credential
Optional name of the credential used to access or write to the storage location. You use this credential only if the file location is not included in an external location.
Examples
Load CSV files
The following example loads CSV files from Azure Data Lake Storage Gen2 under abfss://container@storageAccount.dfs.core.windows.net/base/path/folder1 into a Delta table at abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target.
> COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (SELECT key, index, textData, 'constant_value'
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
FORMAT_OPTIONS('header' = 'true')
-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
> COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (SELECT _c0::bigint key, _c1::int index, _c2 textData
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
Load JSON data
The following example loads JSON data from five files on AWS S3 into the Delta table called my_json_data. This table must be created before COPY INTO can be executed. If any data was already loaded from one of the files, the data is not reloaded for that file.
> COPY INTO my_json_data
FROM 's3://my-bucket/jsonData'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')
-- The second execution will not copy any data since the first command already loaded the data
> COPY INTO my_json_data
FROM 's3://my-bucket/jsonData'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')
Load Avro data
The following example loads Avro data on Google Cloud Storage using additional SQL expressions as part of the SELECT
statement.
> COPY INTO my_delta_table
FROM (SELECT to_date(dt) dt, event as measurement, quantity::double
FROM 'gs://my-bucket/avroData')
FILEFORMAT = AVRO
Load JSON data using credentials for source and target
The following example loads JSON data from a file on AWS S3 into the external Delta table called my_json_data. This table must be created before COPY INTO can be executed. The command uses one existing credential to write to the external Delta table and another to read from the S3 location.
> COPY INTO my_json_data WITH (CREDENTIAL target_credential)
FROM 's3://my-bucket/jsonData' WITH (CREDENTIAL source_credential)
FILEFORMAT = JSON
FILES = ('f.json')
Access file metadata
To learn how to access metadata for file-based data sources, see File metadata column.
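For instance, assuming your Databricks Runtime version exposes the _metadata column for file-based sources, a sketch like the following could persist the source file name alongside the data. The table and path are placeholders, and the target table needs a matching column or schema evolution enabled:
> COPY INTO my_json_data
  FROM (SELECT *, _metadata.file_name AS source_file
    FROM 's3://my-bucket/jsonData')
  FILEFORMAT = JSON
  COPY_OPTIONS('mergeSchema' = 'true')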
Format options
Generic options
The following options apply to all file formats.
Option | Description
---|---
modifiedAfter | Type: Timestamp String. An optional timestamp to ingest files that have a modification timestamp after the provided timestamp. Default value: None
modifiedBefore | Type: Timestamp String. An optional timestamp to ingest files that have a modification timestamp before the provided timestamp. Default value: None
pathGlobFilter | Type: String. A potential glob pattern to provide for choosing files. Equivalent to PATTERN. Default value: None
recursiveFileLookup | Type: Boolean. Whether to load data recursively within the base directory and skip partition inference. Default value: false
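These options are passed through FORMAT_OPTIONS like any other reader option. As a sketch (the table name, path, and timestamp are placeholders), the following restricts ingestion to files modified after a given time that match a glob pattern:
> COPY INTO my_delta_table
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FORMAT_OPTIONS('modifiedAfter' = '2022-01-01 00:00:00.000000 UTC+0', 'pathGlobFilter' = '*.json')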
JSON options
Option | Description
---|---
allowBackslashEscapingAnyCharacter | Type: Boolean. Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped. Default value: false
allowComments | Type: Boolean. Whether to allow the use of Java, C, and C++ style comments within parsed content or not. Default value: false
allowNonNumericNumbers | Type: Boolean. Whether to allow the set of not-a-number (NaN) tokens as legal floating number values. Default value: true
allowNumericLeadingZeros | Type: Boolean. Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001). Default value: false
allowSingleQuotes | Type: Boolean. Whether to allow use of single quotes (apostrophe, character ') instead of double quotes. Default value: true
allowUnquotedControlChars | Type: Boolean. Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. Default value: false
allowUnquotedFieldNames | Type: Boolean. Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification). Default value: false
badRecordsPath | Type: String. The path to store files for recording the information about bad JSON records. Default value: None
columnNameOfCorruptRecord | Type: String. The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty. Default value: _corrupt_record
dateFormat | Type: String. The format for parsing date strings. Default value: yyyy-MM-dd
dropFieldIfAllNull | Type: Boolean. Whether to ignore columns of all null values or empty arrays and structs during schema inference. Default value: false
encoding or charset | Type: String. The name of the encoding of the JSON files. See java.nio.charset.Charset for the list of options. Default value: UTF-8
inferTimestamp | Type: Boolean. Whether to try and infer timestamp strings as a TimestampType. Default value: false
lineSep | Type: String. A string between two consecutive JSON records. Default value: None, which covers \r, \r\n, and \n
locale | Type: String. A java.util.Locale identifier. Default value: US
mode | Type: String. Parser mode around handling malformed records. One of PERMISSIVE, DROPMALFORMED, or FAILFAST. Default value: PERMISSIVE
multiLine | Type: Boolean. Whether the JSON records span multiple lines. Default value: false
prefersDecimal | Type: Boolean. Whether to infer floats and doubles as DecimalType during schema inference. Default value: false
primitivesAsString | Type: Boolean. Whether to infer primitive types like numbers and booleans as StringType. Default value: false
rescuedDataColumn | Type: String. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column. Default value: None
timestampFormat | Type: String. The format for parsing timestamp strings. Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
timeZone | Type: String. The java.time.ZoneId to use when parsing timestamps and dates. Default value: None
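For example (placeholder table and path), multi-line JSON documents can be ingested by enabling multiLine, and a non-default timestamp pattern can be supplied with timestampFormat:
> COPY INTO my_json_data
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FORMAT_OPTIONS('multiLine' = 'true', 'timestampFormat' = 'yyyy-MM-dd HH:mm:ss')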
CSV options
Option | Description
---|---
badRecordsPath | Type: String. The path to store files for recording the information about bad CSV records. Default value: None
charToEscapeQuoteEscaping | Type: Char. The character used to escape the character used for escaping quotes. Default value: '\0'
columnNameOfCorruptRecord | Type: String. A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED, this column will be empty. Default value: _corrupt_record
comment | Type: Char. Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable comment skipping. Default value: '\u0000'
dateFormat | Type: String. The format for parsing date strings. Default value: yyyy-MM-dd
emptyValue | Type: String. String representation of an empty value. Default value: ""
encoding or charset | Type: String. The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. Default value: UTF-8
enforceSchema | Type: Boolean. Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution. Default value: true
escape | Type: Char. The escape character to use when parsing the data. Default value: '\'
header | Type: Boolean. Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema. Default value: false
ignoreLeadingWhiteSpace | Type: Boolean. Whether to ignore leading whitespaces for each parsed value. Default value: false
ignoreTrailingWhiteSpace | Type: Boolean. Whether to ignore trailing whitespaces for each parsed value. Default value: false
inferSchema | Type: Boolean. Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType. Requires an additional pass over the data if set to true. Default value: false
lineSep | Type: String. A string between two consecutive CSV records. Default value: None, which covers \r, \r\n, and \n
locale | Type: String. A java.util.Locale identifier. Default value: US
maxCharsPerColumn | Type: Int. Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1, meaning unlimited. Default value: -1
maxColumns | Type: Int. The hard limit of how many columns a record can have. Default value: 20480
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. Default value: false
mode | Type: String. Parser mode around handling malformed records. One of PERMISSIVE, DROPMALFORMED, and FAILFAST. Default value: PERMISSIVE
multiLine | Type: Boolean. Whether the CSV records span multiple lines. Default value: false
nanValue | Type: String. The string representation of a not-a-number value when parsing FloatType and DoubleType columns. Default value: "NaN"
negativeInf | Type: String. The string representation of negative infinity when parsing FloatType or DoubleType columns. Default value: "-Inf"
nullValue | Type: String. String representation of a null value. Default value: ""
parserCaseSensitive (deprecated) | Type: Boolean. While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. Default value: false
positiveInf | Type: String. The string representation of positive infinity when parsing FloatType or DoubleType columns. Default value: "Inf"
quote | Type: Char. The character used for escaping values where the field delimiter is part of the value. Default value: '"'
rescuedDataColumn | Type: String. Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column. Default value: None
sep or delimiter | Type: String. The separator string between columns. Default value: ","
timestampFormat | Type: String. The format for parsing timestamp strings. Default value: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
timeZone | Type: String. The java.time.ZoneId to use when parsing timestamps and dates. Default value: None
unescapedQuoteHandling | Type: String. The strategy for handling unescaped quotes. Allowed options: STOP_AT_CLOSING_QUOTE, BACK_TO_DELIMITER, STOP_AT_DELIMITER, SKIP_VALUE, RAISE_ERROR. Default value: STOP_AT_DELIMITER
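For example (placeholder table and path), pipe-delimited files with a header row and NA used as the null marker could be loaded as follows:
> COPY INTO my_pipe_data
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/csvData'
  FILEFORMAT = CSV
  FORMAT_OPTIONS('header' = 'true', 'sep' = '|', 'nullValue' = 'NA')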
PARQUET options
Option | Description
---|---
datetimeRebaseMode | Type: String. Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED. Default value: LEGACY
int96RebaseMode | Type: String. Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED. Default value: LEGACY
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: false
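For example (placeholder table and path), the reader-side mergeSchema option reconciles differing schemas across the Parquet files, while the mergeSchema copy option described earlier lets the target table evolve to match:
> COPY INTO my_delta_table
  FROM 's3://my-bucket/parquetData'
  FILEFORMAT = PARQUET
  FORMAT_OPTIONS('mergeSchema' = 'true')
  COPY_OPTIONS('mergeSchema' = 'true')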
AVRO options
Option | Description
---|---
avroSchema | Type: String. Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible but different with the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too. Default value: None
datetimeRebaseMode | Type: String. Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: EXCEPTION, LEGACY, and CORRECTED. Default value: LEGACY
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: false
BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options
Option | Description
---|---
encoding | Type: String. The name of the encoding of the TEXT files. See java.nio.charset.Charset for the list of options. Default value: UTF-8
lineSep | Type: String. A string between two consecutive TEXT records. Default value: None, which covers \r, \r\n, and \n
wholeText | Type: Boolean. Whether to read a file as a single record. Default value: false
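For example (placeholder table and path), each text file can be loaded as a single record, which is useful for ingesting whole documents. The sketch assumes the target table has a single string column named value, matching the column produced by the TEXT source:
> COPY INTO my_raw_text
  FROM 's3://my-bucket/textData'
  FILEFORMAT = TEXT
  FORMAT_OPTIONS('wholeText' = 'true')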
ORC options
Option | Description
---|---
mergeSchema | Type: Boolean. Whether to infer the schema across multiple files and to merge the schema of each file. Default value: false