Common data loading patterns with COPY INTO

Learn common patterns for using COPY INTO to load data from file sources into Delta Lake.

There are many options for using COPY INTO. You can also Use temporary credentials to load data with COPY INTO in combination with these patterns.

See COPY INTO for a full reference of all options.

Create target tables for COPY INTO

COPY INTO must target an existing Delta table. In Databricks Runtime 11.0 and above, setting the schema for these tables is optional for formats that support schema evolution:

CREATE TABLE IF NOT EXISTS my_table
[(col_1 col_1_type, col_2 col_2_type, ...)]
[COMMENT <table_description>]
[TBLPROPERTIES (<table_properties>)];

Note that to infer schema with copy into, you must pass additional options:

COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

The following example creates a schemaless Delta table called my_pipe_data and loads a pipe-delimited CSV with a header:

CREATE TABLE IF NOT EXISTS my_pipe_data;

COPY INTO my_pipe_data
  FROM 's3a://my-bucket/pipeData'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('mergeSchema' = 'true',
                  'delimiter' = '|',
                  'header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');

Load JSON data with COPY INTO

The following example loads JSON data from five files on AWS S3 into the Delta table called my_json_data. This table must be created before COPY INTO can be executed. If any data was already loaded from one of the files, the data isn’t reloaded for that file.

COPY INTO my_json_data
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')

-- The second execution will not copy any data since the first command already loaded the data
COPY INTO my_json_data
  FROM 's3://my-bucket/jsonData'
  FILEFORMAT = JSON
  FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')

Load Avro data with COPY INTO

The following example loads Avro data on Google Cloud Storage using additional SQL expressions as part of the SELECT statement.

COPY INTO my_delta_table
  FROM (SELECT to_date(dt) dt, event as measurement, quantity::double
          FROM 'gs://my-bucket/avroData')
  FILEFORMAT = AVRO

Load CSV files with COPY INTO

The following example loads CSV files from Azure Data Lake Storage Gen2 under abfss://container@storageAccount.dfs.core.windows.net/base/path/folder1 into a Delta table at abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target.

COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
  FROM (SELECT key, index, textData, 'constant_value'
          FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
  FILEFORMAT = CSV
  PATTERN = 'folder1/file_[a-g].csv'
  FORMAT_OPTIONS('header' = 'true')

-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
  FROM (SELECT _c0::bigint key, _c1::int index, _c2 textData
        FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
  FILEFORMAT = CSV
  PATTERN = 'folder1/file_[a-g].csv'

Ignore corrupt files while loading data

If the data you’re loading can’t be read due to some corruption issue, those files can be skipped by setting ignoreCorruptFiles to true in the FORMAT_OPTIONS.

The result of the COPY INTO command returns how many files were skipped due to corruption in the num_skipped_corrupt_files column. This metric also shows up in the operationMetrics column under numSkippedCorruptFiles after running DESCRIBE HISTORY on the Delta table.

Corrupt files aren’t tracked by COPY INTO, so they can be reloaded in a subsequent run if the corruption is fixed. You can see which files are corrupt by running COPY INTO in VALIDATE mode.

COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
[VALIDATE ALL]
FORMAT_OPTIONS ('ignoreCorruptFiles' = 'true')

Note

ignoreCorruptFiles is available in Databricks Runtime 11.0 and above.