Learn how to use COPY INTO to ingest data into Unity Catalog managed or external tables from any source and file format supported by COPY INTO. Unity Catalog adds new options for configuring secure access to raw data. You can use Unity Catalog volumes or external locations to access data in cloud object storage.
Databricks recommends using volumes to access files in cloud storage as part of the ingestion process with COPY INTO. For more information about recommendations for using volumes and external locations, see Unity Catalog best practices.
This article describes how to use the COPY INTO command to load data from an Amazon S3 bucket in your AWS account into a table in Databricks SQL.
The steps in this article assume that your admin has configured a Unity Catalog volume or external location so that you can access your source files in S3. If your admin configured a compute resource to use an AWS instance profile, see Load data using COPY INTO with an instance profile or Tutorial: COPY INTO with Spark SQL instead. If your admin gave you temporary credentials (an AWS access key ID, a secret key, and a session token), see Load data using COPY INTO with temporary credentials instead.
Before you use COPY INTO to load data from a Unity Catalog volume or from a cloud object storage path that's defined as a Unity Catalog external location, you must have the following:

- The READ VOLUME privilege on a volume or the READ FILES privilege on an external location.
  - For more information about creating volumes, see Create volumes.
  - For more information about creating external locations, see Manage external locations and storage credentials.
- The path to your source data in the form of a cloud object storage URL or a volume path.
  - Example cloud object storage URL: `s3://landing-bucket/raw-data`
  - Example volume path: `/Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/raw_data`
- The USE SCHEMA privilege on the schema that contains the target table.
- The USE CATALOG privilege on the parent catalog.

For more information about Unity Catalog privileges, see Unity Catalog privileges and securable objects.
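As an illustrative sketch of how an admin might grant these privileges with SQL, assuming hypothetical catalog, schema, volume, external location, and group names (substitute your own):

```sql
-- Hypothetical names throughout; replace with your own objects and principal.
GRANT USE CATALOG ON CATALOG quickstart_catalog TO `data_engineers`;
GRANT USE SCHEMA ON SCHEMA quickstart_catalog.quickstart_schema TO `data_engineers`;

-- One of the following, depending on whether raw data access goes through
-- a volume or an external location:
GRANT READ VOLUME ON VOLUME quickstart_catalog.quickstart_schema.quickstart_volume TO `data_engineers`;
GRANT READ FILES ON EXTERNAL LOCATION my_landing_location TO `data_engineers`;
```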
To load data from a Unity Catalog volume, you must have the READ VOLUME privilege. Volume privileges apply to all nested directories under the specified volume.

For example, if you have access to a volume with the path /Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/, the following commands are valid:
```sql
COPY INTO landing_table
  FROM '/Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/raw_data'
  FILEFORMAT = PARQUET;

COPY INTO json_table
  FROM '/Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/raw_data/json'
  FILEFORMAT = JSON;
```
Optionally, you can also use a volume path with the dbfs scheme. For example, the following commands are also valid:
```sql
COPY INTO landing_table
  FROM 'dbfs:/Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/raw_data'
  FILEFORMAT = PARQUET;

COPY INTO json_table
  FROM 'dbfs:/Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/raw_data/json'
  FILEFORMAT = JSON;
```
The following example loads data from S3 and ADLS Gen2 into a table, using Unity Catalog external locations to provide access to the source data.
```sql
COPY INTO my_json_data
  FROM 's3://landing-bucket/json-data'
  FILEFORMAT = JSON;

COPY INTO my_json_data
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/jsonData'
  FILEFORMAT = JSON;
```
External location privileges apply to all nested directories under the specified location.
For example, if you have access to an external location defined with the URL s3://landing-bucket/raw-data, the following commands are valid:
```sql
COPY INTO landing_table
  FROM 's3://landing-bucket/raw-data'
  FILEFORMAT = PARQUET;

COPY INTO json_table
  FROM 's3://landing-bucket/raw-data/json'
  FILEFORMAT = JSON;
```
Permissions on an external location do not grant any privileges on directories above or parallel to the specified location. For example, neither of the following commands is valid:
```sql
COPY INTO parent_table
  FROM 's3://landing-bucket'
  FILEFORMAT = PARQUET;

COPY INTO sibling_table
  FROM 's3://landing-bucket/json-data'
  FILEFORMAT = JSON;
```
You can target a Unity Catalog table using a three-tier identifier (<catalog_name>.<database_name>.<table_name>). You can use the USE CATALOG <catalog_name> and USE <database_name> commands to set the default catalog and database for your current query or notebook.
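As a sketch, reusing the quickstart names from the volume examples above (the table name is a placeholder), both of the following approaches target the same table:

```sql
-- Fully qualified three-tier identifier:
COPY INTO quickstart_catalog.quickstart_schema.landing_table
  FROM '/Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/raw_data'
  FILEFORMAT = PARQUET;

-- Or set the default catalog and database first, then use the bare table name:
USE CATALOG quickstart_catalog;
USE quickstart_schema;

COPY INTO landing_table
  FROM '/Volumes/quickstart_catalog/quickstart_schema/quickstart_volume/raw_data'
  FILEFORMAT = PARQUET;
```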