Load data using COPY INTO with an instance profile
This article describes how to use the COPY INTO command to load data from an Amazon S3 bucket in your AWS account into a table in Databricks SQL.
The steps in this article assume that your admin has configured a SQL warehouse to use an AWS instance profile so that you can access your source files in S3. If your admin configured a Unity Catalog external location with a storage credential, see Load data using COPY INTO with Unity Catalog volumes or external locations instead. If your admin gave you temporary credentials (an AWS access key ID, a secret key, and a session token), see Load data using COPY INTO with temporary credentials instead.
Databricks recommends using the COPY INTO command for incremental and bulk data loading with Databricks SQL.
Note
COPY INTO works well for data sources that contain thousands of files. Databricks recommends that you use Auto Loader for loading millions of files, which is not supported in Databricks SQL.
Before you begin
Before you load data into Databricks, make sure you have the following:
Access to data in S3. Your admin must first complete the steps in Configure data access for ingestion so your Databricks SQL warehouse can read your source files.
A Databricks SQL warehouse that uses the instance profile that your admin created.
The Can manage permission on the SQL warehouse.
The fully qualified S3 URI.
Familiarity with the Databricks SQL user interface.
Step 1: Confirm access to data in cloud storage
To confirm that you have access to the correct data in cloud object storage, do the following:
In the sidebar, click Create > Query.
In the SQL editor’s menu bar, select a SQL warehouse.
In the SQL editor, paste the following code:
SELECT * FROM csv.`<path>`
Replace <path> with the S3 URI that you received from your admin, keeping the backticks around the path. For example, s3://<bucket>/<folder>/.
Click Run.
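For example, assuming a hypothetical bucket named my-example-bucket with a folder named nyctaxi-data (substitute the S3 URI your admin gave you), the query might look like this:
-- my-example-bucket and nyctaxi-data are placeholder names; use your own S3 URI
SELECT * FROM csv.`s3://my-example-bucket/nyctaxi-data/`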
Step 2: Create a table
This step describes how to create a table in your Databricks workspace to hold the incoming data.
In the SQL editor, paste the following code:
CREATE TABLE <catalog_name>.<schema_name>.<table_name> (
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  pickup_zip INT,
  dropoff_zip INT
);
Click Run.
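If you want to confirm that the table was created with the expected columns, you can optionally describe it. The catalog, schema, and table names below are the same placeholders as in the statement above:
-- Optional check: list the columns and data types of the new table
DESCRIBE TABLE <catalog_name>.<schema_name>.<table_name>;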
Step 3: Load data from cloud storage into the table
This step describes how to load data from an S3 bucket into the table you created in your Databricks workspace.
In the sidebar, click Create > Query.
In the SQL editor’s menu bar, select a SQL warehouse and make sure the SQL warehouse is running.
In the SQL editor, paste the following code. In this code, replace:
<s3-bucket> with the name of your S3 bucket.
<folder> with the name of the folder in your S3 bucket.
COPY INTO <catalog_name>.<schema_name>.<table_name>
FROM 's3://<s3-bucket>/<folder>/'
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true'
)
COPY_OPTIONS (
  'mergeSchema' = 'true'
);

SELECT * FROM <catalog_name>.<schema_name>.<table_name>;
Note
FORMAT_OPTIONS differs depending on FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema option instructs Databricks to automatically determine the data type of each field in the CSV file.
Click Run.
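As an aside, and only as an illustration of how the options change with the file format, a source of JSON files would drop the CSV-specific header option and might instead set a JSON-specific option such as multiLine. This sketch is not part of the walkthrough and reuses the same placeholder names:
-- Hypothetical sketch for JSON source files instead of CSV
COPY INTO <catalog_name>.<schema_name>.<table_name>
FROM 's3://<s3-bucket>/<folder>/'
FILEFORMAT = JSON
FORMAT_OPTIONS (
  'multiLine' = 'true'  -- parse records that span multiple lines; not needed for single-line JSON
)
COPY_OPTIONS (
  'mergeSchema' = 'true'
);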
Note
If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data.
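If you do want to reload files that have already been processed, for example after correcting the source data, COPY INTO supports a force copy option. Note that reloading files this way can insert duplicate rows. This sketch uses the same placeholder names as above:
-- Reload all files in the source path, even ones that were loaded before
COPY INTO <catalog_name>.<schema_name>.<table_name>
FROM 's3://<s3-bucket>/<folder>/'
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true'
)
COPY_OPTIONS (
  'force' = 'true'  -- ignore the load history and copy the files again
);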
Clean up
You can clean up the associated resources in your workspace if you no longer want to keep them.
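For example, to remove the table created in this walkthrough, you can drop it (same placeholder names as above):
-- Remove the table and its data
DROP TABLE <catalog_name>.<schema_name>.<table_name>;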
Additional resources
The COPY INTO reference article