Tutorial: Bulk load data into a table with COPY INTO
If you have data in cloud storage that you need to bulk-load into a table in a Databricks workspace, the COPY INTO command is a great option. It supports incremental appends, simple transformations, and many data formats, and it is ideal when the source directory contains a relatively small number of files (thousands of files or fewer).
Note
Databricks recommends that you use Auto Loader for advanced use cases. Unlike COPY INTO, Auto Loader supports processing millions of files, complex transformations, and merging change data.
In this tutorial, you use the COPY INTO command to load data from an Amazon S3 bucket in your AWS account into a table in your Databricks workspace.
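At a high level, the command takes a target table, a source location, and a file format, plus optional format and copy options. The sketch below shows only that general shape; the table name and S3 path are placeholders, and you build the complete, credentialed version of this command in Step 5.

COPY INTO my_schema.my_table
  FROM 's3://<some-bucket>/<some-folder>/'
  FILEFORMAT = CSV  -- other supported values include JSON, AVRO, ORC, PARQUET, TEXT, and BINARYFILE
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');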
Requirements
A Databricks account, and a Databricks workspace within your account. To create these, see Sign up for a free trial.
A Databricks SQL endpoint or an all-purpose cluster within your workspace. To create a SQL endpoint, see Create a SQL endpoint. To create an all-purpose cluster, see Create a cluster.
Familiarity with the Databricks workspace user interface. See Navigate the workspace.
If you want to use Databricks SQL instead of a cluster, familiarity with the Databricks SQL user interface. See the Databricks SQL user guide.
An Amazon S3 bucket in your AWS account. To create a bucket, see Creating a bucket in the AWS documentation.
Access permissions in your AWS account to create policies, users, and access keys for users with AWS Identity and Access Management (IAM).
The AWS Command Line Interface (AWS CLI) installed on your local development machine. See Installing or updating the latest version of the AWS CLI on the AWS website.
The ability to create AWS Security Token Service (AWS STS) session tokens for your AWS account.
Step 1. Prepare the sample data
The COPY INTO command needs data from a supported source to load into your Databricks workspace. Supported sources include CSV, JSON, Avro, ORC, Parquet, text, and binary files. This source can be anywhere that your Databricks workspace has access to.
This tutorial follows a scenario in which the data is in an Amazon S3 bucket:
To set things up, in this step you get a copy of some sample data from Databricks datasets. You then prepare that sample data to be stored in an existing S3 bucket in your AWS account.
In the next step, you upload this sample data to the S3 bucket.
In the third step, you set up access permissions for the COPY INTO command.
Finally, you run the COPY INTO command to load the data from the S3 bucket back into your Databricks workspace.
Normally, you would not export sample data from your Databricks workspace and re-import it. However, this scenario needs to follow this particular workflow to prepare sample data for the COPY INTO command.
To prepare the sample data, you can use the Databricks SQL editor or a notebook in your Databricks workspace.
To use the SQL editor:
In the SQL persona, on the sidebar, click Create > Query.
In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.
In the SQL editor, paste the following code:
SELECT * FROM samples.nyctaxi.trips
Click Run.
At the bottom of the editor, click the ellipsis icon, and then click Download as CSV file.
Note
This dataset contains almost 22,000 rows of data. This tutorial downloads only the first 1,000 rows of data. To download all of the rows, clear the LIMIT 1000 box and then repeat steps 4-5.
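If you prefer to make the row cap explicit in the query itself rather than relying on the editor's LIMIT 1000 setting, a LIMIT clause (a minor variation on the query above) achieves the same result:

SELECT * FROM samples.nyctaxi.trips LIMIT 1000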
To use a notebook:
In the Data Science & Engineering or Machine Learning persona, on the sidebar, click Create > Notebook.
In the Create Notebook dialog, enter a name for the notebook, for example, NYC Taxi COPY INTO Sample.
For Default Language, select SQL.
For Cluster, select the cluster you created in the Requirements section, or select another available cluster that you want to use.
Click Create.
In the notebook’s menu bar, if the circle next to the name of the cluster does not contain a green check mark, click the drop-down arrow next to the cluster’s name, and then click Start Cluster. Click Confirm, and then wait until the circle contains a green check mark.
In the notebook’s first cell, paste the following code:
SELECT * FROM delta.`/databricks-datasets/nyctaxi-with-zipcodes/subsampled`
In the notebook’s menu bar, click Run All.
At the bottom of the cell, click the drop-down arrow next to the download icon, and then click Download preview.
Note
This dataset contains almost 22,000 rows of data. This tutorial downloads only the first 1,000 rows of data. To download all of the rows, click Download full results (max 1 million rows), and then click Re-execute and download.
Step 2: Upload the sample data to cloud storage
In this step, you upload the sample data from your local development machine into a folder inside of an S3 bucket in your AWS account.
Sign in to the AWS Management Console, and open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3.
Browse to and click your existing S3 bucket. This tutorial uses a bucket named nyctaxi-sample-data.
Click Create folder.
Enter a name for the folder, and then click Create folder. This tutorial uses a folder named nyctaxi.
Click the nyctaxi folder.
Click Upload.
Follow the on-screen instructions to upload the CSV file from the previous step into this folder.
Step 3: Create resources in your cloud account to access cloud storage
In this step, you create an IAM user in your AWS account with just enough access to read the CSV file that you uploaded to S3.
Create the policy for the user: open the AWS IAM console in your AWS account, typically at https://console.aws.amazon.com/iam.
Click Policies.
Click Create Policy.
Click the JSON tab.
Replace the existing JSON code with the following code. In the code, replace:
nyctaxi-sample-data with the name of your S3 bucket.
nyctaxi with the name of the folder within your S3 bucket.
{ "Version": "2012-10-17", "Statement": [ { "Sid": "ReadOnlyAccessToTrips", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::nyctaxi-sample-data", "arn:aws:s3:::nyctaxi-sample-data/nyctaxi/*" ] } ] }
Click Next: Tags.
Click Next: Review.
Enter a name for the policy, and then click Create policy. This tutorial uses a fictitious policy named nyctaxi-s3-ingest-policy.
Create the user: in the sidebar, click Users.
Click Add users.
Enter a name for the user. This tutorial uses a fictitious user named nyctaxi-s3-ingest-user.
Select the Access key - Programmatic access box, and then click Next: Permissions.
Click Attach existing policies directly.
Select the box next to the nyctaxi-s3-ingest-policy policy, and then click Next: Tags.
Click Next: Review.
Click Create user.
Copy the Access key ID and Secret access key values that appear to a secure location, as you will need them to get the AWS STS session token.
Get the AWS STS session token: on your local development machine, use the AWS CLI to create a named profile with the AWS credentials that you just copied. See Named profiles for the AWS CLI on the AWS website. This tutorial uses a fictitious named profile called nyctaxi-s3-ingest-profile.
Test your AWS credentials. To do this, use the AWS CLI to run the following command, which displays the contents of the nyctaxi folder. In the command, replace:
nyctaxi-sample-data with the name of your S3 bucket.
nyctaxi with the name of the folder within your S3 bucket.
nyctaxi-s3-ingest-profile with the name of your named profile.
aws s3 ls s3://nyctaxi-sample-data/nyctaxi/ --profile nyctaxi-s3-ingest-profile
To get the session token, run the following command. In the command, replace nyctaxi-s3-ingest-profile with the name of your named profile.

aws sts get-session-token --profile nyctaxi-s3-ingest-profile
Copy the AccessKeyId, SecretAccessKey, and SessionToken values that appear to a secure location, as you will need them in Step 5.
Step 4: Create the table
In this step, you create a table in your Databricks workspace to hold the incoming data.
To create this table, you can use the SQL editor or the notebook that you used earlier.
To use the SQL editor:
In the sidebar, click Create > Query.
In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.
In the SQL editor, paste the following code:
CREATE TABLE default.nyctaxi_trips (
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  pickup_zip INT,
  dropoff_zip INT
);
Click Run.
To use the notebook:
In the notebook’s second cell, paste the following code. (To add a cell, hover your mouse pointer on the bottom center of the first cell, and then click the + (plus) icon.)
CREATE TABLE default.nyctaxi_trips (
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  pickup_zip INT,
  dropoff_zip INT
);
Run only the second cell. (To run a cell, with your cursor inside of that cell, press Shift+Enter.)
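Optionally, whichever path you used, you can confirm that the new table exists and is still empty before you load any data into it. This check is not required by the tutorial; run it in the SQL editor or in a new notebook cell.

-- Optional check: the table should exist and contain zero rows at this point.
DESCRIBE TABLE default.nyctaxi_trips;
SELECT COUNT(*) AS row_count FROM default.nyctaxi_trips;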
Step 5: Load the sample data from cloud storage into the table
In this step, you load the CSV file from the S3 bucket into the table in your Databricks workspace.
To load the CSV file, you can use the SQL editor or the notebook that you used earlier.
To use the SQL editor:
In the sidebar, click Create > Query.
In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.
In the SQL editor, paste the following code. In this code, replace:
nyctaxi-sample-data with the name of your S3 bucket.
nyctaxi with the name of the folder in your S3 bucket.
<your-AccessKeyId> with the value of AccessKeyId from Step 3.
<your-SecretAccessKey> with the value of SecretAccessKey from Step 3.
<your-SessionToken> with the value of SessionToken from Step 3.
COPY INTO default.nyctaxi_trips
  FROM 's3://nyctaxi-sample-data/nyctaxi/' WITH (
    CREDENTIAL (
      AWS_ACCESS_KEY = '<your-AccessKeyId>',
      AWS_SECRET_KEY = '<your-SecretAccessKey>',
      AWS_SESSION_TOKEN = '<your-SessionToken>'
    )
  )
  FILEFORMAT = CSV
  FORMAT_OPTIONS (
    'header' = 'true',
    'inferSchema' = 'true'
  );

SELECT * FROM default.nyctaxi_trips;
Note
FORMAT_OPTIONS differs depending on FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema option instructs Databricks to automatically determine the data type of each field in the CSV file.
Click Run.
Note
If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data.
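If you do want to reload files that have already been processed, for example while experimenting with format options, COPY INTO accepts a COPY_OPTIONS clause. The sketch below assumes the same bucket, folder, and temporary credentials as above; setting 'force' = 'true' ingests the files again even though they were loaded before, which appends the rows a second time. The same option works in the notebook path that follows.

COPY INTO default.nyctaxi_trips
  FROM 's3://nyctaxi-sample-data/nyctaxi/' WITH (
    CREDENTIAL (
      AWS_ACCESS_KEY = '<your-AccessKeyId>',
      AWS_SECRET_KEY = '<your-SecretAccessKey>',
      AWS_SESSION_TOKEN = '<your-SessionToken>'
    )
  )
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('force' = 'true');  -- reload files even if they were loaded before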
To use the notebook:
In the notebook’s third cell, paste the following code. In this code, replace:
nyctaxi-sample-data with the name of your S3 bucket.
nyctaxi with the name of the folder within your S3 bucket.
<your-AccessKeyId> with the value of AccessKeyId from Step 3.
<your-SecretAccessKey> with the value of SecretAccessKey from Step 3.
<your-SessionToken> with the value of SessionToken from Step 3.
COPY INTO default.nyctaxi_trips
  FROM 's3://nyctaxi-sample-data/nyctaxi/' WITH (
    CREDENTIAL (
      AWS_ACCESS_KEY = '<your-AccessKeyId>',
      AWS_SECRET_KEY = '<your-SecretAccessKey>',
      AWS_SESSION_TOKEN = '<your-SessionToken>'
    )
  )
  FILEFORMAT = CSV
  FORMAT_OPTIONS (
    'header' = 'true',
    'inferSchema' = 'true'
  );

SELECT * FROM default.nyctaxi_trips;
Note
FORMAT_OPTIONS differs depending on FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema option instructs Databricks to automatically determine the data type of each field in the CSV file.
Run only the third cell.
Note
If you run this cell again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data.
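Whichever path you used, you can optionally confirm how many rows were loaded. The count should match the number of rows in the CSV file you uploaded (1,000 if you downloaded only the first 1,000 rows in Step 1). Run the query in the SQL editor or in a new notebook cell.

SELECT COUNT(*) AS row_count FROM default.nyctaxi_trips;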
Step 6: Clean up
When you are done with this tutorial, you can clean up the associated resources in your cloud account and Databricks if you no longer want to keep them.
Delete the AWS CLI named profile
In your ~/.aws/credentials file for Unix, Linux, and macOS, or in your %USERPROFILE%\.aws\credentials file for Windows, remove the following portion of the file, and then save the file:
[nyctaxi-s3-ingest-profile]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
Delete the IAM user
Open the IAM console in your AWS account, typically at https://console.aws.amazon.com/iam.
In the sidebar, click Users.
Select the box next to nyctaxi-s3-ingest-user, and then click Delete.
Enter nyctaxi-s3-ingest-user, and then click Delete.
Delete the IAM policy
Open the IAM console in your AWS account, if it is not already open, typically at https://console.aws.amazon.com/iam.
In the sidebar, click Policies.
Select the option next to nyctaxi-s3-ingest-policy, and then click Actions > Delete.
Enter nyctaxi-s3-ingest-policy, and then click Delete.
Delete the S3 bucket
Open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3.
Select the option next to nyctaxi-sample-data, and then click Empty.
Enter permanently delete, and then click Empty.
In the sidebar, click Buckets.
Select the option next to nyctaxi-sample-data, and then click Delete.
Enter nyctaxi-sample-data, and then click Delete bucket.
Delete the table
You can use the SQL editor or the notebook that you used earlier.
To use the SQL editor:
In the sidebar, click Create > Query.
In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.
In the SQL editor, paste the following code:
DROP TABLE default.nyctaxi_trips;
Click Run.
In the SQL editor’s menu bar, hover over the tab for this query, and then click the X icon.
To use the notebook:
In a new cell at the end of the notebook, paste the following code:
DROP TABLE default.nyctaxi_trips;
Run only that cell.
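Either way, if you might run the cleanup more than once, the IF EXISTS form of the statement (a minor variation on the code above) avoids an error when the table has already been dropped:

DROP TABLE IF EXISTS default.nyctaxi_trips;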
Delete the queries in the SQL editor
If you used the SQL editor for this tutorial, you can delete the queries.
In your Databricks workspace, in the SQL persona, click SQL Editor in the sidebar.
In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then click the X icon.
Delete the notebook
If you used a notebook for this tutorial, you can delete the notebook.
In your Databricks workspace, in the Data Science & Engineering or Machine Learning persona, click Workspace > Users > your user name in the sidebar.
Click the drop-down arrow next to NYC Taxi COPY INTO Sample, and then click Move to Trash.
Stop the SQL endpoint
If you used a SQL endpoint for this tutorial, and you are not using the SQL endpoint for any other tasks, you should stop the SQL endpoint to avoid additional costs.
In the SQL persona, on the sidebar, click SQL Endpoints.
Next to the name of the SQL endpoint, click Stop.
When prompted, click Stop again.
Stop the cluster
If you used a notebook for this tutorial, and you are not using the cluster that was associated with the notebook for any other tasks, you should stop the cluster to avoid additional costs.
In the Data Science & Engineering or Machine Learning persona, on the sidebar, click Compute.
Click the cluster’s name.
Click Terminate.
Click Confirm.
Additional resources
The COPY INTO reference article