Tutorial: Bulk load data into a table with COPY INTO

If you have data in cloud storage that needs to be bulk loaded into a table in a Databricks workspace, the COPY INTO command is a great option. It supports incremental appends, simple transformations, and many data formats. It is ideal when the source directory contains a relatively small number of files, on the order of thousands of files or fewer.
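
In its simplest form, a COPY INTO statement names a target table, a source location, and a file format, plus optional format and copy options. The following sketch shows only the general shape of the command; the table name, bucket path, and options are placeholders, and Step 5 of this tutorial builds a complete, working example that includes credentials:

    -- General shape of a COPY INTO statement (placeholder names and paths).
    COPY INTO default.my_table
    FROM 's3://my-bucket/my-folder/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');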

Note

Databricks recommends that you use Auto Loader for advanced use cases. Unlike COPY INTO, Auto Loader can scale to source directories that contain millions of files, and it supports complex transformations and merging change data.

In this tutorial, you use the COPY INTO command to load data from an Amazon S3 bucket in your AWS account into a table in your Databricks workspace.

Requirements

  1. A Databricks account, and a Databricks workspace within your account. To create these, see Sign up for a free trial.

  2. A Databricks SQL endpoint or an all-purpose cluster within your workspace. To create a SQL endpoint, see Create a SQL endpoint. To create an all-purpose cluster, see Create a cluster.

  3. Familiarity with the Databricks workspace user interface. See Navigate the workspace.

  4. If you want to use Databricks SQL instead of a cluster, familiarity with the Databricks SQL user interface. See the Databricks SQL user guide.

  5. An Amazon S3 bucket in your AWS account. To create a bucket, see Creating a bucket in the AWS documentation.

  6. Access permissions in your AWS account to create policies, users, and access keys for users with AWS Identity and Access Management (IAM).

  7. The AWS Command Line Interface (AWS CLI) installed on your local development machine. See Installing or updating the latest version of the AWS CLI on the AWS website.

  8. The ability to create AWS Security Token Service (AWS STS) session tokens for your AWS account.

Step 1: Prepare the sample data

The COPY INTO command needs source data in a supported file format to load into your Databricks workspace. Supported formats include CSV, JSON, Avro, ORC, Parquet, text, and binary files. The source data can be anywhere that your Databricks workspace has access to.

This tutorial follows a scenario in which the data is in an Amazon S3 bucket:

  • To set things up, in this step you get a copy of some sample data from Databricks datasets. You then prepare that sample data to be stored in an existing S3 bucket in your AWS account.

  • In the next step, you upload this sample data to the S3 bucket.

  • In the third step, you set up access permissions for the COPY INTO command.

  • Finally, you run the COPY INTO command to load the data from the S3 bucket back into your Databricks workspace.

Normally, you would not export sample data from your Databricks workspace and re-import it. However, this tutorial follows that workflow so that you have sample data in cloud storage to load with the COPY INTO command.

To prepare the sample data, you can use the Databricks SQL editor or a notebook in your Databricks workspace.

To use the SQL editor:

  1. In the SQL persona, on the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.

  3. In the SQL editor, paste the following code:

    SELECT * FROM samples.nyctaxi.trips
    
  4. Click Run.

  5. At the bottom of the editor, click the ellipsis icon, and then click Download as CSV file.

    Note

    This dataset contains almost 22,000 rows of data. This tutorial downloads only the first 1,000 rows of data. To download all of the rows, clear the LIMIT 1000 box and then repeat steps 4-5.

To use a notebook:

  1. In the Data Science & Engineering or Machine Learning persona, on the sidebar, click Create > Notebook.

  2. In the Create Notebook dialog, enter a name for the notebook, for example, NYC Taxi COPY INTO Sample.

  3. For Default Language, select SQL.

  4. For Cluster, select the cluster you created in the Requirements section, or select another available cluster that you want to use.

  5. Click Create.

  6. In the notebook’s menu bar, if the circle next to the name of the cluster does not contain a green check mark, click the drop-down arrow next to the cluster’s name, and then click Start Cluster. Click Confirm, and then wait until the circle contains a green check mark.

  7. In the notebook’s first cell, paste the following code:

    SELECT * FROM delta.`/databricks-datasets/nyctaxi-with-zipcodes/subsampled`
    
  8. In the notebook’s menu bar, click Run All.

  9. At the bottom of the cell, click the drop-down arrow next to the download icon, and then click Download preview.

    Note

    This dataset contains almost 22,000 rows of data. This tutorial downloads only the first 1,000 rows of data. To download all of the rows, click Download full results (max 1 million rows), and then click Re-execute and download.

Step 2: Upload the sample data to cloud storage

In this step, you upload the sample data from your local development machine into a folder inside of an S3 bucket in your AWS account.

  1. Sign in to the AWS Management Console, and open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3.

  2. Browse to and click on your existing S3 bucket. This tutorial uses a bucket named nyctaxi-sample-data.

  3. Click Create folder.

  4. Enter a name for the folder and click Create folder. This tutorial uses a folder named nyctaxi.

  5. Click the nyctaxi folder.

  6. Click Upload.

  7. Follow the on-screen instructions to upload the CSV file from the previous step into this folder.
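
If you prefer the command line, you can upload the file with the AWS CLI instead. The following is a minimal sketch; the file name trips.csv is a placeholder for whatever name your downloaded CSV file has, and the credentials you use here must have write access to the bucket (the read-only IAM user that you create in the next step does not):

    # Upload the local CSV file into the nyctaxi folder of the bucket.
    # Replace nyctaxi-sample-data, nyctaxi, and trips.csv with your own values.
    aws s3 cp trips.csv s3://nyctaxi-sample-data/nyctaxi/trips.csv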

Step 3: Create resources in your cloud account to access cloud storage

In this step, you create an IAM user in your AWS account that has just enough access to read the CSV file that you uploaded to the S3 bucket.

  1. Create the policy for the user: open the AWS IAM console in your AWS account, typically at https://console.aws.amazon.com/iam.

  2. Click Policies.

  3. Click Create Policy.

  4. Click the JSON tab.

  5. Replace the existing JSON code with the following code. In the code, replace:

    • nyctaxi-sample-data with the name of your S3 bucket.

    • nyctaxi with the name of the folder within your S3 bucket.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ReadOnlyAccessToTrips",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::nyctaxi-sample-data",
            "arn:aws:s3:::nyctaxi-sample-data/nyctaxi/*"
          ]
        }
      ]
    }
    
  6. Click Next: Tags.

  7. Click Next: Review.

  8. Enter a name for the policy and click Create policy. This tutorial uses a fictitious policy named nyctaxi-s3-ingest-policy.

  9. Create the user: in the sidebar, click Users.

  10. Click Add users.

  11. Enter a name for the user. This tutorial uses a fictitious user named nyctaxi-s3-ingest-user.

  12. Select the Access key - Programmatic access box, and then click Next: Permissions.

  13. Click Attach existing policies directly.

  14. Select the box next to the nyctaxi-s3-ingest-policy policy, and then click Next: Tags.

  15. Click Next: Review.

  16. Click Create user.

  17. Copy the Access key ID and Secret access key values that appear, and store them in a secure location. You will need them to get the AWS STS session token.

  18. Get the AWS STS session token: on your local development machine, use the AWS CLI to create a named profile with the AWS credentials that you just copied. See Named profiles for the AWS CLI on the AWS website. This tutorial uses a fictitious profile named nyctaxi-s3-ingest-profile. (A sketch of creating this profile appears after this list.)

  19. Test your AWS credentials. To do this, use the AWS CLI to run the following command, which displays the contents of the nyctaxi folder. In the command, replace:

    • nyctaxi-sample-data with the name of your S3 bucket.

    • nyctaxi with the name of the folder within your S3 bucket.

    • nyctaxi-s3-ingest-profile with the name of your named profile.

    aws s3 ls s3://nyctaxi-sample-data/nyctaxi/ --profile nyctaxi-s3-ingest-profile
    
  20. To get the session token, run the following command. In the command, replace nyctaxi-s3-ingest-profile with the name of your named profile.

    aws sts get-session-token --profile nyctaxi-s3-ingest-profile
    
  21. Copy the AccessKeyId, SecretAccessKey, and SessionToken values that appear, and store them in a secure location. You will need them in Step 5.
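
For reference, creating the named profile in step 18 can look like the following sketch. The aws configure command prompts you for the access key ID and secret access key that you copied; the JSON shown in the comments is the general shape of the output that step 20 returns, with placeholder values:

    # Create the named profile interactively. When prompted, paste the access key ID
    # and secret access key that you copied; the region and output format are optional.
    aws configure --profile nyctaxi-s3-ingest-profile

    # The aws sts get-session-token command in step 20 returns JSON of roughly this shape:
    # {
    #   "Credentials": {
    #     "AccessKeyId": "<your-AccessKeyId>",
    #     "SecretAccessKey": "<your-SecretAccessKey>",
    #     "SessionToken": "<your-SessionToken>",
    #     "Expiration": "<expiration-timestamp>"
    #   }
    # }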

Step 4: Create the table

In this step, you create a table in your Databricks workspace to hold the incoming data.

To create this table, you can use the SQL editor or the notebook that you used earlier.

To use the SQL editor:

  1. In the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.

  3. In the SQL editor, paste the following code:

    CREATE TABLE default.nyctaxi_trips (
      tpep_pickup_datetime  TIMESTAMP,
      tpep_dropoff_datetime TIMESTAMP,
      trip_distance DOUBLE,
      fare_amount DOUBLE,
      pickup_zip INT,
      dropoff_zip INT
    );
    
  4. Click Run.

To use the notebook:

  1. In the notebook’s second cell, paste the following code. (To add a cell, hover your mouse pointer over the bottom center of the first cell, and then click the + (plus) icon.)

    CREATE TABLE default.nyctaxi_trips (
      tpep_pickup_datetime  TIMESTAMP,
      tpep_dropoff_datetime TIMESTAMP,
      trip_distance DOUBLE,
      fare_amount DOUBLE,
      pickup_zip INT,
      dropoff_zip INT
    );
    
  2. Run only the second cell. (To run a cell, with your cursor inside of that cell, press Shift+Enter.)
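
Optionally, to confirm that the empty table was created before you load data into it, you can run a quick check in the SQL editor or in a new notebook cell. This is a minimal sketch; the output lists the column names and data types that you defined above:

    -- Confirm that the table exists and inspect its schema.
    DESCRIBE TABLE default.nyctaxi_trips;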

Step 5: Load the sample data from cloud storage into the table

In this step, you load the CSV file from the S3 bucket into the table in your Databricks workspace.

To load the CSV file, you can use the SQL editor or the notebook that you used earlier.

To use the SQL editor:

  1. In the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.

  3. In the SQL editor, paste the following code. In this code, replace:

    • nyctaxi-sample-data with the name of your S3 bucket.

    • nyctaxi with the name of the folder in your S3 bucket.

    • <your-AccessKeyId> with the value of AccessKeyId from Step 3.

    • <your-SecretAccessKey> with the value of SecretAccessKey from Step 3.

    • <your-SessionToken> with the value of SessionToken from Step 3.

    COPY INTO default.nyctaxi_trips
    FROM 's3://nyctaxi-sample-data/nyctaxi/'
    WITH (
      CREDENTIAL (
        AWS_ACCESS_KEY = '<your-AccessKeyId>',
        AWS_SECRET_KEY = '<your-SecretAccessKey>',
        AWS_SESSION_TOKEN = '<your-SessionToken>'
      )
    )
    FILEFORMAT = CSV
    FORMAT_OPTIONS (
      'header' = 'true',
      'inferSchema' = 'true'
    );
    
    SELECT * FROM default.nyctaxi_trips;
    

    Note

    The available FORMAT_OPTIONS depend on the FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema option instructs Databricks to automatically determine the data type of each field in the CSV file.

  4. Click Run.

    Note

    If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data.

To use the notebook:

  1. In the notebook’s third cell, paste the following code. Replace:

    • nyctaxi-sample-data with the name of your S3 bucket.

    • nyctaxi with the name of the folder within your S3 bucket.

    • <your-AccessKeyId> with the value of AccessKeyId from Step 3.

    • <your-SecretAccessKey> with the value of SecretAccessKey from Step 3.

    • <your-SessionToken> with the value of SessionToken from Step 3.

    COPY INTO default.nyctaxi_trips
    FROM 's3://nyctaxi-sample-data/nyctaxi/'
    WITH (
      CREDENTIAL (
        AWS_ACCESS_KEY = '<your-AccessKeyId>',
        AWS_SECRET_KEY = '<your-SecretAccessKey>',
        AWS_SESSION_TOKEN = '<your-SessionToken>'
      )
    )
    FILEFORMAT = CSV
    FORMAT_OPTIONS (
      'header' = 'true',
      'inferSchema' = 'true'
    );
    
    SELECT * FROM default.nyctaxi_trips;
    

    Note

    The available FORMAT_OPTIONS depend on the FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema option instructs Databricks to automatically determine the data type of each field in the CSV file.

  2. Run only the third cell.

    Note

    If you run this cell again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data.
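
If you do want to reload files that COPY INTO has already processed, for example while experimenting, you can add the force copy option, which disables the skip-already-loaded behavior. The following sketch reuses the statement from this step; replace the placeholder values as before, and note that forcing a reload appends the rows again, so the table will then contain duplicates:

    COPY INTO default.nyctaxi_trips
    FROM 's3://nyctaxi-sample-data/nyctaxi/'
    WITH (
      CREDENTIAL (
        AWS_ACCESS_KEY = '<your-AccessKeyId>',
        AWS_SECRET_KEY = '<your-SecretAccessKey>',
        AWS_SESSION_TOKEN = '<your-SessionToken>'
      )
    )
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('force' = 'true');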

Step 6: Clean up

When you are done with this tutorial, you can clean up the associated resources in your AWS account and your Databricks workspace if you no longer want to keep them.

Delete the AWS CLI named profile

In your ~/.aws/credentials file for Unix, Linux, and macOS, or in your %USERPROFILE%\.aws\credentials file for Windows, remove the following portion of the file, and then save the file:

[nyctaxi-s3-ingest-profile]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>

Delete the IAM user

  1. Open the IAM console in your AWS account, typically at https://console.aws.amazon.com/iam.

  2. In the sidebar, click Users.

  3. Select the box next to nyctaxi-s3-ingest-user, and then click Delete.

  4. Enter nyctaxi-s3-ingest-user, and then click Delete.

Delete the IAM policy

  1. Open the IAM console in your AWS account, if it is not already open, typically at https://console.aws.amazon.com/iam.

  2. In the sidebar, click Policies.

  3. Select the option next to nyctaxi-s3-ingest-policy, and then click Actions > Delete.

  4. Enter nyctaxi-s3-ingest-policy, and then click Delete.

Delete the S3 bucket

  1. Open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3.

  2. Select the option next to nyctaxi-sample-data, and then click Empty.

  3. Enter permanently delete, and then click Empty.

  4. In the sidebar, click Buckets.

  5. Select the option next to nyctaxi-sample-data, and then click Delete.

  6. Enter nyctaxi-sample-data, and then click Delete bucket.

Delete the table

You can use the SQL editor or the notebook that you used earlier.

To use the SQL editor:

  1. In the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select the SQL endpoint that you created in the Requirements section, or select another available SQL endpoint that you want to use.

  3. In the SQL editor, paste the following code:

    DROP TABLE default.nyctaxi_trips;
    
  4. Click Run.

  5. In the SQL editor’s menu bar, hover over the tab for this query, and then click the X icon.

To use the notebook:

  1. In the notebook’s fourth cell, paste the following code:

    DROP TABLE default.nyctaxi_trips;
    
  2. Run only the fourth cell.

Delete the queries in the SQL editor

If you used the SQL editor for this tutorial, you can delete the queries.

  1. In your Databricks workspace, in the SQL persona, click SQL Editor in the sidebar.

  2. In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then click the X icon.

Delete the notebook

If you used a notebook for this tutorial, you can delete the notebook.

  1. In your Databricks workspace, in the Data Science & Engineering or Machine Learning persona, click Workspace > Users > your user name in the sidebar.

  2. Click the drop-down arrow next to NYC Taxi COPY INTO Sample, and then click Move to Trash.

Stop the SQL endpoint

If you used a SQL endpoint for this tutorial, and you are not using the SQL endpoint for any other tasks, you should stop the SQL endpoint to avoid additional costs.

  1. In the SQL persona, on the sidebar, click SQL Endpoints.

  2. Next to the name of the SQL endpoint, click Stop.

  3. When prompted, click Stop again.

Stop the cluster

If you used a notebook for this tutorial, and you are not using the cluster that was associated with the notebook for any other tasks, you should stop the cluster to avoid additional costs.

  1. In the Data Science & Engineering or Machine Learning persona, on the sidebar, click Compute.

  2. Click the cluster’s name.

  3. Click Terminate.

  4. Click Confirm.

Additional resources