Bulk load data into a table with COPY INTO in Databricks SQL

Databricks recommends using the COPY INTO command for incremental and bulk data loading with Databricks SQL.

Note

COPY INTO works well for data sources that contain thousands of files. For loading millions of files, Databricks recommends Auto Loader, which is not supported in Databricks SQL.

In this tutorial, you use the COPY INTO command to load data from an Amazon S3 bucket in your AWS account into a table in Databricks SQL.

Requirements

  1. A Databricks SQL warehouse. To create one, see Create a SQL warehouse.

  2. Familiarity with the Databricks SQL user interface. See the Databricks SQL user guide.

  3. An Amazon S3 bucket in your AWS account. To create a bucket, see Creating a bucket in the AWS documentation.

  4. Access permissions in your AWS account to create policies, users, and access keys for users with AWS Identity and Access Management (IAM).

  5. The AWS Command Line Interface (AWS CLI) installed on your local development machine. See Installing or updating the latest version of the AWS CLI on the AWS website.

  6. The ability to create AWS Security Token Service (AWS STS) session tokens for your AWS account.

Step 1: Prepare the sample data

The COPY INTO command loads data from a supported source into a table in your Databricks workspace. Supported file formats include CSV, JSON, Avro, ORC, Parquet, text, and binary files. The source can be any location that your Databricks workspace has access to.

This tutorial follows a scenario in which the data is in an Amazon S3 bucket:

  • To set things up, in this step you get a copy of some sample data from Databricks datasets. You then prepare that sample data to be stored in an existing S3 bucket in your AWS account.

  • In the next step, you upload this sample data to the S3 bucket.

  • In the third step, you set up access permissions for the COPY INTO command.

  • Finally, you run the COPY INTO command to load the data from the S3 bucket back into your Databricks workspace.

Normally, you would not export sample data from your Databricks workspace and re-import it. However, this tutorial follows this workflow so that you have sample data in cloud storage to use with the COPY INTO command.
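
Before you walk through the steps, it can help to see the general shape of the command. The following is a minimal sketch with placeholder names only; the full statement that you run in Step 5 adds credentials and format options:

    -- General shape of a COPY INTO statement; the names below are placeholders, not values used in this tutorial
    COPY INTO <schema>.<target-table>
    FROM '<path-to-source-files>'
    FILEFORMAT = CSV;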

To prepare the sample data, you can use the Databricks SQL editor.

  1. In the SQL persona, on the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select the SQL warehouse that you created in the Requirements section, or select another available SQL warehouse that you want to use.

  3. In the SQL editor, paste the following code:

    SELECT * FROM samples.nyctaxi.trips
    
  4. Click Run.

  5. At the bottom of the editor, click the ellipsis icon, and then click Download as CSV file.

    Note

    This dataset contains almost 22,000 rows of data. By default, the SQL editor downloads only the first 1,000 rows. To download all of the rows, clear the LIMIT 1000 box and then repeat steps 4-5.
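
If you prefer to control the number of rows in the query itself rather than with the editor's LIMIT 1000 setting, a minimal variation of the query in step 3 is:

    -- Return only a sample of the trips data; adjust the limit as needed
    SELECT * FROM samples.nyctaxi.trips LIMIT 1000;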

Step 2: Upload the sample data to cloud storage

In this step, you upload the sample data from your local development machine into a folder inside of an S3 bucket in your AWS account.

  1. Sign in to the AWS Management Console, and open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3.

  2. Browse to and click on your existing S3 bucket. This tutorial uses a bucket named nyctaxi-sample-data.

  3. Click Create folder.

  4. Enter a name for the folder and click Create folder. This tutorial uses a folder named nyctaxi.

  5. Click the nyctaxi folder.

  6. Click Upload.

  7. Follow the on-screen instructions to upload the CSV file from the previous step into this folder.

Step 3: Create resources in your cloud account to access cloud storage

In this step, in your AWS account you create an IAM user that has just enough access to read the CSV file that you uploaded to S3.

  1. Create the policy for the user: open the AWS IAM console in your AWS account, typically at https://console.aws.amazon.com/iam.

  2. Click Policies.

  3. Click Create Policy.

  4. Click the JSON tab.

  5. Replace the existing JSON code with the following code. In the code, replace:

    • nyctaxi-sample-data with the name of your S3 bucket.

    • nyctaxi with the name of the folder within your S3 bucket.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ReadOnlyAccessToTrips",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::nyctaxi-sample-data",
            "arn:aws:s3:::nyctaxi-sample-data/nyctaxi/*"
          ]
        }
      ]
    }
    
  6. Click Next: Tags.

  7. Click Next: Review.

  8. Enter a name for the policy and click Create policy. This tutorial uses an example policy named nyctaxi-s3-ingest-policy.

  9. Create the user: in the sidebar, click Users.

  10. Click Add users.

  11. Enter a name for the user. This tutorial uses an example user named nyctaxi-s3-ingest-user.

  12. Select the Access key - Programmatic access box, and then click Next: Permissions.

  13. Click Attach existing policies directly.

  14. Select the box next to the nyctaxi-s3-ingest-policy policy, and then click Next: Tags.

  15. Click Next: Review.

  16. Click Create user.

  17. Copy the Access key ID and Secret access key values to a secure location; you will need them to get the AWS STS session token.

  18. Get the AWS STS session token: on your local development machine, use the AWS CLI to create a named profile with the AWS credentials that you just copied. See Named profiles for the AWS CLI on the AWS website. This tutorial uses an example profile named nyctaxi-s3-ingest-profile.

  19. Test your AWS credentials. To do this, use the AWS CLI to run the following command, which displays the contents of the nyctaxi folder. In the command, replace:

    • nyctaxi-sample-data with the name of your S3 bucket.

    • nyctaxi with the name of the folder within your S3 bucket.

    • nyctaxi-s3-ingest-profile with the name of your named profile.

    aws s3 ls s3://nyctaxi-sample-data/nyctaxi/ --profile nyctaxi-s3-ingest-profile
    
  20. To get the session token, run the following command. In the command, replace nyctaxi-s3-ingest-profile with the name of your named profile.

    aws sts get-session-token --profile nyctaxi-s3-ingest-profile
    
  21. Copy the AccessKeyId, SecretAccessKey, and SessionToken values to a secure location; you will need them in Step 5.

Step 4: Create the table

In this step, you create a table in your Databricks workspace to hold the incoming data.

  1. In the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select the SQL warehouse that you created in the Requirements section, or select another available SQL warehouse that you want to use.

  3. In the SQL editor, paste the following code:

    CREATE TABLE default.nyctaxi_trips (
      tpep_pickup_datetime  TIMESTAMP,
      tpep_dropoff_datetime TIMESTAMP,
      trip_distance DOUBLE,
      fare_amount DOUBLE,
      pickup_zip INT,
      dropoff_zip INT
    );
    
  4. Click Run.
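
To confirm that the empty table was created with the expected columns and data types, you can optionally run the following statement in the same editor tab:

    -- Show the columns and data types of the new table
    DESCRIBE TABLE default.nyctaxi_trips;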

Step 5: Load the sample data from cloud storage into the table

In this step, you load the CSV file from the S3 bucket into the table in your Databricks workspace.

  1. In the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select the SQL warehouse that you created in the Requirements section, or select another available SQL warehouse that you want to use.

  3. In the SQL editor, paste the following code. In this code, replace:

    • nyctaxi-sample-data with the name of your S3 bucket.

    • nyctaxi with the name of the folder in your S3 bucket.

    • <your-AccessKeyId> with the value of AccessKeyId from Step 3.

    • <your-SecretAccessKey> with the value of SecretAccessKey from Step 3.

    • <your-SessionToken> with the value of SessionToken from Step 3.

    COPY INTO default.nyctaxi_trips
    FROM 's3://nyctaxi-sample-data/nyctaxi/'
    WITH (
      CREDENTIAL (
        AWS_ACCESS_KEY = '<your-AccessKeyId>',
        AWS_SECRET_KEY = '<your-SecretAccessKey>',
        AWS_SESSION_TOKEN = '<your-SessionToken>'
      )
    )
    FILEFORMAT = CSV
    FORMAT_OPTIONS (
      'header' = 'true',
      'inferSchema' = 'true'
    );
    
    SELECT * FROM default.nyctaxi_trips;
    

    Note

    The available FORMAT_OPTIONS depend on FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema option instructs Databricks to automatically determine the data type of each field in the CSV file.

  4. Click Run.

    Note

    If you click Run again, no new data is loaded into the table. This is because the COPY INTO command skips files that it has already loaded.
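
If you do want to reload files that COPY INTO has already processed (for example, after truncating the table while experimenting), you can add the force copy option. The following sketch is the same COPY INTO statement with that option added, followed by a row-count check; the credential placeholders are the same ones you replaced earlier:

    -- Reload all matching files, even ones that were loaded before
    COPY INTO default.nyctaxi_trips
    FROM 's3://nyctaxi-sample-data/nyctaxi/'
    WITH (
      CREDENTIAL (
        AWS_ACCESS_KEY = '<your-AccessKeyId>',
        AWS_SECRET_KEY = '<your-SecretAccessKey>',
        AWS_SESSION_TOKEN = '<your-SessionToken>'
      )
    )
    FILEFORMAT = CSV
    FORMAT_OPTIONS (
      'header' = 'true',
      'inferSchema' = 'true'
    )
    COPY_OPTIONS (
      'force' = 'true'
    );

    -- Verify how many rows the table now contains
    SELECT COUNT(*) FROM default.nyctaxi_trips;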

Step 6: Clean up

When you are done with this tutorial, you can clean up the associated resources in your cloud account and Databricks if you no longer want to keep them.

Delete the AWS CLI named profile

In your ~/.aws/credentials file for Unix, Linux, and macOS, or in your %USERPROFILE%\.aws\credentials file for Windows, remove the following portion of the file, and then save the file:

[nyctaxi-s3-ingest-profile]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>

Delete the IAM user

  1. Open the IAM console in your AWS account, typically at https://console.aws.amazon.com/iam.

  2. In the sidebar, click Users.

  3. Select the box next to nyctaxi-s3-ingest-user, and then click Delete.

  4. Enter nyctaxi-s3-ingest-user, and then click Delete.

Delete the IAM policy

  1. Open the IAM console in your AWS account, if it is not already open, typically at https://console.aws.amazon.com/iam.

  2. In the sidebar, click Policies.

  3. Select the option next to nyctaxi-s3-ingest-policy, and then click Actions > Delete.

  4. Enter nyctaxi-s3-ingest-policy, and then click Delete.

Delete the S3 bucket

  1. Open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3.

  2. Select the option next to nyctaxi-sample-data, and then click Empty.

  3. Enter permanently delete, and then click Empty.

  4. In the sidebar, click Buckets.

  5. Select the option next to nyctaxi-sample-data, and then click Delete.

  6. Enter nyctaxi-sample-data, and then click Delete bucket.

Delete the tables

  1. In the sidebar, click Create > Query.

  2. Select the SQL warehouse that you created in the Requirements section, or select another available SQL warehouse that you want to use.

  3. Paste the following code:

    DROP TABLE default.nyctaxi_trips;
    
  4. Click Run.

  5. Hover over the tab for this query, and then click the X icon.
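
If you are not sure whether the table still exists (for example, if you already dropped it in an earlier attempt), a variation of the DROP TABLE statement above avoids an error when the table is missing:

    -- Drop the table only if it is still present
    DROP TABLE IF EXISTS default.nyctaxi_trips;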

Delete the queries in the SQL editor

  1. In your Databricks workspace, in the SQL persona, click SQL Editor in the sidebar.

  2. In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then click the X icon.

Stop the SQL warehouse

If you are not using the SQL warehouse for any other tasks, you should stop the SQL warehouse to avoid additional costs.

  1. In the SQL persona, on the sidebar, click SQL Warehouses.

  2. Next to the name of the SQL warehouse, click Stop.

  3. When prompted, click Stop again.

Additional resources