Load data using COPY INTO with an instance profile

This article describes how to use the COPY INTO command to load data from an Amazon S3 bucket in your AWS account into a table in Databricks SQL.

The steps in this article assume that your admin has configured a SQL warehouse to use an AWS instance profile so that you can access your source files in S3. If your admin configured a Unity Catalog external location with a storage credential, see Load data using COPY INTO with Unity Catalog volumes or external locations instead. If your admin gave you temporary credentials (an AWS access key ID, a secret key, and a session token), see Load data using COPY INTO with temporary credentials instead.

Databricks recommends using the COPY INTO command for incremental and bulk data loading with Databricks SQL.

Note

COPY INTO works well for data sources that contain thousands of files. Databricks recommends that you use Auto Loader for loading millions of files, which is not supported in Databricks SQL.

Before you begin

Before you load data into Databricks, make sure you have the following:

  • Access to data in S3. Your admin must first complete the steps in Configure data access for ingestion so your Databricks SQL warehouse can read your source files.

  • A Databricks SQL warehouse that uses the instance profile that your admin created.

  • The Can manage permission on the SQL warehouse.

  • The fully qualified S3 URI.

  • Familiarity with the Databricks SQL user interface.

Step 1: Confirm access to data in cloud storage

To confirm that you have access to the correct data in cloud object storage, do the following:

  1. In the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select a SQL warehouse.

  3. In the SQL editor, paste the following code:

    select * from csv.<path>
    

    Replace <path> with the S3 URI that you received from your admin. For example, s3://<bucket>/<folder>/.

  4. Click Run.

Step 2: Create a table

This step describes how to create a table in your Databricks workspace to hold the incoming data.

  1. In the SQL editor, paste the following code:

    CREATE TABLE <catalog_name>.<schema_name>.<table_name> (
      tpep_pickup_datetime  TIMESTAMP,
      tpep_dropoff_datetime TIMESTAMP,
      trip_distance DOUBLE,
      fare_amount DOUBLE,
      pickup_zip INT,
      dropoff_zip INT
    );
    
  2. Click Run.

Step 3: Load data from cloud storage into the table

This step describes how to load data from an S3 bucket into the table you created in your Databricks workspace.

  1. In the sidebar, click Create > Query.

  2. In the SQL editor’s menu bar, select a SQL warehouse and make sure the SQL warehouse is running.

  3. In the SQL editor, paste the following code. In this code, replace:

    • <s3-bucket> with the name of your S3 bucket.

    • <folder> with the name of the folder in your S3 bucket.

    COPY INTO <catalog-name>.<schema-name>.<table-name>
    FROM 's3://<s3-bucket>/<folder>/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS (
      'header' = 'true',
      'inferSchema' = 'true'
    )
    COPY_OPTIONS (
       'mergeSchema' = 'true'
    );
    
    SELECT * FROM <catalog_name>.<schema_name>.<table_name>;
    

    Note

    FORMAT_OPTIONS differs depending on FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema options instructs Databricks to automatically determine the data type of each field in the CSV file.

  4. Click Run.

    Note

    If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data.

Clean up

You can clean up the associated resources in your workspace if you no longer want to keep them.

Delete the tables

  1. In the sidebar, click Create > Query.

  2. Select a SQL warehouse and make sure that the SQL warehouse is running.

  3. Paste the following code:

    DROP TABLE <catalog-name>.<schema-name>.<table-name>;
    
  4. Click Run.

  5. Hover over the tab for this query, and then click the X icon.

Delete the queries in the SQL editor

  1. In the sidebar, click SQL Editor.

  2. In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then click the X icon.

Additional resources