Onboard data from Amazon S3

This article describes how to onboard data to a new Databricks workspace from Amazon S3. You’ll learn how to securely access source data in a cloud object storage location that corresponds with a Unity Catalog volume (recommended) or a Unity Catalog external location. Then, you’ll learn how to ingest the data incrementally into a Unity Catalog managed table using Auto Loader with Delta Live Tables.

Note

To onboard data in Databricks SQL instead of in a notebook, see Load data using streaming tables in Databricks SQL.

Before you begin

If you’re not an admin, this article assumes that an admin has provided you with the following:

  • Access to a Databricks workspace with Unity Catalog enabled. For more information, see Set up and manage Unity Catalog.

  • The READ FILES permission on the Unity Catalog external volume or the Unity Catalog external location that corresponds with the cloud storage location that contains your source data. For more information, see Create an external location to connect cloud storage to Databricks.

  • The path to your source data.

    Volume path example: /Volumes/<catalog>/<schema>/<volume>/<path>/<folder>

    External location path example: s3://<bucket>/<folder>/

  • The USE SCHEMA and CREATE TABLE privileges on the schema you want to load data into.

  • Cluster creation permission or access to a cluster policy that defines a Delta Live Tables pipeline cluster (cluster_type field set to dlt).

    If the path to your source data is a volume path, your cluster must run Databricks Runtime 13.2 or above.

Important

If you have questions about these prerequisites, contact your account admin.
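If you're the admin providing these prerequisites, the permissions above map to Unity Catalog GRANT statements. A minimal sketch that builds those statements as strings; the principal, external location, catalog, and schema names are placeholders, and in a workspace you would run each statement with spark.sql() or in the SQL editor:

```python
def prerequisite_grants(principal, external_location, catalog, schema):
    """Build the Unity Catalog GRANT statements that match the
    prerequisites above: READ FILES on the external location, plus
    USE SCHEMA and CREATE TABLE on the target schema."""
    fq_schema = f"{catalog}.{schema}"
    return [
        f"GRANT READ FILES ON EXTERNAL LOCATION `{external_location}` TO `{principal}`",
        f"GRANT USE CATALOG ON CATALOG `{catalog}` TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA `{fq_schema}` TO `{principal}`",
        f"GRANT CREATE TABLE ON SCHEMA `{fq_schema}` TO `{principal}`",
    ]

for stmt in prerequisite_grants("data_engineers", "my_ext_loc", "main", "ingest"):
    print(stmt)  # run each via spark.sql(stmt) in a workspace
```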

Step 1: Create a cluster

To create a cluster, do the following:

  1. Sign in to your Databricks workspace.

  2. In the sidebar, click New > Cluster.

  3. In the clusters UI, specify a unique name for your cluster.

  4. If the path to your source data is a volume path, for Databricks Runtime version, select 13.2 or above.

  5. Click Create cluster.
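The UI steps above can also be scripted against the Clusters API (POST /api/2.0/clusters/create). The sketch below only builds the request body; the runtime version string, instance type, and security mode shown are illustrative defaults that you should adjust for your workspace:

```python
import json

def cluster_create_payload(name, spark_version="13.2.x-scala2.12",
                           node_type_id="i3.xlarge", num_workers=1):
    """Build the JSON body for POST /api/2.0/clusters/create.

    Runtime 13.2 or above is required when the source path is a
    Unity Catalog volume path.
    """
    return {
        "cluster_name": name,
        "spark_version": spark_version,       # Databricks Runtime version string
        "node_type_id": node_type_id,         # AWS instance type; adjust as needed
        "num_workers": num_workers,
        "data_security_mode": "SINGLE_USER",  # enables Unity Catalog access
    }

print(json.dumps(cluster_create_payload("s3-onboarding-cluster"), indent=2))
```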

Step 2: Create a data exploration notebook

This section describes how to create a data exploration notebook so you can understand your data before you create your data pipeline.

  1. In the sidebar, click New > Notebook.

    The notebook is automatically attached to the last cluster you used (in this case, the cluster you created in Step 1: Create a cluster).

  2. Enter a name for the notebook.

  3. Click the language button, and then select Python or SQL from the dropdown menu. Python is selected by default.

  4. To confirm that you can access your source data in S3, paste the following code into a notebook cell, and then click Run Cell. Use the SQL or Python version that matches the notebook language you selected.

    SQL:

    LIST '<path-to-source-data>'

    Python:

    %fs ls '<path-to-source-data>'
    

    Replace <path-to-source-data> with the path to the directory that contains your data.

    This displays the contents of the directory that contains the dataset.

  5. To view a sample of the records so you can better understand the contents and format of each record, paste the following into a notebook cell, and then click Run Cell.

    SQL:

    SELECT * FROM read_files('<path-to-source-data>', format => '<file-format>') LIMIT 10

    Python:

    spark.read.format('<file-format>').load('<path-to-source-data>').limit(10).display()
    

    Replace the following values:

    • <file-format>: A supported file format. See File format options.

    • <path-to-source-data>: The path to a file in the directory that contains your data.

    This displays the first ten records from the specified file.

Step 3: Ingest raw data

To ingest raw data, do the following:

  1. In the sidebar, click New > Notebook.

    The notebook is automatically attached to the last cluster you used (in this case, the cluster you created earlier in this article).

  2. Enter a name for the notebook.

  3. Click the language button, and then select Python or SQL from the dropdown menu. Python is selected by default.

  4. Paste the following code into a notebook cell. Use the SQL or Python version that matches the notebook language you selected.

    SQL:

    CREATE OR REFRESH STREAMING TABLE
      <table-name>
    AS SELECT
      *
    FROM
      STREAM read_files(
        '<path-to-source-data>',
        format => '<file-format>'
      )

    Python:

    import dlt

    @dlt.table(table_properties={'quality': 'bronze'})
    def <table-name>():
      return (
        spark.readStream.format('cloudFiles')
        .option('cloudFiles.format', '<file-format>')
        .load('<path-to-source-data>')
      )
    

    Replace the following values:

    • <table-name>: A name for the table that will contain the ingested records.

    • <path-to-source-data>: The path to your source data.

    • <file-format>: A supported file format. See File format options.

Note

Delta Live Tables isn’t designed to run interactively in notebook cells. Running a cell that contains Delta Live Tables syntax in a notebook returns a message about whether the query is syntactically valid, but does not run query logic. The following step describes how to create a pipeline from the ingestion notebook you just created.

Step 4: Create and publish a pipeline

To create a pipeline and publish it to Unity Catalog, do the following:

  1. In the sidebar, click Workflows, click the Delta Live Tables tab, and then click Create pipeline.

  2. Enter a name for your pipeline.

  3. For Pipeline mode, select Triggered.

  4. For Source code, select the notebook that contains your pipeline source code.

  5. For Destination, select Unity Catalog.

  6. To ensure that your table is managed by Unity Catalog and any user with access to the parent schema can query it, select a Catalog and a Target schema from the drop-down lists.

  7. If you don’t have cluster creation permission, select a Cluster policy that supports Delta Live Tables from the drop-down list.

  8. For Advanced, set the Channel to Preview.

  9. Accept all other default values and click Create.
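The same pipeline can also be created programmatically through the Delta Live Tables Pipelines API (POST /api/2.0/pipelines). A minimal sketch of the request body, mirroring the UI choices above (triggered mode, Unity Catalog destination, Preview channel); the pipeline name, notebook path, catalog, and schema are placeholders:

```python
def pipeline_create_payload(name, notebook_path, catalog, target_schema):
    """Build the JSON body for POST /api/2.0/pipelines.

    continuous=False corresponds to the "Triggered" pipeline mode,
    and catalog/target publish the table to Unity Catalog.
    """
    return {
        "name": name,
        "catalog": catalog,          # Unity Catalog destination catalog
        "target": target_schema,     # target schema within that catalog
        "continuous": False,         # Triggered pipeline mode
        "channel": "PREVIEW",
        "libraries": [{"notebook": {"path": notebook_path}}],
    }
```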

Step 5: Schedule the pipeline

To schedule the pipeline, do the following:

  1. In the sidebar, click Delta Live Tables.

  2. Click the name of the pipeline you want to schedule.

  3. Click Schedule > Add a schedule.

  4. For Job name, enter a name for the job.

  5. Set the Schedule to Scheduled.

  6. Specify the period, starting time, and time zone.

  7. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.

  8. Click Create.
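The schedule above can also be expressed as a job definition for the Jobs API (POST /api/2.1/jobs/create), where the period and starting time become a Quartz cron expression. A minimal sketch; the cron expression, task key, and email addresses are placeholders:

```python
def scheduled_pipeline_job(job_name, pipeline_id, cron="0 0 6 * * ?",
                           timezone="UTC", alert_emails=None):
    """Build the JSON body for POST /api/2.1/jobs/create that refreshes
    a Delta Live Tables pipeline on a Quartz cron schedule, with email
    alerts on start, success, and failure."""
    emails = alert_emails or []
    return {
        "name": job_name,
        "schedule": {
            "quartz_cron_expression": cron,  # daily at 06:00 in this sketch
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        },
        "email_notifications": {
            "on_start": emails,
            "on_success": emails,
            "on_failure": emails,
        },
        "tasks": [{
            "task_key": "refresh_pipeline",
            "pipeline_task": {"pipeline_id": pipeline_id},
        }],
    }
```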

Next steps