Onboard data from Amazon S3

This article describes how to onboard data to a new Databricks workspace from Amazon S3. You’ll learn how to securely access source data in a cloud object storage location that corresponds with a Unity Catalog volume (recommended) or a Unity Catalog external location. Then, you’ll learn how to ingest the data incrementally into a Unity Catalog managed table using Auto Loader with Delta Live Tables.

Note

To onboard data in Databricks SQL instead of in a notebook, see Load data using streaming tables in Databricks SQL.

Before you begin

If you’re not an admin, this article assumes that an admin has provided you with the following:

  • Access to a Databricks workspace with Unity Catalog enabled. For more information, see Set up and manage Unity Catalog.

  • The READ FILES permission on the Unity Catalog external volume or the Unity Catalog external location that corresponds with the cloud storage location that contains your source data. For more information, see Create an external location to connect cloud storage to Databricks.

  • The path to your source data.

    Volume path example: /Volumes/<catalog>/<schema>/<volume>/<path>/<folder>

    External location path example: s3://<bucket>/<folder>/

  • The USE SCHEMA and CREATE TABLE privileges on the schema you want to load data into.

  • Cluster creation permission or access to a cluster policy that defines a Delta Live Tables pipeline cluster (cluster_type field set to dlt).

    If the path to your source data is a volume path, your cluster must run Databricks Runtime 13.2 or above.

Important

If you have questions about these prerequisites, contact your account admin.
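If you're the admin providing these prerequisites, the permissions above map to Unity Catalog GRANT statements. A minimal sketch that builds those statements as strings; the principal, external location, catalog, and schema names are placeholders, and in a workspace you would run each statement with spark.sql() or in the SQL editor:

```python
def prerequisite_grants(principal, external_location, catalog, schema):
    """Build the Unity Catalog GRANT statements that match the
    prerequisites above: READ FILES on the external location, plus
    USE SCHEMA and CREATE TABLE on the target schema."""
    fq_schema = f"{catalog}.{schema}"
    return [
        f"GRANT READ FILES ON EXTERNAL LOCATION `{external_location}` TO `{principal}`",
        f"GRANT USE CATALOG ON CATALOG `{catalog}` TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA `{fq_schema}` TO `{principal}`",
        f"GRANT CREATE TABLE ON SCHEMA `{fq_schema}` TO `{principal}`",
    ]

for stmt in prerequisite_grants("data_engineers", "my_ext_loc", "main", "ingest"):
    print(stmt)  # run each via spark.sql(stmt) in a workspace
```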

Step 1: Create a cluster

To create a cluster, do the following:

  1. Sign in to your Databricks workspace.

  2. In the sidebar, click New > Cluster.

  3. In the clusters UI, specify a unique name for your cluster.

  4. If the path to your source data is a volume path, for Databricks Runtime version, select 13.2 or above.

  5. Click Create cluster.
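The UI steps above can also be scripted against the Clusters API (POST /api/2.0/clusters/create). The sketch below only builds the request body; the runtime version string, instance type, and security mode shown are illustrative defaults that you should adjust for your workspace:

```python
import json

def cluster_create_payload(name, spark_version="13.2.x-scala2.12",
                           node_type_id="i3.xlarge", num_workers=1):
    """Build the JSON body for POST /api/2.0/clusters/create.

    Runtime 13.2 or above is required when the source path is a
    Unity Catalog volume path.
    """
    return {
        "cluster_name": name,
        "spark_version": spark_version,       # Databricks Runtime version string
        "node_type_id": node_type_id,         # AWS instance type; adjust as needed
        "num_workers": num_workers,
        "data_security_mode": "SINGLE_USER",  # enables Unity Catalog access
    }

print(json.dumps(cluster_create_payload("s3-onboarding-cluster"), indent=2))
```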

Step 2: Create a data exploration notebook

This section describes how to create a data exploration notebook so you can understand your data before you create your data pipeline.

  1. In the sidebar, click New > Notebook.

    The notebook is automatically attached to the last cluster you used (in this case, the cluster you created in Step 1: Create a cluster).

  2. Enter a name for the notebook.

  3. Click the language button, and then select Python or SQL from the dropdown menu. Python is selected by default.

  4. To confirm that you can access your source data in S3, paste the following code into a notebook cell, and then click Run Cell. Use the SQL or Python version that matches the notebook language you selected.

    SQL:

    LIST '<path-to-source-data>'

    Python:

    %fs ls '<path-to-source-data>'
    

    Replace <path-to-source-data> with the path to the directory that contains your data.

    This displays the contents of the directory that contains the dataset.

  5. To view a sample of the records so you can better understand the contents and format of each record, paste the following into a notebook cell, and then click Run Cell.

    SQL:

    SELECT * FROM read_files('<path-to-source-data>', format => '<file-format>') LIMIT 10

    Python:

    spark.read.format('<file-format>').load('<path-to-source-data>').limit(10).display()
    

    Replace the following values:

    • <file-format>: A supported file format. See File format options.

    • <path-to-source-data>: The path to a file in the directory that contains your data.

    This displays the first ten records from the specified file.

Step 3: Ingest raw data

To ingest raw data, do the following:

  1. In the sidebar, click New > Notebook.

    The notebook is automatically attached to the last cluster you used (in this case, the cluster you created earlier in this article).

  2. Enter a name for the notebook.

  3. Click the language button, and then select Python or SQL from the dropdown menu. Python is selected by default.

  4. Paste the following code into a notebook cell. Use the SQL or Python version that matches the notebook language you selected.

    SQL:

    CREATE OR REFRESH STREAMING TABLE
      <table-name>
    AS SELECT
      *
    FROM
      STREAM read_files(
        '<path-to-source-data>',
        format => '<file-format>'
      )

    Python:

    import dlt

    @dlt.table(table_properties={'quality': 'bronze'})
    def <table-name>():
      return (
        spark.readStream.format('cloudFiles')
        .option('cloudFiles.format', '<file-format>')
        .load('<path-to-source-data>')
      )
    

    Replace the following values:

    • <table-name>: A name for the table that will contain the ingested records.

    • <path-to-source-data>: The path to your source data.

    • <file-format>: A supported file format. See File format options.

Note

Delta Live Tables isn’t designed to run interactively in notebook cells. Running a cell that contains Delta Live Tables syntax in a notebook returns a message about whether the query is syntactically valid, but does not run query logic. The following step describes how to create a pipeline from the ingestion notebook you just created.

Step 4: Create and publish a pipeline

To create a pipeline and publish it to Unity Catalog, do the following:

  1. In the sidebar, click Workflows, click the Delta Live Tables tab, and then click Create pipeline.

  2. Enter a name for your pipeline.

  3. For Pipeline mode, select Triggered.

  4. For Source code, select the notebook that contains your pipeline source code.

  5. For Destination, select Unity Catalog.

  6. To ensure that your table is managed by Unity Catalog and any user with access to the parent schema can query it, select a Catalog and a Target schema from the drop-down lists.

  7. If you don’t have cluster creation permission, select a Cluster policy that supports Delta Live Tables from the drop-down list.

  8. For Advanced, set the Channel to Preview.

  9. Accept all other default values and click Create.
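The same pipeline can also be created programmatically through the Delta Live Tables Pipelines API (POST /api/2.0/pipelines). A minimal sketch of the request body, mirroring the UI choices above (triggered mode, Unity Catalog destination, Preview channel); the pipeline name, notebook path, catalog, and schema are placeholders:

```python
def pipeline_create_payload(name, notebook_path, catalog, target_schema):
    """Build the JSON body for POST /api/2.0/pipelines.

    continuous=False corresponds to the "Triggered" pipeline mode,
    and catalog/target publish the table to Unity Catalog.
    """
    return {
        "name": name,
        "catalog": catalog,          # Unity Catalog destination catalog
        "target": target_schema,     # target schema within that catalog
        "continuous": False,         # Triggered pipeline mode
        "channel": "PREVIEW",
        "libraries": [{"notebook": {"path": notebook_path}}],
    }
```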

Step 5: Schedule the pipeline

To schedule the pipeline, do the following:

  1. In the sidebar, click Delta Live Tables.

  2. Click the name of the pipeline you want to schedule.

  3. Click Schedule > Add a schedule.

  4. For Job name, enter a name for the job.

  5. Set the Schedule to Scheduled.

  6. Specify the period, starting time, and time zone.

  7. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.

  8. Click Create.
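The schedule above can also be expressed as a job definition for the Jobs API (POST /api/2.1/jobs/create), where the period and starting time become a Quartz cron expression. A minimal sketch; the cron expression, task key, and email addresses are placeholders:

```python
def scheduled_pipeline_job(job_name, pipeline_id, cron="0 0 6 * * ?",
                           timezone="UTC", alert_emails=None):
    """Build the JSON body for POST /api/2.1/jobs/create that refreshes
    a Delta Live Tables pipeline on a Quartz cron schedule, with email
    alerts on start, success, and failure."""
    emails = alert_emails or []
    return {
        "name": job_name,
        "schedule": {
            "quartz_cron_expression": cron,  # daily at 06:00 in this sketch
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        },
        "email_notifications": {
            "on_start": emails,
            "on_success": emails,
            "on_failure": emails,
        },
        "tasks": [{
            "task_key": "refresh_pipeline",
            "pipeline_task": {"pipeline_id": pipeline_id},
        }],
    }
```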

Next steps