
Using Auto Loader with Unity Catalog

Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.

note

In Databricks Runtime 11.3 LTS and above, you can use Auto Loader on compute with either standard or dedicated access mode (formerly shared and single-user access modes).

Auto Loader uses directory listing mode by default. File notification mode is supported only on compute with dedicated access mode.
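File notification mode is opted into by setting the cloudFiles.useNotifications reader option to true; when the option is omitted, Auto Loader falls back to directory listing mode. A minimal sketch of the two option sets (the variable names are illustrative, not part of any API):

```python
# Default: directory listing mode -- no extra option needed.
directory_listing_opts = {
    "cloudFiles.format": "json",
}

# File notification mode -- requires compute with dedicated access mode.
file_notification_opts = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
}
```

Either dictionary can be passed to the cloudFiles reader via .options(**opts).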

Specify locations for Auto Loader resources for Unity Catalog

The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
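The nesting rule above can be checked with plain path logic. The sketch below is illustrative only (is_nested_under and the example paths are hypothetical, not a Databricks API); it shows that a sibling checkpoint directory is acceptable while one under the table directory is not:

```python
# Illustrative check: a checkpoint path must not fall inside the
# table's storage directory, which Unity Catalog forbids.
def is_nested_under(path: str, directory: str) -> bool:
    """Return True if `path` lies inside `directory`."""
    return path.rstrip("/").startswith(directory.rstrip("/") + "/")

table_dir = "s3://dev-bucket/tables/dev_table"        # hypothetical table location
checkpoint = "s3://dev-bucket/_checkpoint/dev_table"  # sibling directory: allowed
assert not is_nested_under(checkpoint, table_dir)

bad_checkpoint = "s3://dev-bucket/tables/dev_table/_checkpoint"  # nested: rejected
assert is_nested_under(bad_checkpoint, table_dir)
```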

Ingest data from cloud storage using Unity Catalog

The following examples assume the executing user has owner privileges on the target tables and the following grants on the configured external locations:

Storage location                      Grant
s3://autoloader-source/json-data      READ FILES
s3://dev-bucket                       READ FILES, WRITE FILES, CREATE TABLE

Use Auto Loader to load to a Unity Catalog managed table

The following examples demonstrate how to use Auto Loader to ingest data to a Unity Catalog managed table.

Python
checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"

(spark.readStream
  .format("cloudFiles")                                  # Auto Loader source
  .option("cloudFiles.format", "json")                   # format of the source files
  .option("cloudFiles.schemaLocation", checkpoint_path)  # schema inference and evolution files
  .load("s3://autoloader-source/json-data")              # external location with READ FILES
  .writeStream
  .option("checkpointLocation", checkpoint_path)         # checkpoint in Unity Catalog-managed storage
  .trigger(availableNow=True)                            # process all available files, then stop
  .toTable("dev_catalog.dev_database.dev_table"))        # three-level Unity Catalog table name
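The toTable() target uses Unity Catalog's three-level namespace: catalog.schema.table. As a quick illustrative sketch of how such a name decomposes (split_table_name is a hypothetical helper, not a Databricks API):

```python
# Split a three-level Unity Catalog table name into its parts.
def split_table_name(full_name: str) -> tuple[str, str, str]:
    catalog, schema, table = full_name.split(".")
    return catalog, schema, table

print(split_table_name("dev_catalog.dev_database.dev_table"))
# -> ('dev_catalog', 'dev_database', 'dev_table')
```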