Using Auto Loader with Unity Catalog

Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations, see Using Unity Catalog with Structured Streaming.

Note

In Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either standard or dedicated access modes (formerly shared and single-user access modes).

Directory listing mode is supported by default. File notification mode is only supported on compute with dedicated access mode.

Specify locations for Auto Loader resources for Unity Catalog

The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
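The nesting restriction above can be checked before a stream starts. The following is a minimal sketch, not part of Auto Loader or Unity Catalog: `is_nested_under` is a hypothetical helper, and the table directory `gs://dev-bucket/tables/dev_table` is an assumed path used only for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_nested_under(candidate: str, parent: str) -> bool:
    """Return True if `candidate` is the same location as `parent` or lies inside it."""
    c, p = urlparse(candidate), urlparse(parent)
    # Different scheme or bucket means the paths cannot be nested.
    if (c.scheme, c.netloc) != (p.scheme, p.netloc):
        return False
    c_path = PurePosixPath("/" + c.path.strip("/"))
    p_path = PurePosixPath("/" + p.path.strip("/"))
    return c_path == p_path or p_path in c_path.parents


# Hypothetical table directory, assumed for illustration only.
table_dir = "gs://dev-bucket/tables/dev_table"

# A checkpoint path outside the table directory passes the check.
good_checkpoint = "gs://dev-bucket/_checkpoint/dev_table"
# A checkpoint path nested under the table directory should be rejected.
bad_checkpoint = "gs://dev-bucket/tables/dev_table/_checkpoint"

print(is_nested_under(good_checkpoint, table_dir))
print(is_nested_under(bad_checkpoint, table_dir))
```

The comparison works on whole path segments, so a sibling directory such as `dev_table_2` is not mistaken for a child of `dev_table`.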

Ingest data from cloud storage using Unity Catalog

The following examples assume the executing user has READ FILES permissions on the external location, owner privileges on the target tables, and the following configurations and grants.

Storage location                    Grant
gs://autoloader-source/json-data    READ FILES
gs://dev-bucket                     READ FILES, WRITE FILES, CREATE TABLE
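If the file grants in the table are not yet in place, an administrator could issue them with SQL `GRANT` statements. The sketch below only builds the statement text and covers only the READ FILES and WRITE FILES privileges. The external location names (`autoloader_source`, `dev_bucket`) and the principal (`data-engineers`) are hypothetical; substitute the names registered in your metastore.

```python
from typing import List


def grant_statement(privileges: List[str], location_name: str, principal: str) -> str:
    """Build a GRANT statement for a Unity Catalog external location."""
    privs = ", ".join(privileges)
    return f"GRANT {privs} ON EXTERNAL LOCATION `{location_name}` TO `{principal}`"


# Hypothetical external location names and principal, assumed for illustration.
STATEMENTS = [
    grant_statement(["READ FILES"], "autoloader_source", "data-engineers"),
    grant_statement(["READ FILES", "WRITE FILES"], "dev_bucket", "data-engineers"),
]

for stmt in STATEMENTS:
    print(stmt)

# In a Databricks notebook, each statement could then be run with:
#   spark.sql(stmt)
```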

Use Auto Loader to load to a Unity Catalog managed table

The following example demonstrates how to use Auto Loader to ingest data to a Unity Catalog managed table.

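Python
# Checkpoint and schema information share one Unity Catalog-managed path
# that sits outside the table directory.
checkpoint_path = "gs://dev-bucket/_checkpoint/dev_table"

# `spark` is the SparkSession that Databricks notebooks and jobs provide.
(spark.readStream
    .format("cloudFiles")                                  # Auto Loader source
    .option("cloudFiles.format", "json")                   # source files are JSON
    .option("cloudFiles.schemaLocation", checkpoint_path)  # inferred schema and its evolution
    .load("gs://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)         # streaming progress
    .trigger(availableNow=True)                            # process available files, then stop
    .toTable("dev_catalog.dev_database.dev_table"))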

Authentication for GCS is handled through the Unity Catalog external location configured for the GCS path. You must have READ FILES privileges on the external location. See Connect to a Google Cloud Storage (GCS) external location.

To use file notification mode for faster file discovery, provide a service credential or Google Service Account credentials. See Google-specific options.
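As a rough sketch of how file notification mode might be switched on, the helper below assembles the reader options as a plain dictionary. `notification_options` and `build_reader` are hypothetical helpers, `my-gcp-project` and `my_service_credential` are placeholder values, and the option names `cloudFiles.projectId` and `databricks.serviceCredential` should be confirmed against Google-specific options for your runtime. As noted earlier, file notification mode requires compute with dedicated access mode.

```python
from typing import Dict, Optional


def notification_options(
    file_format: str = "json",
    project_id: Optional[str] = None,
    service_credential: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble Auto Loader options that enable file notification mode."""
    opts = {
        "cloudFiles.format": file_format,
        "cloudFiles.useNotifications": "true",  # switch from directory listing
    }
    if project_id is not None:
        opts["cloudFiles.projectId"] = project_id
    if service_credential is not None:
        # Assumed option name; verify it in Google-specific options.
        opts["databricks.serviceCredential"] = service_credential
    return opts


def build_reader(spark, options: Dict[str, str], source_path: str):
    """Apply the options to an Auto Loader stream; needs a live SparkSession."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Placeholder project ID and credential name, assumed for illustration.
opts = notification_options(
    project_id="my-gcp-project",
    service_credential="my_service_credential",
)
print(opts)
```

On Databricks, `build_reader(spark, opts, "gs://autoloader-source/json-data")` would return a streaming DataFrame that can be written with the same `writeStream` chain as the managed table example.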