Using Auto Loader with Unity Catalog

Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Manage external locations and storage credentials. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.

Warning

You must launch your cluster with single user access mode to run Auto Loader with Unity Catalog.
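As a hedged illustration, single user access mode can be requested when creating a cluster through the Clusters API via the `data_security_mode` field. The cluster name, node type, and runtime version below are placeholder assumptions, not values from this article:

```python
# Sketch of a Clusters API payload requesting single user access mode.
# Cluster name, node type, and runtime version are placeholder assumptions.
cluster_spec = {
    "cluster_name": "auto-loader-uc",        # hypothetical name
    "spark_version": "13.3.x-scala2.12",     # any Unity Catalog-capable runtime
    "node_type_id": "i3.xlarge",             # placeholder node type
    "num_workers": 1,
    "data_security_mode": "SINGLE_USER",     # required for Auto Loader with Unity Catalog
    "single_user_name": "user@example.com",  # the single user allowed on the cluster
}
```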

Ingesting data from external locations managed by Unity Catalog with Auto Loader

You can use Auto Loader to ingest data from any external location managed by Unity Catalog. You must have READ FILES permissions on the external location.
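A sketch of granting that permission with Unity Catalog SQL; the external location name and principal below are hypothetical, and on a cluster the statement would be executed with `spark.sql(...)`:

```python
# Hypothetical external location and principal names.
external_location = "autoloader_source"
principal = "data_engineers"

# Unity Catalog SQL grant for read access to files at the external location.
grant_stmt = (
    f"GRANT READ FILES ON EXTERNAL LOCATION {external_location} "
    f"TO `{principal}`"
)
# On a Unity Catalog-enabled cluster: spark.sql(grant_stmt)
```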

Note

Unity Catalog external locations do not support cross-cloud or cross-account configurations for Auto Loader.

Directory listing mode is supported by default. To use file notification mode, you must configure additional cloud credentials to connect to the file notification and queue services; see Choosing between file notification and directory listing modes.
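As a sketch, file notification mode is selected with the `cloudFiles.useNotifications` option; the source path and format below are placeholders, and the stream assumes the cluster's credentials can create or read the cloud notification and queue resources:

```python
# Placeholder options; assumes credentials for the notification/queue services
# have been configured as described above.
notification_options = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",  # switch from directory listing to file notifications
}

# On a cluster (paths are hypothetical):
# (spark.readStream.format("cloudFiles")
#   .options(**notification_options)
#   .option("cloudFiles.schemaLocation", "s3://dev-bucket/_checkpoint/notify_demo")
#   .load("s3://autoloader-source/json-data"))
```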

Specifying locations for Auto Loader resources for Unity Catalog

The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
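A minimal sketch of this recommendation, using hypothetical paths: keep the checkpoint and schema location outside the table directory, for example as a sibling prefix within the same Unity Catalog-managed storage location:

```python
# Hypothetical paths within a Unity Catalog-managed storage location.
table_path = "s3://dev-bucket/tables/dev_table"
checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"  # sibling of the table directory

# Nesting the checkpoint under the table directory is not allowed by Unity Catalog:
nested_checkpoint = table_path + "/_checkpoint"  # would be rejected

assert not checkpoint_path.startswith(table_path + "/")
assert nested_checkpoint.startswith(table_path + "/")
```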

Examples

The following examples assume the executing user has owner privileges on the target tables and has the following grants on the listed storage locations:

| Storage location | Grant |
| --- | --- |
| s3://autoloader-source/json-data | READ FILES |
| s3://dev-bucket | READ FILES, WRITE FILES, CREATE TABLE |

Using Auto Loader to load to a Unity Catalog managed table

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  # Store schema inference and evolution information with the checkpoint
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load("s3://autoloader-source/json-data")
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  # Process all currently available files, then stop
  .trigger(availableNow=True)
  .toTable("dev_catalog.dev_database.dev_table"))