What is Auto Loader?

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup.

How does Auto Loader work?

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path in cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory. Auto Loader supports both Python and SQL in Lakeflow pipelines.
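As a minimal sketch of the cloudFiles source described above, the following Python builds the reader options and starts an incremental read. The helper names (autoloader_options, read_new_files), the paths, and the choice of JSON are illustrative assumptions, and spark is assumed to be an existing SparkSession on a Databricks runtime.

```python
def autoloader_options(file_format, schema_location, include_existing=True):
    """Build reader options for the cloudFiles source (hypothetical helper)."""
    return {
        # Format of the incoming files, for example "json", "csv", or "parquet".
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Whether files already present in the directory are also processed.
        "cloudFiles.includeExistingFiles": "true" if include_existing else "false",
    }


def read_new_files(spark, input_path, options):
    """Return a streaming DataFrame over files arriving under input_path."""
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(input_path)
    )


# Example call on Databricks (bucket and paths are placeholders):
# opts = autoloader_options("json", "s3://my-bucket/_schemas/orders/")
# df = read_new_files(spark, "s3://my-bucket/landing/orders/", opts)
```

Keeping the options in a plain dictionary makes it easy to review or reuse the same configuration across streams.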

You can use Auto Loader to process billions of files to migrate or backfill a table. Auto Loader scales to support near real-time ingestion of millions of files per hour.

Supported Auto Loader sources

Auto Loader can load data files from the following sources:

Amazon S3 (s3://)
Azure Data Lake Storage (ADLS, abfss://)
Google Cloud Storage (GCS, gs://)
Unity Catalog volumes (/Volumes/)
Azure Blob Storage (wasbs://)

Note: The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over WASB. See the Azure documentation on ABFS. For documentation on working with the legacy WASB driver, see Connect to Azure Blob Storage with WASB (legacy).

Auto Loader can ingest JSON, CSV, XML, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats. Auto Loader also supports reading pre-compressed files in these formats. For supported compression types by format, see Data format options.

How does Auto Loader track ingestion progress?

As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once.

If a failure occurs, Auto Loader resumes from where it left off using the information stored in the checkpoint location, and it continues to provide exactly-once guarantees when writing data into Delta Lake. You don't need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.
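The sketch below shows one way to wire a checkpoint location into the write side of a stream outside Lakeflow pipelines. It assumes df is a streaming DataFrame produced by the cloudFiles source; the helper names and the convention of one checkpoint directory per target table are illustrative assumptions, not requirements stated by the documentation.

```python
def checkpoint_path_for(checkpoint_root, stream_name):
    """Derive a stable checkpoint path so restarts reuse the same state.

    Hypothetical naming convention: one subdirectory per stream.
    """
    return f"{checkpoint_root.rstrip('/')}/{stream_name}"


def write_to_delta(df, table_name, checkpoint_root):
    """Start a streaming write of df into a Delta table.

    The checkpoint location is where ingestion progress is persisted,
    so a restarted stream continues from where the previous run stopped.
    """
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path_for(checkpoint_root, table_name))
        # Process everything currently available, then stop.
        .trigger(availableNow=True)
        .toTable(table_name)
    )


# Example call on Databricks (names are placeholders):
# query = write_to_delta(df, "main.etl.orders", "/Volumes/main/etl/checkpoints")
```

The important design point is that the checkpoint path stays the same across restarts; pointing a restarted stream at a new path would discard the recorded progress.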

Incremental ingestion using Auto Loader with Lakeflow pipelines

Databricks recommends Auto Loader in Lakeflow pipelines for incremental data ingestion. You do not need to provide a schema or checkpoint location because Lakeflow pipelines automatically manage these settings for your pipelines. See Configure Auto Loader for production workloads for recommended configuration.

Databricks also recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from cloud object storage. APIs are available in Python and Scala.

Get started with Databricks Auto Loader

See the following articles to get started configuring incremental data ingestion using Auto Loader with Lakeflow pipelines:

Load data from cloud object storage into streaming tables using Auto Loader (Databricks SQL Editor)

Examples: Common Auto Loader patterns

For examples of common Auto Loader patterns, see Common data loading patterns.

Configure Auto Loader options

For a complete reference of the configuration options that control how Auto Loader reads and processes files, see Auto Loader.

Customize Auto Loader

You can tune Auto Loader based on data volume, variety, and velocity.

Configure schema inference and evolution in Auto Loader: Configure how Auto Loader infers and evolves the schema of your data over time, including handling new columns and type changes.
Automatic type widening with Auto Loader
Configure Auto Loader for production workloads: Optimize Auto Loader for reliability and performance in production, including checkpointing, error handling, and file retention management.
Source data retention: Automatically archive or delete files after ingestion to reduce storage costs and speed up file discovery.
Monitor and observe Auto Loader: Monitor key metrics, query file-level ingestion state, build observability dashboards, and troubleshoot common issues.

If you encounter unexpected performance issues, see Auto Loader FAQ.

Configure Auto Loader file detection modes

Auto Loader supports two file detection modes: directory listing and file notification. By default, Auto Loader uses directory listing mode. However, Databricks recommends file notification mode using file events for most workloads.
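A minimal sketch of selecting a detection mode through reader options follows. The helper and the mode labels are hypothetical, and the option name cloudFiles.useManagedFileEvents is given from memory as the switch for file events, so verify it against the Auto Loader options reference before relying on it.

```python
def detection_mode_options(mode):
    """Return the extra cloudFiles options for a detection mode.

    The mode labels are hypothetical names used only in this sketch.
    """
    if mode == "directory_listing":
        # Directory listing is the default, so no extra option is needed.
        return {}
    if mode == "file_notification":
        # Assumed option name for file notification mode with file events.
        return {"cloudFiles.useManagedFileEvents": "true"}
    raise ValueError(f"unknown detection mode: {mode!r}")


# The result can be merged into the other reader options, for example:
# options = {"cloudFiles.format": "json", **detection_mode_options("file_notification")}
```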

Handle out-of-order data

Auto Loader does not guarantee the order in which files are discovered or processed, regardless of whether you use directory listing or file notification mode. Use the following strategies to design your pipelines to handle out-of-order file arrivals.

Lakeflow pipelines with AUTO CDC

If you use Lakeflow pipelines with Auto Loader and AUTO CDC, configure tombstone retention so that deleted records are retained long enough to handle out-of-order file arrivals. Set the pipelines.cdc.tombstoneGCThresholdInSeconds table property on the target streaming table to a value that exceeds the maximum expected delay between event arrival and pipeline execution. The default retention is two days. For details, see create_auto_cdc_flow.
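As a sketch of sizing the tombstone retention, the helper below turns a maximum expected delay into the table property value. The helper name and the safety margin are assumptions for illustration; only the property name and the two-day default come from the text above.

```python
from datetime import timedelta

# Default retention stated in the documentation: two days.
DEFAULT_TOMBSTONE_RETENTION = timedelta(days=2)


def tombstone_table_properties(max_expected_delay, safety_margin=timedelta(hours=6)):
    """Build table properties for the target streaming table.

    The threshold is the maximum expected delay plus a margin, so it
    exceeds the delay rather than merely matching it. The margin is a
    hypothetical choice for this sketch.
    """
    threshold = max_expected_delay + safety_margin
    return {
        "pipelines.cdc.tombstoneGCThresholdInSeconds": str(int(threshold.total_seconds()))
    }


# Files may arrive up to three days late, so retain tombstones a bit longer.
props = tombstone_table_properties(timedelta(days=3))
```

The resulting dictionary is meant to be passed as the table properties of the target streaming table when it is declared in the pipeline.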

Structured Streaming without Lakeflow pipelines

If you use Apache Spark Structured Streaming directly with Auto Loader (without Lakeflow pipelines), consider the following patterns to handle out-of-order data:

Prefer soft deletes over hard deletes: Track a deleted flag and timestamp instead of removing rows, so that a late-arriving delete does not conflict with earlier records.
Compare timestamps before applying updates: When upserting, compare the incoming record's update timestamp against the target row's current timestamp to avoid overwriting with stale data.
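The two patterns above can be illustrated with a small in-memory model. This is a toy sketch of the logic only, not Auto Loader or Delta Lake code: the Change class, its fields, and apply_change are hypothetical, and in a real stream the same comparison would typically live in the matched condition of a merge into the target table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Change:
    key: str
    value: Optional[str]
    updated_at: int           # event timestamp, e.g. epoch seconds
    is_deleted: bool = False  # soft-delete flag instead of removing the row


def apply_change(target: Dict[str, Change], incoming: Change) -> bool:
    """Apply incoming to target only if it is newer than the stored row.

    Returns True when the change was applied, False when it was stale.
    """
    current = target.get(incoming.key)
    if current is not None and incoming.updated_at <= current.updated_at:
        return False  # stale or duplicate record: keep the existing row
    # Deletes are stored as rows with is_deleted=True, so their timestamp
    # stays available for comparison against late-arriving records.
    target[incoming.key] = incoming
    return True


# Files arrive out of order: the delete (t=3) is seen before an update (t=2).
table: Dict[str, Change] = {}
apply_change(table, Change("a", "v1", updated_at=1))
apply_change(table, Change("a", None, updated_at=3, is_deleted=True))
late_applied = apply_change(table, Change("a", "v2", updated_at=2))
```

Because the delete is kept as a flagged row, the late update with the older timestamp is rejected instead of resurrecting the record.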

Benefits of Auto Loader over using Structured Streaming directly on files

In Apache Spark, you can read files incrementally using spark.readStream.format(fileFormat).load(directory). Auto Loader provides the following benefits over the file source:

Scalability: Auto Loader can discover billions of files efficiently. Backfills can be performed asynchronously to avoid wasting any compute resources.
Performance: The cost of discovering files with Auto Loader scales with the number of files that are being ingested instead of the number of directories that the files may land in. See Configure Auto Loader streams in directory listing mode.
Schema inference and evolution support: Auto Loader can detect schema drifts, notify you when schema changes happen, and rescue data that would have been otherwise ignored or lost. See How does Auto Loader schema inference work?.
Cost: Auto Loader uses native cloud APIs to get lists of files that exist in storage. In addition, Auto Loader's file notification mode can help reduce your cloud costs further by avoiding directory listing altogether. Auto Loader can automatically set up file notification services on storage to make file discovery much cheaper.
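To make the schema inference and evolution benefit more concrete, the sketch below assembles the related reader options. The helper name and the paths are illustrative assumptions, and the option names and mode values are listed as I understand them, so check them against the Auto Loader options reference.

```python
# Schema evolution modes as understood for cloudFiles.schemaEvolutionMode.
VALID_EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def schema_options(schema_location, evolution_mode="addNewColumns", infer_column_types=False):
    """Build cloudFiles options that control schema inference and evolution."""
    if evolution_mode not in VALID_EVOLUTION_MODES:
        raise ValueError(f"unsupported schema evolution mode: {evolution_mode!r}")
    return {
        # Directory where the inferred schema and its changes are tracked.
        "cloudFiles.schemaLocation": schema_location,
        # How the stream reacts when new columns appear in the data.
        "cloudFiles.schemaEvolutionMode": evolution_mode,
        # Infer typed columns instead of reading everything as strings.
        "cloudFiles.inferColumnTypes": "true" if infer_column_types else "false",
    }


# Keep the schema fixed and route unexpected data to the rescued data column.
rescue_opts = schema_options("/Volumes/main/etl/schemas/orders", evolution_mode="rescue")
```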
