Configure Auto Loader for production workloads

For comprehensive best practices on setting up Auto Loader, including file discovery mode selection, schema management, and data quality handling, see Auto Loader best practices.

Databricks recommends using Auto Loader in Lakeflow pipelines for incremental data ingestion. Lakeflow pipelines extend functionality in Apache Spark Structured Streaming and allow you to write just a few lines of declarative Python or SQL to deploy a production-quality data pipeline with:

Autoscaling compute infrastructure for cost savings: Optimize Lakeflow pipeline cluster utilization with autoscaling
Data quality checks with expectations: Manage data quality with pipeline expectations
Automatic schema evolution handling: Configure schema inference and evolution in Auto Loader
Monitoring via metrics in the event log: Pipeline event log

Databricks also recommends you follow the streaming best practices for running Auto Loader in production. See Production considerations for Structured Streaming.

note

Lakeflow pipelines are the recommended way to run Auto Loader for most production ingestion. If your workload doesn't have low-latency requirements and your priority is minimizing compute cost, you can instead schedule Auto Loader as a triggered batch job that uses Trigger.AvailableNow. See Cost considerations.

Monitoring Auto Loader

The following sections describe how to monitor Auto Loader in production, including metrics, logs, alerts, and common troubleshooting workflows. For a comprehensive reference covering dashboard patterns, latency analysis, and schema drift detection, see Monitor and observe Auto Loader.

Querying files discovered by Auto Loader

Auto Loader provides a SQL API for inspecting the state of a stream. Using the cloud_files_state function, you can find metadata about files that have been discovered by an Auto Loader stream. To query cloud_files_state, provide the checkpoint location associated with the Auto Loader stream.

note

The cloud_files_state function is available in Databricks Runtime 11.3 LTS and above.

SQL
SELECT * FROM cloud_files_state('path/to/checkpoint');

Listen to stream updates

To further monitor Auto Loader streams, Databricks recommends using Apache Spark's Streaming Query Listener interface. See Monitoring Structured Streaming queries on Databricks.

Auto Loader reports metrics to the Streaming Query Listener at every batch. You can view how many files exist in the backlog and how large the backlog is in the numFilesOutstanding and numBytesOutstanding metrics under the Raw Data tab in the streaming query progress dashboard:

JSON
{
  "sources": [
    {
      "description": "CloudFilesSource[/path/to/source]",
      "metrics": {
        "numFilesOutstanding": "238",
        "numBytesOutstanding": "163939124006"
      }
    }
  ]
}

When using file notification mode in Databricks Runtime 10.4 LTS and above, the metrics also include the approximate number of file events in the cloud queue as approximateQueueSize for AWS and Azure.
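As a minimal sketch of how these metrics could feed an alert or dashboard, the following Python parses a progress payload shaped like the JSON above and extracts the backlog for each Auto Loader source. The function name auto_loader_backlog and the output field names are hypothetical, and the sketch assumes you already have the progress event as a JSON string. Note that the metric values arrive as strings.

```python
import json

# Progress payload copied from the sample above.
progress_json = """
{
  "sources": [
    {
      "description": "CloudFilesSource[/path/to/source]",
      "metrics": {
        "numFilesOutstanding": "238",
        "numBytesOutstanding": "163939124006"
      }
    }
  ]
}
"""


def auto_loader_backlog(progress: dict) -> list:
    """Return backlog metrics for each Auto Loader source in a progress event."""
    backlog = []
    for source in progress.get("sources", []):
        # Auto Loader sources are described as CloudFilesSource[...].
        if not source.get("description", "").startswith("CloudFilesSource"):
            continue
        metrics = source.get("metrics", {})
        queue_size = metrics.get("approximateQueueSize")
        backlog.append({
            "description": source["description"],
            # Metric values are reported as strings, so convert them.
            "files": int(metrics.get("numFilesOutstanding", 0)),
            "bytes": int(metrics.get("numBytesOutstanding", 0)),
            # Only present in file notification mode; None otherwise.
            "queue_size": int(queue_size) if queue_size is not None else None,
        })
    return backlog


backlog = auto_loader_backlog(json.loads(progress_json))
```

You could compare the files or bytes values against a threshold of your choosing to decide when a growing backlog warrants an alert.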

Cost considerations

When running Auto Loader, your main sources of cost are compute resources and file discovery.

If your workload doesn't have low-latency requirements, you can reduce compute costs by using Lakeflow Jobs to schedule Auto Loader as batch jobs using Trigger.AvailableNow instead of running it continuously. See Configure Structured Streaming trigger intervals. These batch jobs can be triggered using file arrival triggers to further lower the latency between file arrival and processing.

File discovery costs can come in the form of LIST operations on your storage accounts in directory listing mode, and API requests on the subscription service and queue service in file notification mode. Continuous triggers such as Trigger.ProcessingTime are especially expensive in directory listing mode, because Auto Loader repeatedly lists the entire directory to find new files. If your workload requires continuous triggers, Databricks recommends choosing a file discovery mode based on your latency requirements:

Low latency and simplicity: Use Auto Loader with file events. File events require only one queue per bucket and use incremental discovery on subsequent runs. For more information, see Auto Loader with file events overview.
Very latency-sensitive applications: Use classic file notification mode. Classic mode reads directly from the cloud queue without the additional caching hop introduced by file events. In this mode, you can tag resources created by Auto Loader to track your costs using resource tags. For details, see File notification.

Source data retention

note

Available in Databricks Runtime 16.4 LTS and above.

As files accumulate in your source directory, storage costs increase and file discovery slows down, especially in directory listing mode. Auto Loader provides the cloudFiles.cleanSource option to automatically manage file retention by archiving or deleting files after they are processed.

Archiving files in the source directory to lower costs

warning

Setting cloudFiles.cleanSource deletes or moves files in the source directory.
If you use foreachBatch for your data processing, your files become move or delete candidates as soon as your foreachBatch operation returns successfully, even if your operation consumed only a subset of the files in the batch.

Databricks recommends using Auto Loader with file events to reduce discovery costs. This also reduces compute costs because discovery is incremental.

If you cannot use file events and must use directory listing to discover files, you can use the cloudFiles.cleanSource option to automatically archive or delete files after Auto Loader processes them to lower discovery costs. Because Auto Loader cleans up files from your source directory after processing, fewer files need to be listed during discovery.

When using cloudFiles.cleanSource with the MOVE option, consider the following requirements:

Both the source directory and the destination move directory must be located in the same bucket or container. Cross-bucket and cross-container moves are not supported and result in an error.
The move destination can be a volume path (for example, /Volumes/my_catalog/my_schema/my_volume/archive/).
If your source and destination directories are in the same external location, they should not have sibling directories that contain managed storage (for example, a managed volume or catalog). In these cases, Auto Loader is unable to get the necessary permissions to write to the destination directory.
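The same-bucket requirement above can be checked before you start a stream. The following is a minimal sketch of such a pre-flight check in plain Python. The helper same_bucket_or_container is hypothetical and is not part of Auto Loader. It assumes both paths are cloud URIs with a scheme (such as s3://), so it cannot validate volume paths, and it treats different schemes for the same bucket as a mismatch.

```python
from urllib.parse import urlparse


def same_bucket_or_container(source: str, destination: str) -> bool:
    """Return True if both URIs share a scheme and bucket/container.

    Hypothetical helper: compares only the scheme and the authority part
    of each URI (for s3://my-bucket/path, the authority is "my-bucket").
    """
    src, dst = urlparse(source), urlparse(destination)
    if not src.scheme or not dst.scheme:
        raise ValueError("Expected cloud URIs with a scheme, such as s3://bucket/path")
    return (src.scheme, src.netloc) == (dst.scheme, dst.netloc)


ok = same_bucket_or_container("s3://my-bucket/landing/", "s3://my-bucket/archive/landing/")
cross_bucket = same_bucket_or_container("s3://my-bucket/landing/", "s3://other-bucket/archive/")
```

Here ok is True and cross_bucket is False, which flags a configuration that would result in an error.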

Databricks recommends using this option when:

Your source directory accumulates a large number of files over time.
You must retain processed files for compliance or auditing (set cloudFiles.cleanSource to MOVE).
You want to reduce storage costs by removing files after ingestion (set cloudFiles.cleanSource to DELETE). When using the DELETE mode, Databricks recommends enabling versioning on the bucket so that Auto Loader deletes act as soft-deletes and are available in case of a misconfiguration. Furthermore, Databricks recommends setting up cloud lifecycle policies to purge old, soft-deleted versions after a specified grace period (such as 60 or 90 days) based on your recovery requirements.

For the full reference on cleanSource options and their defaults, see cloudFiles.cleanSource.

Moving processed files to a cold storage path

The following example configures Auto Loader to move processed files to an archive directory within the same bucket after 14 days. You can apply a cloud lifecycle policy on the archive path to transition files to cheaper storage tiers (for example, AWS S3 Glacier, Azure Cool/Archive, or GCS Coldline/Archive).

Python
# Step 1: Configure Auto Loader to move processed files to an archive path.
checkpoint = "/Volumes/my_catalog/my_schema/my_volume/checkpoints/ingest_stream"
archive_path = "s3://my-bucket/archive/landing/"

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.cleanSource", "MOVE")
  .option("cloudFiles.cleanSource.moveDestination", archive_path)
  .option("cloudFiles.cleanSource.retentionDuration", "14 days")
  .option("cloudFiles.schemaLocation", checkpoint)
  .load("s3://my-bucket/landing/")
)

# Step 2: Write to a Delta table.
(df.writeStream
  .option("checkpointLocation", checkpoint)
  .trigger(availableNow=True)
  .toTable("my_catalog.my_schema.raw_events")
)

# Step 3 (outside Databricks): Set up a cloud lifecycle policy on the
# archive path to transition files to cold storage after a grace period.
# For example, in AWS you can configure an S3 Lifecycle rule to move
# objects under s3://my-bucket/archive/landing/ to S3 Glacier after
# 30 days.

SQL
-- Step 1: Configure Auto Loader to move processed files to an archive path
-- using a Lakeflow Declarative Pipeline.
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT * FROM STREAM read_files(
  's3://my-bucket/landing/',
  format => 'json',
  cleanSource => 'MOVE',
  cleanSourceMoveDestination => 's3://my-bucket/archive/landing/',
  cleanSourceRetentionDuration => '14 days'
);

-- Step 2 (outside Databricks): Set up a cloud lifecycle policy on the
-- archive path to transition files to cold storage.
-- For example, in AWS configure an S3 Lifecycle rule to move objects
-- under s3://my-bucket/archive/landing/ to S3 Glacier after 30 days.

Using Trigger.AvailableNow and rate limiting

note

Available in Databricks Runtime 10.4 LTS and above.

Auto Loader can be scheduled to run in Lakeflow Jobs as a batch job by using Trigger.AvailableNow. The AvailableNow trigger instructs Auto Loader to process all files that arrived before the query start time. New files that arrive after the stream starts are ignored until the next trigger.

With Trigger.AvailableNow, file discovery happens asynchronously with data processing, and data can be processed across multiple micro-batches with rate limiting. By default, Auto Loader processes a maximum of 1000 files in each micro-batch. Set cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger to control how many files or how many bytes are processed in a micro-batch. The file limit is a hard limit, but the byte limit is a soft limit, meaning that a micro-batch can process more bytes than the provided maxBytesPerTrigger. When both options are provided, Auto Loader processes as many files as are needed to hit one of the limits.
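To make the interaction between the two limits concrete, the following Python is a toy model of the behavior described above. It is not Auto Loader's implementation. The function plan_micro_batches and its parameters are hypothetical, and the sketch assumes that the file that crosses the byte threshold is still included in the batch, which is one way a soft limit can behave.

```python
def plan_micro_batches(file_sizes, max_files=1000, max_bytes=None):
    """Group file sizes (in bytes) into micro-batches under a simplified model.

    A batch closes as soon as it reaches max_files (hard limit) or its total
    size reaches max_bytes (soft limit, so the total can overshoot).
    """
    batches, current, current_bytes = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_bytes += size
        hit_files = len(current) >= max_files
        hit_bytes = max_bytes is not None and current_bytes >= max_bytes
        if hit_files or hit_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        # Remaining files form a final, smaller batch.
        batches.append(current)
    return batches


sizes = [60, 60, 10, 10, 10, 10]
batches = plan_micro_batches(sizes, max_files=3, max_bytes=100)
```

In this model, the first batch holds two files totaling 120 bytes (over the 100-byte soft limit), the second batch is capped at three files by the hard limit, and the last file lands in a third batch.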

Checkpoint location

The checkpoint location is used to store the state and progress information of the stream. Databricks recommends setting the checkpoint location to a location without a cloud object lifecycle policy. If files in the checkpoint location are cleaned according to the policy, the stream state is corrupted. If this happens, you must restart the stream from scratch.

File event tracking

Auto Loader keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees. For high-volume or long-lived ingestion streams, Databricks recommends upgrading to Databricks Runtime 15.4 LTS or above. In these versions, Auto Loader does not wait for the entire RocksDB state to be downloaded before the stream starts, which can accelerate stream startup time. If you want to prevent the file states from growing without limits, you can also consider using the cloudFiles.maxFileAge option to expire file events that are older than a certain age. The minimum value that you can set for cloudFiles.maxFileAge is "14 days". Deletes in RocksDB appear as tombstone entries. Therefore, you might see the storage usage increase temporarily as events expire before it starts to level off.

warning

cloudFiles.maxFileAge is provided as a cost control mechanism for high volume datasets. Tuning cloudFiles.maxFileAge too aggressively can cause data quality issues such as duplicate ingestion or missing files. Therefore, Databricks recommends a conservative setting for cloudFiles.maxFileAge, such as 90 days, which is similar to what comparable data ingestion solutions recommend.

Tuning the cloudFiles.maxFileAge option can lead to unprocessed files being ignored by Auto Loader, or to already processed files expiring and then being reprocessed, causing duplicate data. Here are some things to consider when choosing a value for cloudFiles.maxFileAge:

If your stream restarts after a long time, file notification events that are pulled from the queue and are older than cloudFiles.maxFileAge are ignored. Similarly, if you use directory listing, files that appeared during the downtime and are older than cloudFiles.maxFileAge are ignored.
If you use directory listing mode with cloudFiles.maxFileAge set to, for example, "1 month", and then stop your stream and restart it with cloudFiles.maxFileAge set to "2 months", files that are older than 1 month but more recent than 2 months are reprocessed.

If you set this option the first time you start the stream, you do not ingest data older than cloudFiles.maxFileAge. Therefore, if you want to ingest old data, do not set this option when you start your stream for the first time. However, you should set this option on subsequent runs.
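The reprocessing scenario with directory listing can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not how Auto Loader or RocksDB work internally: tracked file state is a plain set, file ages come from a dictionary of modification times, and the names expire_tracked and files_to_ingest are hypothetical.

```python
from datetime import datetime, timedelta


def expire_tracked(tracked, modified_at, now, max_file_age):
    """Drop tracked entries for files older than the max file age."""
    cutoff = now - max_file_age
    return {name for name in tracked if modified_at[name] >= cutoff}


def files_to_ingest(modified_at, tracked, now, max_file_age):
    """List files that are within the max file age and not already tracked."""
    cutoff = now - max_file_age
    return sorted(
        name for name, ts in modified_at.items()
        if ts >= cutoff and name not in tracked
    )


now = datetime(2024, 6, 1)
modified_at = {
    "a.json": now - timedelta(days=45),
    "b.json": now - timedelta(days=10),
}
# Both files were ingested earlier, while they were still recent.
tracked = {"a.json", "b.json"}

# With a 30-day max file age, the entry for a.json expires from the state.
tracked = expire_tracked(tracked, modified_at, now, timedelta(days=30))
run_30_days = files_to_ingest(modified_at, tracked, now, timedelta(days=30))

# Restarting with a 60-day max file age brings a.json back into scope.
run_60_days = files_to_ingest(modified_at, tracked, now, timedelta(days=60))
```

In this model, run_30_days is empty, but run_60_days contains a.json: the file is inside the wider window and its tracking entry has already expired, so it would be ingested a second time.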

Trigger regular backfills using cloudFiles.backfillInterval

In rare instances, files might be missed or late when depending solely on notification systems, such as when notification message retention limits are reached. If you have strict requirements on data completeness and SLA, consider setting cloudFiles.backfillInterval to trigger asynchronous backfills at a specified interval. For example, set it to one day for daily backfills, or one week for weekly backfills. Triggering regular backfills does not cause duplicates.
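As a small sketch, the backfill interval can be added to an existing set of Auto Loader reader options. The helper with_backfill is hypothetical, and the sketch assumes the option accepts an interval string such as "1 day" or "1 week". The commented-out line shows how the resulting dictionary might be passed to a stream reader in a Databricks notebook, where spark is available.

```python
def with_backfill(options: dict, interval: str) -> dict:
    """Return a copy of the reader options with a backfill interval added."""
    merged = dict(options)  # Copy so the caller's dictionary is not mutated.
    merged["cloudFiles.backfillInterval"] = interval
    return merged


base_options = {"cloudFiles.format": "json"}
options = with_backfill(base_options, "1 day")

# In a Databricks notebook (assumes `spark` is defined):
# df = spark.readStream.format("cloudFiles").options(**options).load("s3://my-bucket/landing/")
```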

When using file events, run your stream at least once every 7 days

When using file events, run your Auto Loader streams at least once every 7 days to avoid a full directory listing. Running your streams this frequently ensures that file discovery stays incremental.
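If your streams run on an irregular schedule, a simple staleness check can flag the ones at risk of exceeding the 7-day window. The following Python is a minimal sketch. The function needs_run and the one-day safety margin are assumptions for illustration, and how you obtain the last run time (for example, from your job run history) is outside the scope of the sketch.

```python
from datetime import datetime, timedelta

# Maximum gap between runs before discovery falls back to a full listing.
MAX_GAP = timedelta(days=7)


def needs_run(last_run: datetime, now: datetime,
              safety_margin: timedelta = timedelta(days=1)) -> bool:
    """Return True if the stream should run soon to stay within MAX_GAP."""
    return now - last_run >= MAX_GAP - safety_margin


last_run = datetime(2024, 1, 1)
recent = needs_run(last_run, datetime(2024, 1, 3))  # 2 days since last run
stale = needs_run(last_run, datetime(2024, 1, 7))   # 6 days since last run
```

With the default margin, recent is False and stale is True, so the stream would be flagged one day before the 7-day limit is reached.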

For comprehensive managed file events best practices, see Best practices for Auto Loader with file events.
