Trigger jobs when new files arrive

You can use a file arrival trigger to run your Databricks job when new files arrive in an external location such as Amazon S3, Azure storage, or Google Cloud Storage. This is useful when a scheduled job might be inefficient because new data arrives on an irregular schedule.

File arrival triggers make a best effort to check for new files every minute, although this can be affected by the performance of the underlying cloud storage. File arrival triggers do not incur additional costs other than cloud provider costs associated with listing files in the storage location.

A file arrival trigger can be configured to monitor the root of a Unity Catalog external location or volume, or a subpath of an external location or volume. For example, for the Unity Catalog root volume /Volumes/mycatalog/myschema/myvolume/, the following are valid paths for a file arrival trigger:

/Volumes/mycatalog/myschema/myvolume/
/Volumes/mycatalog/myschema/myvolume/mydirectory/

A file arrival trigger recursively checks for new files in all subdirectories of the configured location. For example, if you create a file arrival trigger for the location /Volumes/mycatalog/myschema/myvolume/mydirectory/ and this location has the following subdirectories:

/Volumes/mycatalog/myschema/myvolume/mydirectory/subdirA
/Volumes/mycatalog/myschema/myvolume/mydirectory/subdirB
/Volumes/mycatalog/myschema/myvolume/mydirectory/subdirC/subdirD

The trigger checks for new files in mydirectory, subdirA, subdirB, subdirC, and subdirC/subdirD.
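Conceptually, this recursive check resembles a directory walk. The sketch below is illustrative only, not Databricks code; it assumes a locally accessible path rather than cloud storage:

```python
import os

def list_all_files(root: str) -> list[str]:
    """Illustration of a recursive scan: collect every file under `root`,
    including files in all nested subdirectories, the way a file arrival
    trigger monitors the configured location and everything below it."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found
```

A file placed anywhere in the tree, such as `mydirectory/subdirC/subdirD/data.csv`, would be picked up by this walk, just as the trigger detects it.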

Requirements

The following are required to use file arrival triggers:

Limitations

  • Only new files trigger runs. Overwriting an existing file with a file of the same name does not trigger a run.
  • The path used for a file arrival trigger must not contain external tables or managed locations of catalogs and schemas.
  • The path used for a file arrival trigger cannot contain wildcards, for example, * or ?.
  • If the storage location is configured as an external location in Unity Catalog and that external location is enabled for managed file events:
    • A maximum of 1,000 jobs can be configured with a file arrival trigger in a Databricks workspace.
    • There are no limits on the number of files in the storage location.
    • When a trigger monitors a subpath of a location, whether an external location or a volume, a large number of changes in the root location can cause the trigger to exceed the allowed time to process the changes. If this happens, the trigger is set to an error state. You can prevent this by configuring the trigger to monitor the root of a location. For example, you can create a Unity Catalog volume at the subpath and configure the trigger on that volume's root.
  • If the storage location is not enabled for file events:
    • A maximum of 50 jobs can be configured with a file arrival trigger on such locations in a Databricks workspace.
    • The storage location can contain up to 10,000 files. If the configured storage location is a subpath of a Unity Catalog external location or volume, the 10,000 file limit applies to the subpath and not the root of the storage location. For example, the root of the storage location can contain more than 10,000 files across its subdirectories, but the configured subdirectory must not exceed the 10,000 file limit.

For additional limitations when managed file events are used as file arrival triggers, see File events limitations.
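Some of these path rules can be screened before you save a trigger. The helper below is a hypothetical pre-flight check, not part of any Databricks API; it only enforces the wildcard restriction described above:

```python
def is_valid_trigger_path(path: str) -> bool:
    """Hypothetical pre-check for a file arrival trigger path:
    rejects the wildcard characters * and ?, which are not allowed."""
    return not any(ch in path for ch in "*?")

# A plain volume path is acceptable; a wildcard path is not.
is_valid_trigger_path("/Volumes/mycatalog/myschema/myvolume/")
is_valid_trigger_path("/Volumes/mycatalog/myschema/*/data/")
```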

Comparing file arrival triggers with and without managed file events

In addition to the differences listed in the Limitations section, file arrival triggers that use managed file events differ as follows from file arrival triggers without file events:

  • When file events are enabled for an external location, Databricks uses a new internal service to track ingestion metadata by processing change notifications from cloud providers. This service retains the metadata for the latest files created or updated for a longer time (for example, a rolling retention of 30 days for the latest million files). This approach enhances the efficiency of file processing.

  • If an existing file is modified and its metadata falls outside the rolling retention period, that modification will be treated as a new file arrival, triggering a job run. You can prevent this by ingesting only immutable files, or you can use file arrival triggers with Auto Loader to track ingestion progress.
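If Auto Loader is not an option, a triggered task can guard against such re-notifications itself by keeping its own record of processed file names. The sketch below is a hypothetical idempotency guard (Auto Loader provides this kind of tracking natively through its checkpoint); the ledger path and helper name are assumptions, not Databricks APIs:

```python
import json
import os

def filter_unprocessed(files: list[str], ledger_path: str) -> list[str]:
    """Hypothetical idempotency guard for a triggered job task: skip files
    whose names were already processed, so a re-notification of a modified
    file does not cause duplicate ingestion. Processed names are persisted
    to a JSON ledger between runs."""
    seen: set[str] = set()
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            seen = set(json.load(f))
    new = [name for name in files if name not in seen]
    with open(ledger_path, "w") as f:
        json.dump(sorted(seen | set(new)), f)
    return new
```

On a second run, a file name already in the ledger is filtered out even if the trigger fired for it again.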

Add a file arrival trigger

To add a file arrival trigger to a job:

  1. In the sidebar, click Workflows.
  2. In the Name column on the Jobs tab, click the job name.
  3. In the Job details panel on the right, click Add trigger.
  4. In Trigger type, select File arrival.
  5. In Storage location, enter the URL of the root or a subpath of a Unity Catalog external location or the root or a subpath of a Unity Catalog volume to monitor.
  6. (Optional) Configure advanced options:
    • Minimum time between triggers in seconds: The minimum time to wait to trigger a run after a previous run completes. Files that arrive in this period trigger a run only after the waiting time expires. Use this setting to control the frequency of run creation.
    • Wait after last change in seconds: The time to wait to trigger a run after file arrival. Another file arrival in this period resets the timer. This setting can be used when files arrive in batches, and the whole batch needs to be processed after all files have arrived.
  7. To validate the configuration, click Test connection.
  8. Click Save.
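The same configuration can also be expressed programmatically. The sketch below builds the trigger settings as a Python dict in the shape accepted by the Databricks Jobs API's `file_arrival` trigger object; the volume path and timing values are example assumptions:

```python
# Sketch of file arrival trigger settings in the shape of the Jobs API's
# `file_arrival` trigger object. The monitored path and timing values
# below are hypothetical examples, not defaults.
trigger = {
    "file_arrival": {
        # Root or subpath of a Unity Catalog external location or volume.
        "url": "/Volumes/mycatalog/myschema/myvolume/mydirectory/",
        # Optional: minimum seconds between triggered runs.
        "min_time_between_triggers_seconds": 60,
        # Optional: quiet period after the last new file before a run starts.
        "wait_after_last_change_seconds": 120,
    },
    "pause_status": "UNPAUSED",
}
```

Passing this dict as the `trigger` field when creating or updating a job mirrors steps 4 through 6 of the UI flow above.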

Receive notifications of failed file arrival triggers

To be notified if a file arrival trigger fails to evaluate, configure email or system destination notifications on job failure. See Add notifications on a job.