Trigger jobs when new files arrive
You can use file arrival triggers to run your Databricks job when new files arrive in an external location such as Amazon S3, Azure storage, or Google Cloud Storage. Use this feature when a scheduled job might be inefficient because new data arrives on an irregular schedule.
File arrival triggers make a best effort to check for new files every minute, although this can be affected by the performance of the underlying cloud storage. File arrival triggers do not incur additional costs other than cloud provider costs associated with listing files in the storage location.
A file arrival trigger can be configured to monitor the root of a Unity Catalog external location or volume, or a subpath of an external location or volume. For example, for the Unity Catalog volume /Volumes/mycatalog/myschema/myvolume/, the following are valid paths for a file arrival trigger:

- /Volumes/mycatalog/myschema/myvolume/
- /Volumes/mycatalog/myschema/myvolume/mydirectory/
A file arrival trigger recursively checks for new files in all subdirectories of the configured location. For example, if you create a file arrival trigger for the location /Volumes/mycatalog/myschema/myvolume/mydirectory/ and this location has the following subdirectories:

- /Volumes/mycatalog/myschema/myvolume/mydirectory/subdirA
- /Volumes/mycatalog/myschema/myvolume/mydirectory/subdirB
- /Volumes/mycatalog/myschema/myvolume/mydirectory/subdirC/subdirD

then the trigger checks for new files in mydirectory, subdirA, subdirB, subdirC, and subdirC/subdirD.
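The recursive behavior above can be illustrated with a short local sketch. The directory layout mirrors the example, but note this is only an analogy: a real trigger monitors cloud storage, not the local filesystem, and `os.walk` here stands in for the service's recursive listing.

```python
import os
import tempfile

# Build a local mirror of the example layout (illustrative only).
root = tempfile.mkdtemp()
for sub in ("subdirA", "subdirB", os.path.join("subdirC", "subdirD")):
    os.makedirs(os.path.join(root, sub))

# A file arrival trigger checks the configured directory and every
# subdirectory beneath it; os.walk models that recursive descent.
checked = sorted(
    os.path.relpath(dirpath, root) for dirpath, _, _ in os.walk(root)
)
print(checked)  # ['.', 'subdirA', 'subdirB', 'subdirC', 'subdirC/subdirD']
```

Note that subdirC is checked even though only subdirC/subdirD was created explicitly, because it exists as an intermediate directory on the path.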
Requirements
The following are required to use file arrival triggers:

- The workspace must have Unity Catalog enabled.
- You must use a storage location that is either a volume or an external location configured in Unity Catalog. See What are Unity Catalog volumes? and Create an external location to connect cloud storage to Databricks.

  For optimal performance, the external location should be enabled for file events. Volumes on these external locations get file event support by default. See (Recommended) Enable file events for an external location.

  To enable file events for an external location, you must be the external location owner or have the MANAGE privilege on the external location. Within minutes of enabling file events on an external location, existing file arrival triggers that monitor paths covered by that external location begin to benefit from file events. New triggers benefit within seconds. For details about the performance and capacity advantages of enabling file events on external locations, see Limitations. See also Comparing file arrival triggers with managed file events and without.

  Managed file events are in Public Preview.
- You must have READ permission on the storage location and CAN MANAGE permission on the job. For more information about job permissions, see Job ACLs.
Limitations
- Only new files trigger runs. Overwriting an existing file with a file of the same name does not trigger a run.
- The path used for a file arrival trigger must not contain external tables or managed locations of catalogs and schemas.
- The path used for a file arrival trigger cannot contain wildcards, for example, * or ?.
- If the storage location is configured as an external location in Unity Catalog and that external location is enabled for managed file events:
- A maximum of 1,000 jobs can be configured with a file arrival trigger in a Databricks workspace.
- There are no limits on the number of files in the storage location.
- When a trigger monitors a subpath of a location, whether an external location or a volume, a large number of changes in the root location can cause the trigger to exceed the allowed time to process the changes. If this happens, the trigger is set into an error state. You can prevent this by configuring the trigger to monitor the root of a location. For example, you can create a Unity Catalog volume at the subpath and configure the trigger on the volume's root.
- If the storage location is not enabled for file events:
- A maximum of 50 jobs can be configured with a file arrival trigger on such locations in a Databricks workspace.
- The storage location can contain up to 10,000 files. If the configured storage location is a subpath of a Unity Catalog external location or volume, the 10,000 file limit applies to the subpath and not the root of the storage location. For example, the root of the storage location can contain more than 10,000 files across its subdirectories, but the configured subdirectory must not exceed the 10,000 file limit.
For additional limitations when managed file events are used as file arrival triggers, see File events limitations.
Comparing file arrival triggers with managed file events and without
In addition to the differences listed in the Limitations section, managed file events add the following differences in behavior when compared to file arrival triggers without file events:
-
When file events are enabled for an external location, Databricks uses a new internal service to track ingestion metadata by processing change notifications from cloud providers. This service retains metadata for recently created or updated files for an extended period (for example, a rolling retention of 30 days for the latest million files). This approach enhances the efficiency of file processing.
-
If an existing file is modified and its metadata falls outside the rolling retention period, that modification will be treated as a new file arrival, triggering a job run. You can prevent this by ingesting only immutable files, or you can use file arrival triggers with Auto Loader to track ingestion progress.
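The retention behavior described above can be sketched with a toy tracker. The class name, the retention constant, and the day-based bookkeeping are all illustrative simplifications, not the internal Databricks service:

```python
RETENTION_DAYS = 30  # illustrative window; the real value is service-defined

class FileEventTracker:
    """Toy model: remembers when each file was last processed, and
    forgets entries that age out of the rolling retention window."""

    def __init__(self):
        self.last_seen = {}  # path -> day the file was last processed

    def should_trigger(self, path, today):
        seen = self.last_seen.get(path)
        # Metadata outside the retention window is forgotten.
        if seen is not None and today - seen > RETENTION_DAYS:
            seen = None
        # With no retained metadata, a modified file looks new.
        is_new = seen is None
        self.last_seen[path] = today
        return is_new

tracker = FileEventTracker()
print(tracker.should_trigger("a.csv", today=0))   # True: first arrival
print(tracker.should_trigger("a.csv", today=10))  # False: metadata retained
print(tracker.should_trigger("a.csv", today=45))  # True: aged out, treated as new
```

The last call shows why ingesting only immutable files (or tracking progress with Auto Loader) avoids surprise re-triggers: a modification seen after the window closes is indistinguishable from a new arrival.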
Add a file arrival trigger
To add a file arrival trigger to a job:
- In the sidebar, click Workflows.
- In the Name column on the Jobs tab, click the job name.
- In the Job details panel on the right, click Add trigger.
- In Trigger type, select File arrival.
- In Storage location, enter the root or a subpath of the Unity Catalog external location or volume to monitor.
- (Optional) Configure advanced options:
- Minimum time between triggers in seconds: The minimum time to wait to trigger a run after a previous run completes. Files that arrive in this period trigger a run only after the waiting time expires. Use this setting to control the frequency of run creation.
- Wait after last change in seconds: The time to wait to trigger a run after file arrival. Another file arrival in this period resets the timer. This setting can be used when files arrive in batches, and the whole batch needs to be processed after all files have arrived.
- To validate the configuration, click Test connection.
- Click Save.
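The UI steps above can also be expressed programmatically. The following is a minimal sketch of the trigger portion of a job settings payload, with field names as they appear in Jobs API 2.1 (verify against the API version in your workspace); the storage URL and timing values are illustrative:

```python
import json

# Trigger settings for a create- or update-job request. The file_arrival
# block maps to the UI fields: Storage location, Minimum time between
# triggers in seconds, and Wait after last change in seconds.
trigger_settings = {
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            "url": "/Volumes/mycatalog/myschema/myvolume/mydirectory/",
            "min_time_between_triggers_seconds": 60,
            "wait_after_last_change_seconds": 30,
        },
    }
}
print(json.dumps(trigger_settings, indent=2))
```

Merging this block into a job's settings (for example, via the update-job endpoint or the Databricks SDK) has the same effect as configuring the trigger in the UI.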
Receive notifications of failed file arrival triggers
To be notified if a file arrival trigger fails to evaluate, configure email or system destination notifications on job failure. See Add notifications on a job.