Migrate to Auto Loader with file events
If you have existing Auto Loader streams that discover files using directory listing or legacy notifications, you can migrate them to Auto Loader with file events.
Migrate from directory listing
To migrate an Auto Loader stream using directory listing to file events:
- Confirm that the prerequisites for file events are satisfied.
- Confirm that your load path is in an external location with file events enabled and that file events work as expected.
- Modify your stream code to set cloudFiles.useManagedFileEvents to true. Continue using the same checkpoint location.
- Drop any unsupported settings from your stream code.
- Restart your stream. On the first run with file events enabled, Auto Loader performs a directory listing to get current with the file events cache (secure a valid read position in the cache and store it in the stream’s checkpoint). Subsequent runs read directly from the file events cache. See Auto Loader with file events overview.
To migrate an Auto Loader stream from file events back to directory listing:
- Remove the option cloudFiles.useManagedFileEvents from your stream code.
- Restart your stream.
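Both directions of this migration come down to one reader option. The following is a minimal sketch, not a definitive implementation: the paths, the table name argument, and the helper names with_file_events and start_stream are hypothetical, and reusing the checkpoint when reverting to directory listing is an assumption (the steps above state it only for the move to file events).

```python
# Hypothetical locations; replace with your own load path and checkpoint.
SOURCE_PATH = "s3://example-bucket/landing/"
CHECKPOINT_PATH = "s3://example-bucket/_checkpoints/landing/"

FILE_EVENTS_OPTION = "cloudFiles.useManagedFileEvents"


def with_file_events(options, enabled):
    """Return a copy of the reader options with file events on or off."""
    updated = dict(options)
    if enabled:
        updated[FILE_EVENTS_OPTION] = "true"
    else:
        # Removing the option returns the stream to directory listing.
        updated.pop(FILE_EVENTS_OPTION, None)
    return updated


def start_stream(spark, options, target_table):
    """Start the Auto Loader stream against the existing checkpoint location."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in options.items():
        reader = reader.option(key, value)
    return (
        reader.load(SOURCE_PATH)
        .writeStream
        .option("checkpointLocation", CHECKPOINT_PATH)
        .toTable(target_table)
    )


# Options of a stream that currently uses directory listing (illustrative).
directory_listing = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": CHECKPOINT_PATH,
}
file_events = with_file_events(directory_listing, enabled=True)
rolled_back = with_file_events(file_events, enabled=False)
```

In a notebook you would pass file_events (or rolled_back) to start_stream together with the active spark session. The helper copies the dictionary, so the original options stay available for comparison.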
Migrate from legacy file notifications
Source data in S3
S3 does not allow event notification configurations with overlapping prefixes. If your source data is in S3, you must first tear down existing event notification configurations.
Legacy file notifications to file events (S3)
To migrate an Auto Loader stream consuming data from S3 using legacy file notifications to file events:
- Before you enable file events on your external locations, stop your Auto Loader stream and tear down the associated notification resources. You can use the tearDownNotificationResources API of CloudFilesAWSResourceManager, as described in Manually configure or manage file notification resources.
- Confirm that the prerequisites for file events are satisfied.
- Confirm that your load path is in an external location with file events enabled and that file events work as expected.
- Modify your stream code to set cloudFiles.useManagedFileEvents to true. Continue using the same checkpoint location.
- Remove unsupported settings from your stream code.
- Remove cloud-specific notifications options (such as cloudFiles.queueUrl, databricks.serviceCredential, or cloudFiles.awsAccessKey) from your stream code.
- Restart your stream. On the first run with file events enabled, Auto Loader performs a directory listing to get current with the file events cache (secure a valid read position in the cache and store it in the stream’s checkpoint). Subsequent runs read directly from the file events cache. See Auto Loader with file events overview.
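The option changes in the steps above can be sketched as a small pure-Python helper. This is an illustration under stated assumptions: the helper name to_file_events and the sample values are hypothetical, the list of removed keys contains only the examples named above and is not exhaustive, and dropping cloudFiles.useNotifications alongside them is an assumption rather than something the steps state explicitly.

```python
# Only the S3 notification options named in the steps above; your stream
# may use others that also need to be removed.
S3_NOTIFICATION_OPTIONS = (
    "cloudFiles.queueUrl",
    "databricks.serviceCredential",
    "cloudFiles.awsAccessKey",
)


def to_file_events(options):
    """Return (migrated_options, removed_keys) for a legacy S3 stream."""
    # Assumption: the legacy useNotifications flag is dropped as well.
    to_remove = set(S3_NOTIFICATION_OPTIONS) | {"cloudFiles.useNotifications"}
    migrated = {k: v for k, v in options.items() if k not in to_remove}
    migrated["cloudFiles.useManagedFileEvents"] = "true"
    removed = sorted(k for k in options if k in to_remove)
    return migrated, removed


# Illustrative legacy configuration; the queue URL is a made-up value.
legacy = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.queueUrl": "https://sqs.us-west-2.amazonaws.com/123456789012/example-queue",
}
migrated, removed = to_file_events(legacy)
```

Returning the removed keys makes it easy to review what changed before you restart the stream with the same checkpoint location.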
File events to legacy file notifications (S3)
To migrate an Auto Loader stream consuming data from S3 using file events to legacy file notifications:
- Stop your Auto Loader stream and turn off file events for the external location using the external locations UI. S3 does not allow event notification configurations with overlapping prefixes. If you provided a queue URL when you set up file events, Databricks recommends tearing down your previous setup and creating a new queue setup to avoid missing files.
- Remove the option cloudFiles.useManagedFileEvents from your stream code.
- Set the option cloudFiles.useNotifications to true.
- Add cloud-specific notifications options (such as cloudFiles.queueUrl, databricks.serviceCredential, or cloudFiles.awsAccessKey) that are required for Auto Loader to authenticate to your cloud, set up notification resources, and read from the queue.
- Start your stream. If you provided the cloudFiles.queueUrl option (provided a pre-configured queue), Auto Loader starts discovering files using the queue. It also does a one-off directory listing to make sure that no files were missed during the migration. If you did not provide a queue, Auto Loader attempts to create all resources needed for notifications. If you see a failure when you restart the Auto Loader stream, Databricks might not have completed deleting the notification resources it created for file events. Retry in a few minutes.
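The two cases in the last step (a pre-configured queue versus letting Auto Loader create the resources) differ only in whether cloudFiles.queueUrl is set. A minimal sketch of the option changes follows; the helper name to_legacy_notifications, the queue URL, and the service credential name are hypothetical values chosen for illustration.

```python
def to_legacy_notifications(options, queue_url=None, auth_options=None):
    """Return a copy of the options switched from file events to legacy notifications."""
    migrated = dict(options)
    migrated.pop("cloudFiles.useManagedFileEvents", None)
    migrated["cloudFiles.useNotifications"] = "true"
    if queue_url is not None:
        # Pre-configured queue: Auto Loader reads from it directly.
        migrated["cloudFiles.queueUrl"] = queue_url
    # Whatever your setup needs to authenticate, for example a service credential.
    migrated.update(auth_options or {})
    return migrated


current = {"cloudFiles.format": "json", "cloudFiles.useManagedFileEvents": "true"}

# Case 1: you supply a queue you created yourself (made-up URL).
with_queue = to_legacy_notifications(
    current,
    queue_url="https://sqs.us-west-2.amazonaws.com/123456789012/example-queue",
    auth_options={"databricks.serviceCredential": "example-service-credential"},
)

# Case 2: no queue, so Auto Loader attempts to create the resources itself.
without_queue = to_legacy_notifications(
    current,
    auth_options={"databricks.serviceCredential": "example-service-credential"},
)
```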
Source data in Azure Storage or Google Cloud Storage
Legacy file notifications to file events (Azure, GCP)
To migrate an Auto Loader stream consuming data from Azure or GCP using legacy file notifications to file events:
- Confirm that the prerequisites for file events are satisfied.
- Confirm that your load path is in an external location with file events enabled and that file events work as expected.
- Modify your stream code to set cloudFiles.useManagedFileEvents to true. Continue using the same checkpoint location.
- Remove unsupported settings from your stream code.
- Remove cloud-specific notifications options (such as cloudFiles.queueName, cloudFiles.subscription, databricks.serviceCredential, cloudFiles.privateKey, or cloudFiles.clientSecret) from your stream code.
- Restart your stream. On the first run with file events enabled, Auto Loader will perform a directory listing to get current with the file events cache (secure a valid read position in the cache and store it in the stream’s checkpoint). Subsequent runs read directly from the file events cache. See Auto Loader with file events overview.
- Remove notification resources that were created by Auto Loader when it ran in the legacy file notifications mode. You can use the tearDownNotificationResources API of the resource manager for your cloud (CloudFilesAzureResourceManager or CloudFilesGCPResourceManager), as described in Manually configure or manage file notification resources.
File events to legacy file notifications (Azure, GCP)
To migrate an Auto Loader stream consuming data from Azure or GCP using file events to legacy file notifications:
- Remove the option
cloudFiles.useManagedFileEvents from your stream code.
- Set the option cloudFiles.useNotifications to true.
- Add cloud-specific notifications options (such as cloudFiles.queueName, cloudFiles.subscription, databricks.serviceCredential, cloudFiles.privateKey, or cloudFiles.clientSecret) that are required for Auto Loader to authenticate to your cloud, set up notification resources, and read from the queue.
- Restart your stream. If you provided the cloudFiles.queueName or cloudFiles.subscription options (provided a pre-configured queue), Auto Loader starts discovering files using the queue. It also does a one-off directory listing to make sure that no files were missed during the migration. If you did not provide a queue, Auto Loader attempts to create all resources needed for notifications. If you see a failure when you restart the Auto Loader stream, Databricks might not have completed deleting the notification resources it created for file events. Retry in a few minutes.
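The "retry in a few minutes" advice can be automated with a small wrapper. The sketch below is generic Python rather than a Databricks API: start_with_retry is a hypothetical helper, and it assumes you pass it a callable that starts your stream and raises an exception if startup fails. The three-minute wait is an arbitrary choice. The flaky_start function only simulates a restart that fails twice, so the example runs without a cluster.

```python
import time


def start_with_retry(start_stream, attempts=3, wait_seconds=180, sleep=time.sleep):
    """Call start_stream(); on failure, wait and try again up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return start_stream()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts:
                sleep(wait_seconds)
    raise last_error


# Simulated restart that fails until the old resources are gone.
calls = {"count": 0}


def flaky_start():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("notification resources still being deleted")
    return "started"


# Record the waits instead of sleeping so the example finishes immediately.
waits = []
result = start_with_retry(flaky_start, attempts=3, wait_seconds=180, sleep=waits.append)
```

In real use you would leave the sleep argument at its default and narrow the except clause to the error your stream actually raises.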