
Migrate to Auto Loader with file events

If you have existing Auto Loader streams that discover files using directory listing or legacy notifications, you can migrate them to Auto Loader with file events.

Migrate from directory listing

To migrate an Auto Loader stream using directory listing to file events:

  1. Confirm that the prerequisites for file events are satisfied.
  2. Confirm that your load path is in an external location with file events enabled and that file events work as expected.
  3. Modify your stream code to set cloudFiles.useManagedFileEvents to true. Continue using the same checkpoint location.
  4. Drop any unsupported settings from your stream code.
  5. Restart your stream. On the first run with file events enabled, Auto Loader performs a directory listing to catch up with the file events cache, securing a valid read position in the cache and storing it in the stream’s checkpoint. Subsequent runs read directly from the file events cache. See Auto Loader with file events overview.
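
The steps above amount to a one-option change in the stream definition. A minimal Scala sketch, assuming a JSON source, a hypothetical ADLS load path, and a hypothetical checkpoint location (run in a Databricks notebook where `spark` is in scope):

```scala
// Hypothetical paths; adjust for your source and checkpoint.
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  // The only required change: opt in to file events.
  .option("cloudFiles.useManagedFileEvents", "true")
  .load("abfss://landing@mystorage.dfs.core.windows.net/events/")

df.writeStream
  // Keep the same checkpoint location the stream used before the migration.
  .option("checkpointLocation", "/Volumes/catalog/schema/vol/checkpoints/events")
  .toTable("bronze.events")
```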

To migrate an Auto Loader stream from file events back to directory listing:

  1. Remove the option cloudFiles.useManagedFileEvents from your stream code.
  2. Restart your stream.
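
Reverting is the mirror image: the same stream definition with the option omitted, restarted against the original checkpoint. A sketch with hypothetical names:

```scala
// Omitting cloudFiles.useManagedFileEvents falls back to directory listing.
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .load("abfss://landing@mystorage.dfs.core.windows.net/events/")

df.writeStream
  // Continue using the same checkpoint location.
  .option("checkpointLocation", "/Volumes/catalog/schema/vol/checkpoints/events")
  .toTable("bronze.events")
```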

Migrate from legacy file notifications

Source data in S3

S3 does not allow event notification configurations with overlapping prefixes. If your source data is in S3, you must first tear down existing event notification configurations.

Legacy file notifications to file events (S3)

To migrate an Auto Loader stream consuming data from S3 using legacy file notifications to file events:

  1. Before you enable file events on your external locations, stop your Auto Loader stream and tear down the associated notification resources. You can use the tearDownNotificationResources API of CloudFilesAWSResourceManager, as described in Manually configure or manage file notification resources.
  2. Confirm that the prerequisites for file events are satisfied.
  3. Confirm that your load path is in an external location with file events enabled and that file events work as expected.
  4. Modify your stream code to set cloudFiles.useManagedFileEvents to true. Continue using the same checkpoint location.
  5. Remove unsupported settings from your stream code.
  6. Remove cloud-specific notification options (such as cloudFiles.queueUrl, databricks.serviceCredential, or cloudFiles.awsAccessKey) from your stream code.
  7. Restart your stream. On the first run with file events enabled, Auto Loader performs a directory listing to catch up with the file events cache, securing a valid read position in the cache and storing it in the stream’s checkpoint. Subsequent runs read directly from the file events cache. See Auto Loader with file events overview.
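
Step 1 above can be sketched with the Scala-only resource manager API the document references. The region and the stream ID argument are placeholders, and the exact `tearDownNotificationResources` signature should be confirmed against Manually configure or manage file notification resources:

```scala
// Tear down legacy notification resources BEFORE enabling file events,
// because S3 rejects event notification configurations with overlapping prefixes.
import com.databricks.sql.CloudFilesAWSResourceManager

val manager = CloudFilesAWSResourceManager
  .newManager
  .option("cloudFiles.region", "us-east-1") // hypothetical region
  .create()

// Find the ID of the legacy stream's resources, then remove its SQS queue
// and S3 event notification configuration.
manager.listNotificationResources().show()
manager.tearDownNotificationResources("<stream-id>")
```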

File events to legacy file notifications (S3)

To migrate an Auto Loader stream consuming data from S3 using file events to legacy file notifications:

  1. Stop your Auto Loader stream and turn off file events for the external location using the external locations UI. S3 does not allow event notification configurations with overlapping prefixes, so if you provided a queue URL when you set up file events, Databricks recommends tearing down that setup and creating a new queue to avoid missing files.
  2. Remove the option cloudFiles.useManagedFileEvents from your stream code.
  3. Set the option cloudFiles.useNotifications to true.
  4. Add the cloud-specific notification options (such as cloudFiles.queueUrl, databricks.serviceCredential, or cloudFiles.awsAccessKey) that Auto Loader requires to authenticate to your cloud, set up notification resources, and read from the queue.
  5. Start your stream. If you provided the cloudFiles.queueUrl option (a pre-configured queue), Auto Loader discovers files using that queue and also performs a one-off directory listing to make sure no files were missed during the migration. If you did not provide a queue, Auto Loader attempts to create all resources needed for notifications. If the stream fails on restart, Databricks might not have finished deleting the notification resources it created for file events. Retry in a few minutes.
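
The resulting stream definition might look like the following Scala sketch. The S3 path, queue URL, and service credential name are hypothetical:

```scala
// Hypothetical S3 path, SQS queue URL, and service credential.
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true") // replaces useManagedFileEvents
  .option("cloudFiles.queueUrl", "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue")
  .option("databricks.serviceCredential", "my-credential")
  .load("s3://my-bucket/landing/")

df.writeStream
  // Continue using the same checkpoint location.
  .option("checkpointLocation", "/Volumes/catalog/schema/vol/checkpoints/events")
  .toTable("bronze.events")
```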

Source data in Azure Storage or Google Cloud Storage

Legacy file notifications to file events (Azure, GCP)

To migrate an Auto Loader stream consuming data from Azure or GCP using legacy file notifications to file events:

  1. Confirm that the prerequisites for file events are satisfied.
  2. Confirm that your load path is in an external location with file events enabled and that file events work as expected.
  3. Modify your stream code to set cloudFiles.useManagedFileEvents to true. Continue using the same checkpoint location.
  4. Remove unsupported settings from your stream code.
  5. Remove cloud-specific notification options (such as cloudFiles.queueName, cloudFiles.subscription, databricks.serviceCredential, cloudFiles.privateKey, or cloudFiles.clientSecret) from your stream code.
  6. Restart your stream. On the first run with file events enabled, Auto Loader performs a directory listing to catch up with the file events cache, securing a valid read position in the cache and storing it in the stream’s checkpoint. Subsequent runs read directly from the file events cache. See Auto Loader with file events overview.
  7. Remove the notification resources that Auto Loader created when it ran in legacy file notifications mode. You can use the tearDownNotificationResources API of the resource manager for your cloud (CloudFilesAzureResourceManager or CloudFilesGCPResourceManager), as described in Manually configure or manage file notification resources.
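
Step 7 can be sketched with the Scala-only resource manager API, shown here for the Azure variant (GCP is analogous with CloudFilesGCPResourceManager). The authentication options vary by cloud and are elided here; the resource group, subscription ID, and stream ID are placeholders to confirm against Manually configure or manage file notification resources:

```scala
// Azure sketch: remove legacy notification resources after the migrated
// stream is running on file events. GCP uses CloudFilesGCPResourceManager.
import com.databricks.sql.CloudFilesAzureResourceManager

val manager = CloudFilesAzureResourceManager
  .newManager
  .option("cloudFiles.resourceGroup", "<resource-group>")
  .option("cloudFiles.subscriptionId", "<subscription-id>")
  // Authentication options elided; see the linked reference for your cloud.
  .create()

manager.listNotificationResources().show()
manager.tearDownNotificationResources("<stream-id>")
```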

File events to legacy file notifications (Azure, GCP)

To migrate an Auto Loader stream consuming data from Azure or GCP using file events to legacy file notifications:

  1. Remove the option cloudFiles.useManagedFileEvents from your stream code.
  2. Set the option cloudFiles.useNotifications to true.
  3. Add the cloud-specific notification options (such as cloudFiles.queueName, cloudFiles.subscription, databricks.serviceCredential, cloudFiles.privateKey, or cloudFiles.clientSecret) that Auto Loader requires to authenticate to your cloud, set up notification resources, and read from the queue.
  4. Restart your stream. If you provided the cloudFiles.queueName or cloudFiles.subscription option (a pre-configured queue), Auto Loader discovers files using that queue and also performs a one-off directory listing to make sure no files were missed during the migration. If you did not provide a queue, Auto Loader attempts to create all resources needed for notifications. If the stream fails on restart, Databricks might not have finished deleting the notification resources it created for file events. Retry in a few minutes.
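
For Azure, the resulting stream definition might look like the following Scala sketch. The ADLS path, queue name, and service credential name are hypothetical; on GCP you would use cloudFiles.subscription instead of cloudFiles.queueName:

```scala
// Hypothetical ADLS path, queue name, and service credential (Azure).
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true") // replaces useManagedFileEvents
  .option("cloudFiles.queueName", "my-existing-queue") // pre-configured queue
  .option("databricks.serviceCredential", "my-credential")
  .load("abfss://landing@mystorage.dfs.core.windows.net/events/")

df.writeStream
  // Continue using the same checkpoint location.
  .option("checkpointLocation", "/Volumes/catalog/schema/vol/checkpoints/events")
  .toTable("bronze.events")
```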