This article details some of the limitations you might encounter while working with data stored in S3 with Delta Lake on Databricks. The eventually consistent model used in Amazon S3 can lead to potential problems when multiple systems or clusters modify data in the same table simultaneously.
Databricks and Delta Lake support multi-cluster writes by default, meaning that queries writing to a table from multiple clusters at the same time won’t corrupt the table. For Delta tables stored on S3, this guarantee is limited to a single Databricks workspace.
To avoid potential data corruption and data loss issues, Databricks recommends you do not modify the same Delta table stored in S3 from different workspaces.
The following features are not supported when running in this mode:
SparkR on Databricks Runtime 7.5 and below. Writing to a Delta table using SparkR supports multi-cluster writes in Databricks Runtime 7.6 and above.
S3 paths with credentials in a cluster that cannot access AWS Security Token Service
You can disable multi-cluster writes by setting spark.databricks.delta.multiClusterWrites.enabled to false. If they are disabled, writes to a single table must originate from a single cluster. Disabling spark.databricks.delta.multiClusterWrites.enabled and modifying the same Delta table from multiple clusters concurrently can lead to data loss or data corruption.
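For example, a minimal sketch of setting this configuration in a notebook session before writing (in practice you would typically set it in the cluster's Spark configuration instead):

```python
# Disable multi-cluster writes for this session; afterwards, all writes
# to a given Delta table must originate from a single cluster.
spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")
```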
If you are using Delta Lake and you have enabled bucket versioning on the S3 bucket, you have two entities managing table files: Delta Lake and AWS. To ensure that data is fully deleted you must:
Clean up deleted files that are no longer in the Delta Lake transaction log using VACUUM (see the sketches after this list).
Enable an S3 lifecycle policy for versioned objects that ensures that old versions of deleted files are purged.
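For the first step, a minimal sketch using the Delta Lake Python API; the table path is a hypothetical example:

```python
from delta.tables import DeltaTable

# Remove files that are no longer referenced by the Delta transaction log.
# 168 hours = 7 days, the default retention period.
table = DeltaTable.forPath(spark, "s3://my-bucket/path/to/delta-table")  # hypothetical path
table.vacuum(168)
```

For the second step, a hedged sketch using boto3 (an assumption; you can also configure the lifecycle rule in the AWS console). The bucket name, prefix, and retention window are illustrative:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "purge-old-delta-file-versions",
                "Filter": {"Prefix": "path/to/delta-table/"},  # hypothetical prefix
                "Status": "Enabled",
                # Permanently delete noncurrent object versions 30 days
                # after they become noncurrent.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                # Clean up delete markers once no noncurrent versions remain.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    },
)
```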
Why does a table show old data after I delete Delta Lake files with rm -rf and create a new table in the same location?
Deletes on S3 are only eventually consistent, so after you delete a table, old versions of the transaction log may still be visible for a while. To avoid this, do not reuse a table path after deleting it. Instead, we recommend that you use transactional mechanisms such as an overwrite with the overwriteSchema option to delete and update tables, as sketched below. See Best practice to replace a table.
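A minimal sketch of such a replacement, assuming df is a DataFrame holding the new contents; the table path is a hypothetical example:

```python
# Overwrite the table transactionally instead of deleting its files with rm -rf.
# The overwriteSchema option allows the replacement data to have a different schema.
(df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("s3://my-bucket/path/to/delta-table"))  # hypothetical path
```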