This article details some of the limitations you might encounter while working with data stored in S3 with Delta Lake on Databricks. The eventually consistent model used in Amazon S3 can lead to potential problems when multiple systems or clusters modify data in the same table simultaneously.
Databricks and Delta Lake support multi-cluster writes by default, meaning that queries writing to a table from multiple clusters at the same time won’t corrupt the table. For Delta tables stored on S3, this guarantee is limited to a single Databricks workspace.
Databricks recommends against using S3 bucket versioning in conjunction with Delta Lake. The Delta Lake protocol provides native table versioning. Enabling S3 bucket versioning can lead to increased storage costs and severe slowdown during commits and reads.
To avoid potential data corruption and data loss issues, Databricks recommends you do not modify the same Delta table stored in S3 from different workspaces.
The following features are not supported when multi-cluster writes are enabled:
S3 paths with credentials in a cluster that cannot access AWS Security Token Service
You can disable multi-cluster writes by setting spark.databricks.delta.multiClusterWrites.enabled to false. If they are disabled, writes to a single table must originate from a single cluster.
Disabling spark.databricks.delta.multiClusterWrites.enabled and modifying the same Delta table from multiple clusters concurrently can lead to data loss or data corruption.
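As a minimal sketch, the setting named above could be rendered as a line for a cluster's Spark config, which takes one space-separated key-value pair per line. The helper name spark_conf_line is hypothetical, and whether the setting should be applied at the cluster level or per session is an assumption to verify for your environment.

```python
# Name of the setting as given in this article.
MULTI_CLUSTER_WRITES_KEY = "spark.databricks.delta.multiClusterWrites.enabled"


def spark_conf_line(key: str, enabled: bool) -> str:
    """Render a boolean Spark setting as a 'key value' config line (hypothetical helper)."""
    # Spark configuration values are strings, so booleans become "true"/"false".
    return f"{key} {'true' if enabled else 'false'}"


# Produce the line that disables multi-cluster writes.
disable_line = spark_conf_line(MULTI_CLUSTER_WRITES_KEY, enabled=False)
print(disable_line)
```

Only disable the setting if every write to a given table is known to come from a single cluster.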
If you are using Delta Lake and you have enabled bucket versioning on the S3 bucket, you have two entities managing table files. Databricks recommends disabling bucket versioning so that the VACUUM command can effectively remove unused data files.
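To check whether a bucket currently has versioning turned on, a sketch along these lines may help. It assumes a boto3 S3 client is passed in; the function name bucket_versioning_status and the bucket name in the usage comment are hypothetical. The GetBucketVersioning response omits the Status field when versioning has never been enabled, which the sketch reports as "Unversioned".

```python
from typing import Any


def bucket_versioning_status(s3_client: Any, bucket: str) -> str:
    """Return 'Enabled', 'Suspended', or 'Unversioned' for the given bucket."""
    response = s3_client.get_bucket_versioning(Bucket=bucket)
    # 'Status' is absent when versioning was never enabled on the bucket.
    return response.get("Status", "Unversioned")


# Hypothetical usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# print(bucket_versioning_status(boto3.client("s3"), "my-delta-bucket"))
```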
Why does a table show old data after I delete Delta Lake files with rm -rf and create a new table in the same location?
Deletes on S3 are only eventually consistent. Thus, after you delete a table, old versions of the transaction log may still be visible for a while. To avoid this, do not reuse a table path after deleting it. Instead, we recommend that you use transactional mechanisms like overwriteSchema to delete and update tables. See Best practice to replace a table.
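A minimal sketch of a transactional replacement, assuming a PySpark DataFrame holding the new contents and the path of an existing Delta table. The function name replace_delta_table is hypothetical. It overwrites the table in place through the DataFrame writer instead of deleting the files and reusing the path.

```python
from typing import Any


def replace_delta_table(df: Any, path: str) -> None:
    """Replace the Delta table at `path` with the contents of `df` in one overwrite."""
    (
        df.write.format("delta")
        .mode("overwrite")                  # replace the existing data
        .option("overwriteSchema", "true")  # let the new schema replace the old one
        .save(path)
    )


# Hypothetical usage in a notebook where `spark` is available:
# new_df = spark.range(10)
# replace_delta_table(new_df, "s3://<bucket>/<table-path>")
```

Because the overwrite is recorded in the Delta transaction log, readers move from the old table version to the new one without depending on S3 delete visibility.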