Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns and provides optimized layouts and indexes for fast interactive queries.
Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory to provide ACID transactions.
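To make the role of the transaction log concrete, here is a minimal Python sketch of how replaying an ordered list of commits yields the set of data files that make up a given table version. This is a toy in-memory model, not Delta Lake's implementation: `Commit` and `live_files` are hypothetical names, and the `add` and `remove` fields only loosely mirror the kinds of actions a commit can record.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One hypothetical log entry: files added and files logically removed."""
    add: list = field(default_factory=list)
    remove: list = field(default_factory=list)


def live_files(log, version=None):
    """Replay commits 0..version and return the files visible at that version."""
    if version is None:
        version = len(log) - 1
    files = set()
    for commit in log[: version + 1]:
        # A removed file leaves the snapshot; this model never touches storage.
        files -= set(commit.remove)
        files |= set(commit.add)
    return files


log = [
    Commit(add=["part-0.parquet", "part-1.parquet"]),          # version 0: initial write
    Commit(add=["part-2.parquet"]),                            # version 1: append
    Commit(add=["part-3.parquet"], remove=["part-0.parquet"]), # version 2: rewrite of part-0
]

print(sorted(live_files(log)))             # files in the latest snapshot
print(sorted(live_files(log, version=1)))  # files in an older snapshot
```

In this model a reader only ever sees the files named by the commits it has replayed, which is why a single atomic commit is enough to switch every reader from one consistent snapshot to the next.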
When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that location in Parquet format.
If you are using Delta Lake and you have enabled bucket versioning on the S3 bucket, you have two entities managing table files: Delta Lake and AWS. To ensure that data is fully deleted you must:
- Clean up deleted files that are no longer in the Delta Lake transaction log using VACUUM.
- Enable an S3 lifecycle policy for versioned objects that ensures that old versions of deleted files are purged.
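The two cleanup steps above can be sketched in Python as follows. This is an illustration under assumptions, not a recommended configuration: it assumes the Delta Lake cleanup command is `VACUUM`, the helper names, rule ID, bucket, and prefix are hypothetical, and the retention values (168 hours, 7 days) are placeholders to adapt to your own requirements. The helpers only build the statement and the lifecycle document; the commented lines show where they might be passed to `spark.sql` and to boto3's `put_bucket_lifecycle_configuration`.

```python
def vacuum_statement(table_path, retain_hours=168):
    """Build a VACUUM statement for a path-based Delta table."""
    return f"VACUUM delta.`{table_path}` RETAIN {retain_hours} HOURS"


def noncurrent_version_lifecycle(prefix, days=7):
    """Build an S3 lifecycle configuration that expires old object versions."""
    return {
        "Rules": [
            {
                "ID": "purge-old-delta-file-versions",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Permanently delete versions that are no longer current.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
                # Remove delete markers once no versions remain behind them.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    }


print(vacuum_statement("s3://my-bucket/events"))

# Possible usage, assuming an active SparkSession `spark` and boto3 credentials:
# spark.sql(vacuum_statement("s3://my-bucket/events"))
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket",
#     LifecycleConfiguration=noncurrent_version_lifecycle("events/"),
# )
```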
Why does a table show old data after I delete Delta Lake files with rm -rf and create a new table in the same location?
Deletes on S3 are only eventually consistent. Thus, after you delete a table, old versions of the transaction log may still be visible for a while. To avoid this, do not reuse a table path after deleting it. Instead, we recommend that you use transactional mechanisms like overwriteSchema to delete and update tables. See Best practice to replace a table.
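As a rough illustration of replacing a table transactionally instead of deleting its files, the sketch below overwrites the table in place with the `overwriteSchema` option. The function name `replace_table` and the example path are hypothetical, and `df` is assumed to be a Spark DataFrame in an environment where the Delta Lake format is available.

```python
def replace_table(df, path):
    """Overwrite the data and schema of the Delta table at `path` in one commit."""
    (
        df.write.format("delta")
        .mode("overwrite")                  # replace the existing contents
        .option("overwriteSchema", "true")  # allow the new data to change the schema
        .save(path)
    )


# Possible usage, assuming `new_df` is a Spark DataFrame:
# replace_table(new_df, "s3://my-bucket/events")
```

Because the overwrite is recorded as a new commit on the same table, the path is never left in a half-deleted state for readers to observe.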
Delta does not support the DStream API. We recommend Table streaming reads and writes.
Yes. When you use Delta Lake, you are using open Apache Spark APIs, so you can easily port your code to other Spark platforms. To port your code, replace delta format with parquet format.
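One way to keep such a port small is to name the format in a single place. The sketch below is an assumption-laden illustration: `write_table` and `read_table` are hypothetical helpers, the target platform is assumed to use plain Parquet (the format Delta Lake stores data in), and `df` and `spark` are assumed to be a Spark DataFrame and SparkSession.

```python
def write_table(df, path, table_format="delta"):
    """Append `df` to the table at `path` using the given format."""
    df.write.format(table_format).mode("append").save(path)


def read_table(spark, path, table_format="delta"):
    """Load the table at `path` using the given format."""
    return spark.read.format(table_format).load(path)


# On a platform with Delta Lake:
# write_table(df, "/data/events")
# On another Spark platform, only the format argument changes:
# write_table(df, "/data/events_parquet", table_format="parquet")
```

Note that this changes the format used for newly written data. Reading an existing Delta table directory as plain Parquet bypasses the transaction log, so it can return files that are no longer part of the table.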
Delta tables are managed to a greater degree. In particular, there are several Hive SerDe parameters that Delta Lake manages on your behalf that you should never specify manually:
- Unsupported DDL features:
  - ANALYZE TABLE PARTITION
  - ALTER TABLE [ADD|DROP] PARTITION
  - ALTER TABLE RECOVER PARTITIONS
  - ALTER TABLE SET SERDEPROPERTIES
  - CREATE TABLE LIKE
  - INSERT OVERWRITE DIRECTORY
- Unsupported DML features:
  - INSERT INTO [OVERWRITE] table with static partitions
  - INSERT OVERWRITE TABLE for table with dynamic partitions
- Specifying a schema when reading from a table
- Specifying target partitions using
Delta Lake does not support multi-table transactions or foreign keys. Delta Lake supports transactions at the table level only.
Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change column type.
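A rewrite of this kind might look like the following sketch, which reads the table, casts one column, and overwrites the table with the new schema. The function name `change_column_type` and the example table and column are hypothetical; `spark` is assumed to be a SparkSession with Delta Lake available, and the table is assumed to be registered in the metastore.

```python
def change_column_type(spark, table_name, column, new_type):
    """Rewrite a Delta table so that `column` has type `new_type`."""
    df = spark.read.table(table_name)
    # Replace the column with a cast copy of itself; other columns are unchanged.
    rewritten = df.withColumn(column, df[column].cast(new_type))
    (
        rewritten.write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")  # the schema changes, so allow it
        .saveAsTable(table_name)
    )


# Possible usage:
# change_column_type(spark, "sales", "amount", "double")
```

Dropping a column could follow the same pattern, with `df.drop(column)` in place of the cast.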
It means that Delta Lake uses locking to make sure that queries writing to a table from multiple clusters at the same time do not corrupt the table. However, it does not mean that conflicting writes (for example, an update and a delete of the same data) will both succeed. Instead, one of the writes fails atomically, and the error tells you to retry the operation.
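Since a conflicting write fails cleanly and asks to be retried, callers often wrap writes in a small retry loop. The sketch below shows one possible shape in plain Python. `WriteConflict` is a stand-in exception defined here for the demo, not a Delta Lake class; in practice you would pass the conflict exception type or types that your environment actually raises. The attempt count and backoff are arbitrary.

```python
import time


class WriteConflict(Exception):
    """Stand-in for whatever conflict error your Delta client raises."""


def with_retries(operation, conflict_errors=(WriteConflict,), attempts=3, backoff_seconds=0.0):
    """Run `operation`, retrying when it fails with a write-conflict error."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except conflict_errors:
            if attempt == attempts:
                raise  # out of attempts: surface the conflict to the caller
            time.sleep(backoff_seconds * attempt)  # linear backoff before retrying


# Demo: an operation that loses the first race and succeeds on the second try.
calls = {"n": 0}


def flaky_update():
    calls["n"] += 1
    if calls["n"] == 1:
        raise WriteConflict("a concurrent delete touched the same data")
    return "committed"


print(with_retries(flaky_update))
```

The operation passed in should re-read the table state on each attempt, so that the retry is computed against the snapshot that includes the winning write.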
The following features are not supported when running in this mode:
- SparkR using Databricks Runtime 7.5 and below. Writing to a Delta table using SparkR in Databricks Runtime 7.6 and above supports multi-cluster writes.
- spark-submit jobs using Databricks Runtime 7.2 and below. Running a spark-submit job using Databricks Runtime 7.3 and above supports multi-cluster writes.
- Server-Side Encryption with Customer-Provided Encryption Keys
- S3 paths with credentials in a cluster that cannot access AWS Security Token Service
You can disable multi-cluster writes by setting spark.databricks.delta.multiClusterWrites.enabled to false. If they are disabled, writes to a single table must originate from a single cluster.
You cannot concurrently modify the same Delta table from different workspaces.
The following cases are not recommended because they can break ACID guarantees and cause data corruption or data loss:
- Modify the same Delta table from different workspaces concurrently.
- Disable spark.databricks.delta.multiClusterWrites.enabled but modify the same Delta table from multiple clusters concurrently.
There are two cases to consider: external reads and external writes.
- External reads: Delta tables store data encoded in an open format (Parquet), allowing other tools that understand this format to read the data. For information on how to read Delta tables, see Integrations.
- External writes: Delta Lake maintains additional metadata in a transaction log to enable ACID transactions and snapshot isolation for readers. To ensure the transaction log is updated correctly and the proper validations are performed, writer implementations must strictly adhere to the Delta Transaction Protocol. Delta Lake in Databricks Runtime ensures ACID guarantees based on the Delta Transaction Protocol. Whether non-Spark Delta connectors that write to Delta tables can write with ACID guarantees depends on the connector implementation. For information, see Integrations and the integration-specific documentation on their write guarantees.