Frequently Asked Questions (FAQ)

What is Delta Lake?

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns and provides optimized layouts and indexes for fast interactive queries.

What format does Delta Lake use to store data?

Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versioned data files, Delta Lake stores a transaction log that keeps track of all the commits made to the table or blob store directory, which is what provides ACID transactions.

How can I read and write data with Delta Lake?

You can use your favorite Apache Spark APIs to read and write data with Delta Lake. See Read a table and Write to a table.
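For example, a minimal PySpark sketch; the path and the toy DataFrame are placeholders, and an existing SparkSession named spark is assumed:

    # Write a DataFrame as a Delta Lake table (the path is a placeholder).
    df = spark.range(5)
    df.write.format("delta").save("/tmp/delta/events")

    # Read it back with the same DataFrame API.
    events = spark.read.format("delta").load("/tmp/delta/events")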

Where does Delta Lake store the data?

When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that location in Parquet format.
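For instance, a minimal sketch that writes to an explicit storage location; the bucket and prefix are placeholders:

    # Any supported cloud storage path works here.
    df.write.format("delta").mode("append").save("s3a://my-bucket/delta/events")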

Can I stream data directly into Delta Lake tables?

Yes, you can use Structured Streaming to directly write data into Delta Lake tables. See Stream data into Delta Lake tables.
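A minimal Structured Streaming sketch, assuming streaming_df is an existing streaming DataFrame and the paths are placeholders:

    # Continuously append the stream to a Delta Lake table.
    (streaming_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/tmp/delta/events/_checkpoints/stream1")
        .start("/tmp/delta/events"))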

When I use Delta Lake, will I be able to port my code to other Spark platforms easily?

Yes. When you use Delta Lake, you are using open Apache Spark APIs, so you can easily port your code to other Spark platforms. To port your code, replace the delta format with the parquet format, as in the sketch below.
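For example (the paths are placeholders):

    # On Databricks with Delta Lake:
    df.write.format("delta").save("/tmp/delta/events")

    # On another Spark platform, the same code with only the format changed:
    df.write.format("parquet").save("/tmp/delta/events")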

How do Delta Lake tables compare to Hive SerDe tables?

Delta Lake tables are managed to a greater degree. In particular, there are several Hive SerDe parameters that Delta Lake manages on your behalf and that you should never specify manually (see the sketch after this list):

  • ROW FORMAT
  • SERDE
  • OUTPUTFORMAT and INPUTFORMAT
  • COMPRESSION
  • STORED AS
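As an illustrative sketch, a Delta table is typically created without any of those clauses; the table name and columns are placeholders:

    # Delta Lake manages the storage format and layout; no SerDe clauses are needed.
    spark.sql("""
        CREATE TABLE events (
            date DATE,
            eventId STRING,
            data STRING
        ) USING DELTA
    """)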

Does Delta Lake support multi-table transactions?

Delta Lake does not support multi-table transactions and foreign keys. Delta Lake supports transactions at the table level.

What DDL and DML features does Delta Lake not support?

  • Unsupported DDL features:
    • ANALYZE TABLE PARTITION
    • ALTER TABLE [ADD|DROP] PARTITION
    • ALTER TABLE SET LOCATION
    • ALTER TABLE RECOVER PARTITIONS
    • ALTER TABLE SET SERDEPROPERTIES
    • CREATE TABLE LIKE
    • INSERT OVERWRITE DIRECTORY
    • LOAD DATA
  • Unsupported DML features:
    • INSERT INTO [OVERWRITE] with static partitions.
    • Bucketing.
    • Specifying a schema when reading from a table. A command such as spark.read.format("delta").schema(df.schema).load(path) will fail (see the sketch after this list).
    • Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE.
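For instance, the schema always comes from the table itself; the path below is a placeholder:

    # Unsupported: passing a schema when reading a Delta table fails.
    # spark.read.format("delta").schema(df.schema).load("/tmp/delta/events")

    # Supported: read without a schema; Delta Lake supplies it from the table.
    events = spark.read.format("delta").load("/tmp/delta/events")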

How can I change the type of a column?

Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change column type.
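A common rewrite sketch, assuming a table at a placeholder path with a column named eventId whose type should change:

    from pyspark.sql.functions import col

    # Read the table, cast the column, and overwrite the table in place,
    # allowing the schema to change with overwriteSchema.
    (spark.read.format("delta").load("/tmp/delta/events")
        .withColumn("eventId", col("eventId").cast("int"))
        .write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .save("/tmp/delta/events"))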

When should I use partitioning with Delta Lake tables?

You can partition a Delta Lake table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding which column to partition by (a short sketch follows the list):

  • Cardinality of the column: the number of distinct values the column will have. If the cardinality of a column is very high, do not use that column for partitioning. For example, if you partition by a column userId and there can be 1M distinct user IDs, that is a bad partitioning strategy.
  • Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
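For example, a partitioned write sketch (the path and column are placeholders):

    # Partition by a low-cardinality column such as date.
    (df.write.format("delta")
        .partitionBy("date")
        .save("/tmp/delta/events"))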

When should I use Z-ORDER BY with Delta Lake tables?

If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use Z-ORDER BY. Delta will automatically lay out the data in the files based on the column values and use the layout information to skip irrelevant data while querying.
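A sketch using the OPTIMIZE command on Databricks; the table name and column are placeholders:

    # Co-locate data by a high-cardinality column that appears in query predicates.
    spark.sql("OPTIMIZE events ZORDER BY (eventId)")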

What does it mean that Delta Lake supports multi-cluster writes?

It means that Delta Lake does locking to make sure that queries writing to a table from multiple clusters at the same time won’t corrupt the table. However, it does not mean that if there is a write conflict (for example, updating and deleting the same thing) both writes will succeed. Instead, one of the writes will fail atomically and the error will tell you to retry the operation.

What are the limitations of multi-cluster writes?

Delta supports transactional writes from multiple clusters in the same workspace in Databricks Runtime 4.2 and above. All writers must be running Databricks Runtime 4.2 or above. Some features are not supported when running in this mode.

You can disable multi-cluster writes by setting spark.databricks.delta.multiClusterWrites.enabled to false, as shown in the sketch below. If multi-cluster writes are disabled, writes to a single table must originate from a single cluster.
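For example, a sketch that applies the setting for the current session (it can also be set in the cluster’s Spark configuration):

    # Disable multi-cluster writes for this session.
    spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")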

Warning

  • You cannot concurrently modify the same Delta Lake table from different workspaces.
  • Writes to a single table using Databricks Runtime versions lower than 4.2 must originate from a single cluster. To perform transactional writes from multiple clusters in the same workspace you must upgrade to Databricks Runtime 4.2.

Why is Delta Lake data I deleted still stored in S3?

If you are using Delta Lake and have enabled bucket versioning, you have two entities managing table files: Delta Lake and AWS. To ensure that data is fully deleted, you must:

  • Clean up deleted files that are no longer in the Delta Lake transaction log using VACUUM (see the sketch after this list).
  • Enable an S3 lifecycle policy for versioned objects that ensures that old versions of deleted files are purged.
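A VACUUM sketch for the first step; the path and retention window are placeholders:

    # Remove files that are no longer referenced by the transaction log
    # and are older than the retention window.
    spark.sql("VACUUM delta.`/tmp/delta/events` RETAIN 168 HOURS")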

Why does a table show old data after I delete Delta Lake files with rm -rf and create a new table in the same location?

Deletes on S3 are only eventually consistent. Thus, after deleting a table, old versions of the transaction log may still be visible for a while. To avoid this, do not reuse a table path after deleting it. Instead, we recommend that you use transactional mechanisms such as DELETE FROM, overwrite, and overwriteSchema to delete and update tables, as in the sketch below.
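For example, a sketch of the transactional alternatives; the path, predicate, and new_df DataFrame are placeholders:

    # Delete rows transactionally instead of removing files by hand.
    spark.sql("DELETE FROM delta.`/tmp/delta/events` WHERE date < '2017-01-01'")

    # Or replace the table contents (and, if needed, its schema) in place.
    (new_df.write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .save("/tmp/delta/events"))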

Can I access Delta Lake tables outside of Databricks Runtime?

There are two cases to consider: external writes and external reads.

  • External writes: Delta maintains additional metadata in the form of a transaction log to enable ACID transactions and snapshot isolation for readers. In order to ensure the transaction log is updated correctly and the proper validations are performed, writes must go through Databricks Runtime.

  • External reads: Delta Lake tables store data encoded in an open format (Parquet), allowing other tools that understand this format to read the data. However, since other tools do not support Delta Lake’s transaction log, it is likely that they will incorrectly read stale deleted data, uncommitted data, or the partial results of failed transactions.

    In cases where the data is static (that is, there are no active jobs writing to the table), you can use VACUUM with a retention of zero hours to clean up any stale Parquet files that are not currently part of the table (see the sketch at the end of this answer). This operation puts the Parquet files present in DBFS into a consistent state such that they can now be read by external tools.

    However, Delta Lake relies on stale snapshots for the following functionality, which will break when using VACUUM with zero retention allowance:

    • Snapshot isolation for readers - Long-running jobs will continue to read a consistent snapshot from the moment the jobs started, even if the table is modified concurrently. Running VACUUM with a retention shorter than the length of these jobs can cause them to fail with a FileNotFoundException.
    • Streaming from Delta Lake tables - Streams read from the original files written into a table in order to ensure exactly-once processing. When combined with OPTIMIZE, VACUUM with zero retention can remove these files before the stream has time to process them, causing it to fail.

    For these reasons, we recommend using this technique only on static data sets that must be read by external tools.
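    A zero-retention VACUUM sketch for this static-data case; the path is a placeholder, and the retention check setting shown is an assumption to verify against your runtime:

        # Allow a retention shorter than the default safety threshold
        # (assumed setting; verify it before disabling).
        spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

        # Remove every file not referenced by the current table version.
        spark.sql("VACUUM delta.`/tmp/delta/events` RETAIN 0 HOURS")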