Introduction to Databricks Delta

Note

Databricks Delta is in Private Preview. Contact your account manager or go to https://databricks.com/product/databricks-delta to request access.

Databricks Delta delivers a transactional storage layer built on Apache Spark and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that:

  • Stores data as Parquet files in DBFS.
  • Maintains a transaction log that efficiently tracks changes to the table.

You read and write data stored in the delta format using the same familiar Apache Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories. With the addition of the transaction log and other enhancements, Databricks Delta offers significant benefits:

ACID transactions
  • Multiple writers can simultaneously modify a dataset and see consistent views.
  • Writers can modify a dataset without interfering with jobs reading the dataset.
Fast read access
  • Automatic file management organizes data into large files that can be read efficiently.
  • Statistics speed up reads by 10-100x, and data skipping avoids reading irrelevant information.
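
For example, here is a minimal sketch of writing and reading a table stored in the delta format with the batch API (the DBFS path and DataFrame name are illustrative):

    # Write a DataFrame to DBFS as Parquet files managed by the delta transaction log.
    events.write.format("delta").save("/delta/events")

    # Read it back with the same batch API used for Parquet files and Hive tables.
    events_df = spark.read.format("delta").load("/delta/events")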

Requirements

Databricks Delta requires Databricks Runtime 4.1 or above. If you created a Databricks Delta table using a Databricks Runtime version lower than 4.1, you must upgrade the table version. For details, see Table Versioning.

Frequently asked questions (FAQ)

How do Databricks Delta tables compare to Hive SerDe tables?

Databricks Delta tables are managed to a greater degree than Hive SerDe tables. In particular, Databricks Delta manages several Hive SerDe parameters on your behalf, and you should never specify them manually:

  • ROW FORMAT
  • SERDE
  • OUTPUTFORMAT and INPUTFORMAT
  • COMPRESSION
  • STORED AS
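
For example, a hypothetical events table can be created with a plain CREATE TABLE ... USING DELTA statement, with no SerDe, STORED AS, or compression clauses (the table name and columns are illustrative):

    # Databricks Delta manages the storage format, compression, and file layout,
    # so the DDL only declares the columns and the delta data source.
    spark.sql("""
        CREATE TABLE events (date DATE, eventId STRING, eventType STRING)
        USING DELTA
    """)
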
Does Databricks Delta support multi-table transactions?
Databricks Delta does not support multi-table transactions or foreign keys. Databricks Delta supports transactions at the table level.
Does Databricks Delta support writes or reads using the Spark Streaming DStream API?
Databricks Delta does not support the DStream API. We recommend Structured Streaming.
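
For example, a minimal Structured Streaming sketch that ingests JSON files into a table stored in the delta format (the source path, input schema, and checkpoint location are illustrative):

    # Read a stream of JSON files; file sources require an explicit input schema.
    stream = (spark.readStream
        .format("json")
        .schema(input_schema)  # assumed schema of the incoming files
        .load("/data/incoming"))

    # Write the stream to a delta table, checkpointing progress so the query can be restarted.
    (stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/delta/events/_checkpoints/ingest")
        .start("/delta/events"))
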
What DDL and DML features does Databricks Delta not support?
  • Unsupported DDL features:
    • ANALYZE TABLE PARTITION
    • ALTER TABLE [ADD|DROP] PARTITION
    • ALTER TABLE SET LOCATION
    • ALTER TABLE RECOVER PARTITIONS
    • ALTER TABLE SET SERDEPROPERTIES
    • CREATE TABLE LIKE
    • INSERT OVERWRITE DIRECTORY
    • LOAD DATA
    • TRUNCATE
  • Unsupported DML features:
    • INSERT INTO [OVERWRITE] with static partitions.
    • Subqueries in the WHERE conditions of UPDATE and DELETE.
    • Bucketing.
    • Specifying a schema when reading from a table. A command such as spark.read.format("delta").schema(df.schema).load(path) will fail.
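
For example, since the delta format records the table schema in the transaction log, read a table without supplying a schema (the path is illustrative):

    # Correct: let Databricks Delta resolve the schema from the transaction log.
    df = spark.read.format("delta").load("/delta/events")

    # Incorrect: supplying an explicit schema to a delta read fails.
    # spark.read.format("delta").schema(df.schema).load("/delta/events")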
What are the limitations of transactional writes?

Databricks Delta supports transactional writes from multiple clusters in the same workspace in Databricks Runtime 4.2 and above. All writers must be running Databricks Runtime 4.2 or above. The following features are not supported when running in this mode:

You can disable multi-cluster writes by setting spark.databricks.delta.multiClusterWrites.enabled to false. If multi-cluster writes are disabled, writes to a single table must originate from a single cluster.
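
For example, a minimal sketch of disabling multi-cluster writes for the current session (the setting can also be applied in the cluster's Spark configuration):

    # Disable multi-cluster writes; writes to a given table must then originate from a single cluster.
    spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")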

Warning

  • You cannot concurrently modify the same Databricks Delta table from different workspaces.
  • Writes to a single table using Databricks Runtime versions lower than 4.2 must originate from a single cluster. To perform transactional writes from multiple clusters in the same workspace, you must upgrade to Databricks Runtime 4.2.