What are all the Delta things in Databricks?

This article is an introduction to the technologies collectively branded Delta on Databricks. Delta refers to technologies related to or in the Delta Lake open source project.

This article answers:

What are the Delta technologies in Databricks?
What do they do, and what are they used for?
How are they related to and distinct from one another?

What are the Delta things used for?

Delta is a term introduced with Delta Lake, the foundation for storing data and tables in the Databricks lakehouse. Delta Lake was conceived as a unified data management system for transactional real-time and batch big data. It extends Parquet data files with a file-based transaction log that provides ACID transactions and scalable metadata handling.
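To make the idea of a file-based transaction log more concrete, here is a minimal sketch in plain Python. It is not the Delta Lake implementation. It assumes a toy log directory in which each commit is one JSON file named after a zero-padded version number, and it relies on exclusive file creation so that only one writer can claim a given version. The helper names `commit` and `latest_version` are hypothetical, and a real storage layer also has to keep readers from seeing partially written files, which this sketch ignores.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to publish `actions` as commit `version`; return True on success.

    Mode "x" creates the file only if it does not already exist, so at most
    one writer can claim a given version number (toy put-if-absent).
    """
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x", encoding="utf-8") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
    except FileExistsError:
        return False  # another writer already committed this version
    return True


def latest_version(log_dir):
    """Return the highest committed version, or -1 if the log is empty."""
    if not os.path.isdir(log_dir):
        return -1
    versions = [int(n[:-5]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


with tempfile.TemporaryDirectory() as table_dir:
    log_dir = os.path.join(table_dir, "_delta_log")
    first = commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
    # A second writer that also targets version 0 loses the race.
    second = commit(log_dir, 0, [{"add": {"path": "part-0001.parquet"}}])
    # The losing writer re-reads the log and retries at the next version.
    retry = commit(log_dir, latest_version(log_dir) + 1,
                   [{"add": {"path": "part-0001.parquet"}}])
    final_version = latest_version(log_dir)
```

In this sketch a commit either appears in the log as a whole file or not at all, which is the intuition behind atomic writes on top of immutable data files.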

Delta Lake: Open source data management for the lakehouse

Delta Lake is an open source storage layer that brings reliability to data lakes by adding a transactional layer on top of data kept in cloud object storage (AWS S3, Azure Storage, and GCS). It supports ACID transactions, data versioning, and rollback, and it lets you handle batch and streaming data in a unified way.

Delta tables are built on top of this storage layer and provide a table abstraction, making it easy to work with large-scale structured data using SQL and the DataFrame API.

Delta tables: Default data table architecture

The Delta table is the default table format in Databricks and a feature of the Delta Lake open source data framework. Delta tables are typically used for data lakes, where data is ingested via streaming or in large batches.

See:

Delta Lake quickstart: Create a table
Updating and modifying Delta Lake tables
DeltaTable class: Main class for interacting programmatically with Delta tables.

Lakeflow Declarative Pipelines: Data pipelines

Lakeflow Declarative Pipelines manages the flow of data between many Delta tables, which simplifies ETL development and management for data engineers. The pipeline is the main unit of execution. Lakeflow Declarative Pipelines offers declarative pipeline development, improved data reliability, and cloud-scale production operations. You can run both batch and streaming operations on the same table, and the data is immediately available for querying. You define the transformations to perform on your data, and Lakeflow Declarative Pipelines manages task orchestration, cluster management, monitoring, data quality, and error handling. Enhanced autoscaling in Lakeflow Declarative Pipelines can handle streaming workloads that are spiky and unpredictable.
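The declarative idea can be illustrated with a small, self-contained Python sketch. This is not the Lakeflow Declarative Pipelines API. `ToyPipeline`, its `table` decorator, and the table names are hypothetical and exist only to show the division of labor: you declare each table and what it depends on, and the framework works out the execution order.

```python
from graphlib import TopologicalSorter


class ToyPipeline:
    """Hypothetical stand-in for a declarative pipeline framework."""

    def __init__(self):
        self._tables = {}  # table name -> (upstream names, transformation)

    def table(self, *, depends_on=()):
        """Register the decorated function as the definition of a table."""
        def register(fn):
            self._tables[fn.__name__] = (tuple(depends_on), fn)
            return fn
        return register

    def run(self, sources):
        """Materialize every declared table in dependency order."""
        results = dict(sources)
        graph = {name: set(deps) for name, (deps, _) in self._tables.items()}
        for name in TopologicalSorter(graph).static_order():
            if name in self._tables:  # source tables are already in results
                deps, fn = self._tables[name]
                results[name] = fn(*(results[d] for d in deps))
        return results


pipeline = ToyPipeline()


# Declared first, but it runs second because it depends on clean_orders.
@pipeline.table(depends_on=["clean_orders"])
def daily_totals(clean_orders):
    totals = {}
    for row in clean_orders:
        totals[row["day"]] = totals.get(row["day"], 0) + row["amount"]
    return totals


@pipeline.table(depends_on=["raw_orders"])
def clean_orders(raw_orders):
    # A simple quality rule: drop rows with a missing or negative amount.
    return [r for r in raw_orders
            if r.get("amount") is not None and r["amount"] >= 0]


raw = [
    {"day": "2024-01-01", "amount": 10},
    {"day": "2024-01-01", "amount": None},
    {"day": "2024-01-02", "amount": 5},
    {"day": "2024-01-01", "amount": 7},
]
tables = pipeline.run({"raw_orders": raw})
```

The order in which the tables are declared does not matter here. The dependency graph determines the run order, which is the part of the work a declarative framework takes over from hand-written orchestration code.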

See the Lakeflow Declarative Pipelines tutorial.

Delta tables vs. Lakeflow Declarative Pipelines

A Delta table is a way to store data in tables, whereas Lakeflow Declarative Pipelines lets you describe declaratively how data flows between those tables. Lakeflow Declarative Pipelines is a declarative framework that manages many Delta tables by creating them and keeping them up to date. In short, Delta tables are a data table architecture, while Lakeflow Declarative Pipelines is a data pipeline framework.

Delta: Open source or proprietary?

A strength of the Databricks platform is that it doesn't lock customers into proprietary tools: much of the technology is powered by open source projects, to which Databricks contributes.

The Delta OSS projects are examples:

Delta Lake project: Open source storage for a lakehouse.
Delta Sharing protocol: Open protocol for secure data sharing.

Lakeflow Declarative Pipelines is a proprietary framework in Databricks.

What are the other Delta things on Databricks?

Below are descriptions of other features that include Delta in their name.

Delta Sharing

An open standard for secure data sharing, Delta Sharing enables data sharing between organizations regardless of their compute platform.

Delta engine

A query optimizer for big data that uses Delta Lake open source technology included in Databricks. Delta engine optimizes the performance of Spark SQL, Databricks SQL, and DataFrame operations by pushing computation to the data.

Delta Lake transaction log (AKA DeltaLogs)

A single source of truth tracking all changes that users make to the table and the mechanism through which Delta Lake guarantees atomicity. See the Delta transaction log protocol on GitHub.

The transaction log is key to understanding Delta Lake, because it is the common thread that runs through many of its most important features:

ACID transactions
Scalable metadata handling
Time travel
And more.
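How one log can back several of these features can be sketched in a few lines of Python. The sketch is a simplification, not the Delta Lake reader. It assumes commits are held in memory as a mapping from version number to a list of actions, and it models only the `add` and `remove` actions with a `path` field. The function name `snapshot` and its `as_of` parameter are hypothetical.

```python
def snapshot(commits, as_of=None):
    """Replay commits in version order and return the set of live data files.

    `commits` maps a version number to its list of actions. `as_of` stops the
    replay at an earlier version, which is the idea behind time travel.
    """
    live = set()
    for version in sorted(commits):
        if as_of is not None and version > as_of:
            break
        for action in commits[version]:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


commits = {
    0: [{"add": {"path": "part-a.parquet"}}],
    1: [{"add": {"path": "part-b.parquet"}}],
    # Version 2 rewrites part-a: the old file is removed, a new one is added.
    2: [{"remove": {"path": "part-a.parquet"}},
        {"add": {"path": "part-c.parquet"}}],
}

current = snapshot(commits)           # files that make up the latest version
previous = snapshot(commits, as_of=1)  # files that made up version 1
```

Because readers derive the table state from the log rather than from a directory listing, an older version stays readable as long as its data files have not been physically deleted.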
