What are all the Delta things in Databricks?
This article is an introduction to the technologies collectively branded Delta on Databricks. Delta refers to technologies that are part of, or related to, the Delta Lake open source project.
This article answers:
What are the Delta technologies in Databricks?
What do they do, and what are they used for?
How are they related to and distinct from one another?
What are the Delta things used for?
Delta is a term introduced with Delta Lake, the foundation for storing data and tables in the Databricks Lakehouse Platform. Delta Lake was conceived as a unified data management system for transactional real-time and batch big data. It extends Parquet data files with a file-based transaction log that provides ACID transactions and scalable metadata handling.
Delta Lake: Open source data management for the lakehouse
Delta Lake is an open-source storage layer that brings reliability to data lakes by adding a transactional layer on top of data kept in cloud storage (AWS S3, Azure Storage, and GCS). It provides ACID transactions, data versioning, and rollback, and it lets you handle batch and streaming data in a unified way.
Delta tables are built on top of this storage layer and provide a table abstraction, making it easy to work with large-scale structured data using SQL and the DataFrame API.
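As a rough illustration of the batch, streaming, and versioning points above, the following PySpark sketch reads and writes a Delta table through the DataFrame API. It assumes an existing SparkSession with Delta Lake available (as on a Databricks cluster). The function names and the `path` argument are hypothetical and chosen only for this example.

```python
def append_batch(df, path):
    """Append a DataFrame to the Delta table stored at `path` as one commit."""
    df.write.format("delta").mode("append").save(path)


def read_version(spark, path, version):
    """Read the table as it was at an earlier version (time travel)."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


def read_stream(spark, path):
    """Read the same table incrementally as a streaming source."""
    return spark.readStream.format("delta").load(path)
```

The same table location serves all three functions, which is the sense in which batch and streaming access are unified.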
Delta tables: Default data table architecture
The Delta table is the default data table format in Databricks and a feature of the Delta Lake open source data framework. Delta tables are typically used for data lakes, where data is ingested via streaming or in large batches.
DeltaTable class: Main class for interacting programmatically with Delta tables.
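A minimal sketch of programmatic access through the DeltaTable class, assuming the `delta-spark` Python package and an existing SparkSession. The helper names and the `path` and `condition` arguments are hypothetical. The import sits inside each function only so that the snippet loads in environments where Delta Lake is not installed.

```python
def latest_history(spark, path, limit=5):
    """Return a DataFrame describing the most recent commits of the table at `path`."""
    from delta.tables import DeltaTable  # requires the delta-spark package

    table = DeltaTable.forPath(spark, path)
    return table.history(limit)


def delete_where(spark, path, condition):
    """Delete the rows matching a SQL condition string, for example "amount < 0"."""
    from delta.tables import DeltaTable  # requires the delta-spark package

    DeltaTable.forPath(spark, path).delete(condition)
```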
Delta Live Tables: Data pipelines
Delta Live Tables manages the flow of data between many Delta tables, simplifying the work of data engineers on ETL development and management. The pipeline is the main unit of execution for Delta Live Tables. Delta Live Tables offers declarative pipeline development, improved data reliability, and cloud-scale production operations. Users can perform both batch and streaming operations on the same table, and the data is immediately available for querying. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. Delta Live Tables Enhanced Autoscaling can handle streaming workloads that are spiky and unpredictable.
Delta tables vs. Delta Live Tables
A Delta table is a way to store data in tables, whereas Delta Live Tables lets you describe declaratively how data flows between those tables. Delta Live Tables is a declarative framework that manages many Delta tables by creating them and keeping them up to date. In short, Delta tables are a data table architecture, while Delta Live Tables is a data pipeline framework.
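To make the word "declarative" concrete, here is a toy model in plain Python. It is not the Delta Live Tables API. The `Pipeline` class, its `table` decorator, and the sample tables are all hypothetical. Each table is declared as a function together with the names of the tables it reads, and the runner works out the execution order.

```python
from graphlib import TopologicalSorter


class Pipeline:
    """Toy declarative pipeline: tables are declared, the runner picks the order."""

    def __init__(self):
        self._tables = {}  # table name -> (names of tables it reads, function)

    def table(self, *depends_on):
        """Decorator that registers a function as the definition of a table."""
        def register(fn):
            self._tables[fn.__name__] = (depends_on, fn)
            return fn
        return register

    def run(self):
        """Build every table after the tables it reads from."""
        graph = {name: set(deps) for name, (deps, _) in self._tables.items()}
        results = {}
        for name in TopologicalSorter(graph).static_order():
            deps, fn = self._tables[name]
            results[name] = fn(*(results[d] for d in deps))
        return results


pipeline = Pipeline()


# Declared before its input on purpose: declaration order does not matter.
@pipeline.table("raw_orders")
def large_orders(raw_orders):
    return [order for order in raw_orders if order["amount"] >= 100]


@pipeline.table()
def raw_orders():
    return [{"id": 1, "amount": 50}, {"id": 2, "amount": 250}]


@pipeline.table("large_orders")
def large_order_count(large_orders):
    return len(large_orders)


tables = pipeline.run()
```

The author of the pipeline states only what each table contains and which tables it reads. Ordering is the runner's job, and in the real framework so are orchestration, monitoring, and error handling.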
Delta: Open source or proprietary?
A strength of the Databricks platform is that it doesn’t lock customers into proprietary tools: much of the technology is powered by open source projects, which Databricks contributes to.
The Delta OSS projects are examples:
Delta Lake project: Open source storage for the Lakehouse.
Delta Sharing protocol: Open protocol for secure data sharing.
Delta Live Tables is a proprietary framework in Databricks.
What are the other Delta things on Databricks?
Below are descriptions of other features that include Delta in their name.
Delta Sharing
An open standard for secure data sharing, Delta Sharing enables data sharing between organizations regardless of their compute platform.
Delta engine
A query optimizer for big data that uses Delta Lake open source technology included in Databricks. Delta engine optimizes the performance of Spark SQL, Databricks SQL, and DataFrame operations by pushing computation to the data.
Delta Lake transaction log (AKA DeltaLogs)
A single source of truth tracking all changes that users make to the table and the mechanism through which Delta Lake guarantees atomicity. See the Delta transaction log protocol on GitHub.
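The idea can be sketched with a toy log in plain Python. It is loosely modeled on the protocol's numbered JSON commit files and its add and remove actions, but it is heavily simplified. The real log also records metadata, protocol versions, and checkpoints, and the real actions carry more fields. The function names below are hypothetical, and exclusive file creation stands in for the atomic commit step.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Write one commit file; raises FileExistsError if that version is taken."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" creates the file exclusively, so two writers cannot both
    # claim the same version number.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def active_files(log_dir, as_of=None):
    """Replay commits in order and return the data files that make up the table."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
# Version 2 replaces part-0 with a rewritten file in a single commit.
commit(log_dir, 2, [{"remove": {"path": "part-0.parquet"}},
                    {"add": {"path": "part-2.parquet"}}])
```

Replaying the log up to a chosen version yields the table as of that version, which is how a log of this shape can support versioning and rollback. A reader never sees half of version 2, because the removal and the addition live in the same commit file.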
The transaction log is key to understanding Delta Lake, because it is the common thread that runs through many of its most important features:
Scalable metadata handling