
What is DLT?

DLT is a declarative framework for developing and running batch and streaming data pipelines in SQL and Python. DLT runs on the performance-optimized Databricks Runtime (DBR), and the DLT flows API uses the same DataFrame API as Apache Spark and Structured Streaming. Common use cases for DLT include:

  • Incremental data ingestion from sources such as cloud storage (Amazon S3, Azure ADLS Gen2, Google Cloud Storage) and message buses (Apache Kafka, Amazon Kinesis, Google Pub/Sub, Azure Event Hubs, Apache Pulsar).
  • Incremental batch and streaming transformations with stateless and stateful operators.
  • Real-time stream processing between transactional stores, such as message buses and databases.
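As a rough illustration, a pipeline's Python source might declare an ingestion step and a downstream transformation like the sketch below. The bucket path, table names, and column names are hypothetical, and `spark` refers to the session the pipeline runtime provides.

```python
import dlt
from pyspark.sql.functions import col

# Incrementally ingest JSON files from cloud storage with Auto Loader.
@dlt.table(name="raw_orders", comment="Orders ingested incrementally from cloud storage.")
def raw_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/orders/")  # hypothetical source path
    )

# An incremental streaming transformation over the ingested data.
@dlt.table(name="cleaned_orders", comment="Orders with basic cleansing applied.")
def cleaned_orders():
    return (
        spark.readStream.table("raw_orders")
        .where(col("order_total") > 0)  # hypothetical column
    )
```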

What are the benefits of DLT?

The declarative nature of DLT provides the following benefits compared to data pipelines built with Apache Spark or Spark Structured Streaming and orchestrated with Databricks Jobs:

  • Automatic Orchestration: A DLT pipeline orchestrates processing steps (called "flows") automatically to ensure the correct order of execution and the maximum level of parallelism for optimal performance. Additionally, DLT pipelines automatically and efficiently retry transient failures. The retry process begins with the most granular and cost-effective unit: the Spark task. If the task-level retry fails, DLT proceeds to retry the flow, and then finally the entire pipeline if necessary.
  • Declarative Processing: DLT provides declarative functions that can reduce hundreds or even thousands of lines of manual Spark and Structured Streaming code to only a few lines. For example, the Apply Changes API simplifies processing of Change Data Capture (CDC) events with support for both SCD Type 1 and SCD Type 2 (see the sketch after this list). It eliminates the need for manual code to handle out-of-order events and does not require an understanding of streaming semantics or concepts like watermarks. Another example is Enzyme, Databricks' incremental processing engine for materialized view flows. To use it, you write your transformation logic with batch semantics, and Enzyme processes only new data and changes in the data sources whenever possible. This avoids inefficient reprocessing when the sources change and eliminates the need for manual code to handle incremental processing.
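
For illustration, here is a minimal sketch of the Apply Changes API in Python; the source table, key, and sequencing column are hypothetical.

```python
import dlt
from pyspark.sql.functions import col

# Create the target streaming table, then declare how CDC events apply to it.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",           # target streaming table
    source="customers_cdc_feed",  # CDC event stream defined elsewhere (hypothetical)
    keys=["customer_id"],         # key used to match change events to rows
    sequence_by=col("event_ts"),  # orders out-of-order events
    stored_as_scd_type=2,         # keep full history (SCD Type 2); use 1 for SCD Type 1
)
```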

Key Concepts

The diagram below illustrates the most important concepts of DLT.

[Diagram: how the core concepts of DLT relate to each other at a high level]

A flow is the foundational data processing concept in DLT, and it supports both streaming and batch semantics. A flow reads data from a source, applies user-defined processing logic, and writes the result into a target. DLT shares some of the same flow types as Spark Structured Streaming: specifically, the Append, Update, and Complete streaming flows. For more details, see output modes in Structured Streaming.

DLT also provides additional flow types:

  • Apply Changes is a unique streaming flow in DLT that handles out-of-order CDC events and supports both SCD Type 1 and SCD Type 2.
  • Materialized View is a unique batch flow in DLT that processes only new data and changes in the sources whenever possible.

A Streaming table is a form of Unity Catalog managed table, and is a streaming target for DLT. A streaming table can have one or more streaming flows (Append, Update, Complete, Apply Changes) written into it. Apply Changes is a unique streaming flow that’s only available to streaming tables. You can define streaming flows explicitly and separately from their target streaming table. You can also define streaming flows implicitly as part of a streaming table definition.
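As an illustrative sketch, the following defines a streaming table without an implicit flow and then adds two explicit append flows that write into it (the table and source names are hypothetical):

```python
import dlt

# Target streaming table, declared separately from its flows.
dlt.create_streaming_table("all_events")

# Two explicit append flows writing into the same target.
@dlt.append_flow(target="all_events", name="events_us")
def events_us():
    return spark.readStream.table("events_us_raw")  # hypothetical source

@dlt.append_flow(target="all_events", name="events_eu")
def events_eu():
    return spark.readStream.table("events_eu_raw")  # hypothetical source
```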

A Materialized view is also a form of Unity Catalog managed table, and is a batch target. A materialized view can have one or more materialized view flows written into it. Materialized views differ from streaming tables in that you always define the flows implicitly as part of the materialized view definition.
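In Python pipeline source, a materialized view is written with ordinary batch DataFrame logic, and the flow is implied by the definition itself. A minimal sketch, assuming the upstream table and columns from the earlier examples:

```python
import dlt
from pyspark.sql import functions as F

# A @dlt.table function that performs a batch read is materialized as a
# materialized view; Enzyme refreshes it incrementally whenever possible.
@dlt.table(name="daily_revenue", comment="Revenue aggregated per day.")
def daily_revenue():
    return (
        spark.read.table("cleaned_orders")  # batch read of an upstream table
        .groupBy("order_date")              # hypothetical columns
        .agg(F.sum("order_total").alias("revenue"))
    )
```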

A sink is a streaming target for DLT and currently supports Delta tables, Apache Kafka topics, and Azure Event Hubs topics. A sink can have one or more streaming flows (Append, Update, Complete) written into it.
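A minimal sketch of a sink and a streaming flow that writes into it, assuming a hypothetical Kafka broker and topic:

```python
import dlt

# Declare a Kafka sink (connection details are placeholders).
dlt.create_sink(
    name="orders_kafka_sink",
    format="kafka",
    options={
        "kafka.bootstrap.servers": "broker:9092",  # hypothetical broker
        "topic": "orders_out",                     # hypothetical topic
    },
)

# An append flow that streams rows from an upstream table into the sink.
@dlt.append_flow(target="orders_kafka_sink", name="orders_to_kafka")
def orders_to_kafka():
    return (
        spark.readStream.table("cleaned_orders")
        .selectExpr(
            "CAST(order_id AS STRING) AS key",  # hypothetical key column
            "to_json(struct(*)) AS value",      # Kafka expects a 'value' column
        )
    )
```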

A pipeline is the unit of development and execution in DLT. A pipeline can contain one or more flows, streaming tables, materialized views, and sinks. You use DLT by defining these objects in your pipeline source code and then running the pipeline. When the pipeline runs, DLT analyzes the dependencies between the defined objects and orchestrates their order of execution and level of parallelism automatically.

DLT for Databricks SQL

DLT provides streaming tables and materialized views as two foundational ETL capabilities in Databricks SQL. You can use standard SQL to create and refresh streaming tables and materialized views in Databricks SQL. They run on the same Databricks infrastructure and have the same processing semantics as they do in a DLT pipeline. When you use streaming tables and materialized views in Databricks SQL, flows are defined implicitly as part of the streaming table or materialized view definition.

More information