Load, transform, and write data with pipelines

The articles in this section provide common patterns, recommendations, and examples for ingesting and transforming data in DLT pipelines and for writing transformed data to external services. The initial datasets created when ingesting source data are commonly called bronze tables and typically involve only simple transformations. By contrast, the final tables in a pipeline, commonly referred to as gold tables, often require complicated aggregations or read from sources that are the targets of an APPLY CHANGES INTO operation.

Load data

You can load data from any data source supported by Apache Spark on Databricks using DLT. For examples of patterns for loading data from different sources, including cloud object storage, message buses like Kafka, and external systems like PostgreSQL, see Load data with DLT. These examples feature recommendations like using streaming tables with Auto Loader in DLT for an optimized ingestion experience.
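For example, a minimal sketch of a streaming table that uses Auto Loader to ingest JSON files from cloud object storage might look like the following (the source path and table name are hypothetical):

```python
import dlt

@dlt.table(comment="Bronze table: raw orders ingested incrementally with Auto Loader.")
def orders_bronze():
    # Auto Loader (cloudFiles) discovers and processes new files incrementally.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/ingest/orders_raw")  # hypothetical source location
    )
```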

Data flows

In DLT, a flow is a streaming query that processes source data incrementally to update a target streaming table. Many streaming queries needed to implement a DLT pipeline create an implicit flow as part of the query definition. DLT also supports explicitly declaring flows when more specialized processing is required. To learn more about DLT flows and see examples of using flows to implement data processing tasks, see Load and process data incrementally with DLT flows.
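For example, a sketch of explicitly declared flows might append records from two hypothetical sources with compatible schemas into a single target streaming table:

```python
import dlt

# Declare the target streaming table, then append to it from more than one source.
dlt.create_streaming_table("all_orders")

@dlt.append_flow(target="all_orders")
def orders_us():
    # Hypothetical region-specific source; each flow appends incrementally to all_orders.
    return spark.readStream.table("orders_us_raw")

@dlt.append_flow(target="all_orders")
def orders_eu():
    return spark.readStream.table("orders_eu_raw")
```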

Change data capture (CDC)

Change data capture (CDC) is a data integration pattern that captures changes made to data in a source system, such as inserts, updates, and deletes. CDC is commonly used to efficiently replicate tables from a source system into Databricks. DLT simplifies CDC with the APPLY CHANGES API. By automatically handling out-of-sequence records, the APPLY CHANGES API in DLT ensures correct processing of CDC records and removes the need to write complex logic to handle them. See What is change data capture (CDC)? and The APPLY CHANGES APIs: Simplify change data capture with DLT.
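For example, a minimal sketch of APPLY CHANGES in the Python API (table names, keys, and the sequencing column are hypothetical) could look like this:

```python
import dlt
from pyspark.sql.functions import col, expr

# Target streaming table that holds the up-to-date rows.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_cdc_bronze",          # hypothetical CDC feed ingested upstream
    keys=["customer_id"],                   # key used to match change records to rows
    sequence_by=col("event_timestamp"),     # orders out-of-sequence changes correctly
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "event_timestamp"],
    stored_as_scd_type=1,                   # keep only the latest state per key
)
```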

Transform data

With DLT, you can declare transformations on datasets and specify how records are processed through query logic. For examples of common transformation patterns when building out DLT pipelines, including usage of streaming tables, materialized views, stream-static joins, and MLflow models in pipelines, see Transform data with pipelines.
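For example, a gold-layer aggregation over a hypothetical upstream dataset might be declared as a sketch like this:

```python
import dlt
from pyspark.sql.functions import sum as sum_

@dlt.table(comment="Gold table: daily revenue per customer.")
def daily_revenue_gold():
    # dlt.read performs a complete (batch) read of another dataset in the pipeline.
    orders = dlt.read("orders_silver")  # hypothetical upstream table
    return (
        orders.groupBy("customer_id", "order_date")
        .agg(sum_("amount").alias("total_revenue"))
    )
```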

Optimize stateful processing in DLT with watermarks

To effectively manage data kept in state, use watermarks when performing stateful stream processing in DLT, including aggregations, joins, and deduplication. In stream processing, a watermark is an Apache Spark feature that defines a time-based threshold for processing data in stateful operations. Incoming data is processed until the threshold is reached, at which point the time window defined by the threshold is closed. Watermarks help avoid problems during query processing, particularly with large datasets or long-running computations.
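As a minimal illustration, a streaming table performing a windowed aggregation might apply a watermark so that state for late-arriving data is eventually dropped (the source and column names are hypothetical):

```python
import dlt
from pyspark.sql.functions import window, count

@dlt.table(comment="Event counts per 10-minute window, with state bounded by a watermark.")
def events_per_window():
    return (
        dlt.read_stream("events_bronze")            # hypothetical streaming source
        .withWatermark("event_time", "10 minutes")  # tolerate data up to 10 minutes late
        .groupBy(window("event_time", "10 minutes"))
        .agg(count("*").alias("event_count"))
    )
```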

For examples and recommendations, see Optimize stateful processing in DLT with watermarks.

Write records to external services with DLT sinks

Preview

The DLT sink API is in Public Preview.

In addition to persisting transformed data to Databricks managed Delta tables in Unity Catalog and the Hive metastore, you can use DLT sinks to persist to external targets, including event streaming services like Apache Kafka or Azure Event Hubs, and external tables managed by Unity Catalog or the Hive metastore. See Stream records to external services with DLT sinks.
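For example, under the preview sink API, writing a gold dataset to a hypothetical Kafka topic might look like this sketch (the broker address, topic, and dataset names are assumptions):

```python
import dlt
from pyspark.sql.functions import to_json, struct

# Declare an external sink for the pipeline.
dlt.create_sink(
    name="orders_kafka_sink",
    format="kafka",
    options={
        "kafka.bootstrap.servers": "broker:9092",  # hypothetical broker
        "topic": "orders_gold",
    },
)

# An append flow writes records from a pipeline dataset to the sink.
@dlt.append_flow(name="orders_to_kafka", target="orders_kafka_sink")
def orders_to_kafka():
    # Kafka expects a value column; serialize each row as JSON.
    return dlt.read_stream("orders_gold").select(
        to_json(struct("*")).alias("value")
    )
```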