What is data transformation on Databricks?
Data transformation is the process of converting, cleansing, and structuring data into a usable format. On Databricks, data transformation typically follows the medallion architecture, incrementally refining raw data into a format consumable by the business.
The following diagram shows a data pipeline containing a series of data transformations. In this example, the raw_customers dataset is turned into the clean_customers dataset by dropping customer records with no customer name. The raw_transactions data is turned into clean_transactions by dropping transactions with a zero dollar value. A resulting dataset called sales_report is created by joining clean_customers and clean_transactions. Analysts can use sales_report for analytics and business intelligence.
This article focuses on defining transformations as they relate to the T in ETL or ELT. The Apache Spark processing model also uses the word transformation in a related way. See Spark transformations and actions.
Types of data transformations
Databricks considers two types of data transformations: declarative and procedural. The data pipeline in the preceding example can be expressed using either paradigm.
Declarative transformations focus on the desired outcome rather than how to achieve it. You specify the logic of the transformation using higher-level abstractions, and DLT determines the most efficient way to execute it.
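As a sketch of the declarative approach, the clean_customers step from the preceding example could be written as a DLT table with an expectation. This assumes a raw_customers source and a customer_name column, as in the diagram above; it is illustrative rather than a definitive implementation.

```python
import dlt

@dlt.table(comment="Customers with a non-empty name")
@dlt.expect_or_drop("has_name", "customer_name IS NOT NULL")
def clean_customers():
    # Declare what the table should contain; DLT determines how to execute it.
    return dlt.read("raw_customers")
```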
Procedural data transformations focus on performing computations through explicit instructions. Those computations define the exact sequence of operations to manipulate the data. The procedural approach provides more control over execution but at the cost of greater complexity and higher maintenance.
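The same step written procedurally with Apache Spark code might look like the following sketch, which spells out each read, filter, and write explicitly. Table and column names follow the earlier example and would need to match your catalog and schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already available

# Each step is an explicit instruction: read, filter, then write.
raw_customers = spark.read.table("raw_customers")
clean_customers = raw_customers.filter("customer_name IS NOT NULL")
clean_customers.write.mode("overwrite").saveAsTable("clean_customers")
```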
Choosing between declarative and procedural data transformation
Declarative data transformation using DLT is best when:
- You require rapid development and deployment.
- Your data pipelines have standard patterns that do not require low-level control over execution.
- You need built-in data quality checks.
- Maintenance and readability are top priorities.
Procedural data transformation using Apache Spark code is best when:
- You are migrating an existing Apache Spark codebase to Databricks.
- You need fine-grained control over execution.
- You need access to low-level APIs such as MERGE or foreachBatch, as shown in the sketch after this list.
- You need to write data to Kafka or external Delta tables.
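As a sketch of the low-level control mentioned in the list above, the following example uses foreachBatch to apply a MERGE for each micro-batch. The table names, join key (customer_id), and checkpoint path are hypothetical.

```python
from delta.tables import DeltaTable

def upsert_to_target(micro_batch_df, batch_id):
    # Apply a MERGE against the target Delta table for every micro-batch.
    target = DeltaTable.forName(micro_batch_df.sparkSession, "clean_customers")
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# `spark` is the active SparkSession (preconfigured on Databricks).
(spark.readStream.table("raw_customers")
    .writeStream
    .foreachBatch(upsert_to_target)
    .option("checkpointLocation", "/tmp/checkpoints/clean_customers")  # hypothetical path
    .start())
```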
What are the differences between streaming and batch processing?
While streaming and batch processing share much of the same syntax on Databricks, each has its own specific semantics.
Batch processing allows you to define explicit instructions to process a fixed amount of static, non-changing data as a single operation.
Stream processing allows you to define a query against an unbounded, continuously growing dataset and then process data in small, incremental batches.
Batch operations on Databricks use Spark SQL or DataFrames, while stream processing leverages Structured Streaming.
You can differentiate batch Apache Spark commands from Structured Streaming by looking at read and write operations, as shown in the following table:
| | Apache Spark | Structured Streaming |
|---|---|---|
| Read | `spark.read.load()` | `spark.readStream.load()` |
| Write | `dataframe.write.save()` | `dataframe.writeStream.start()` |
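A minimal sketch contrasting the two APIs against Delta tables follows; the table names and checkpoint path are illustrative, and `spark` is the active SparkSession.

```python
# Batch: read a fixed snapshot, transform it, and write the result once.
batch_df = spark.read.table("raw_transactions")
batch_df.filter("amount > 0").write.mode("append").saveAsTable("clean_transactions")

# Streaming: define the same query against a continuously growing source.
(spark.readStream.table("raw_transactions")
    .filter("amount > 0")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/clean_transactions")  # illustrative path
    .toTable("clean_transactions"))
```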
Materialized views generally conform to batch processing guarantees: the results returned by a materialized view are always equivalent to batch evaluation of the defining logic, but DLT calculates those results incrementally when possible.
Streaming tables always calculate results incrementally. Because many streaming data sources only retain records for a period of hours or days, the processing model used by streaming tables assumes that each batch of records from a data source is only processed once.
Databricks supports using SQL to write streaming queries in the following use cases:
- Defining streaming tables in Unity Catalog using Databricks SQL.
- Defining source code for DLT pipelines.
You can also declare streaming tables in DLT using Python Structured Streaming code.
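For example, a streaming table can be declared in a DLT pipeline with Python Structured Streaming code. This is a minimal sketch that assumes a raw_transactions source table with an amount column:

```python
import dlt

@dlt.table(comment="Incrementally processed transactions with a non-zero value")
def clean_transactions():
    # readStream makes this a streaming table; each source record is processed once.
    return spark.readStream.table("raw_transactions").filter("amount > 0")
```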
Batch transformations
Batch transformations operate on a well-defined set of data assets at a specific point in time. Batch transformations might be one-time operations, but often are part of scheduled jobs or pipelines that run regularly to keep production systems up to date.
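The following sketch shows a batch transformation that could run as a scheduled job. It builds the sales_report dataset from the example at the start of this article, assuming a hypothetical customer_id join key.

```python
# `spark` is the active SparkSession (preconfigured on Databricks).
customers = spark.read.table("clean_customers")
transactions = spark.read.table("clean_transactions")

# Join the cleaned datasets and overwrite the report on each scheduled run.
sales_report = transactions.join(customers, on="customer_id", how="inner")
sales_report.write.mode("overwrite").saveAsTable("sales_report")
```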
Incremental transformations
Incremental patterns generally assume that the data source is append-only and has a stable schema. The following articles provide details on nuances for incremental transformations on tables that experience updates, deletes, or schema changes:
- The APPLY CHANGES APIs: Simplify change data capture with DLT
- Use Delta Lake change data feed on Databricks
- Update Delta Lake table schema
- Delta table streaming reads and writes
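Beyond the references above, a common incremental pattern is a Structured Streaming read against an append-only Delta table, with a checkpoint tracking which records have already been processed. The names, path, and filter below are illustrative.

```python
# `spark` is the active SparkSession (preconfigured on Databricks).
(spark.readStream
    .table("raw_transactions")          # append-only source with a stable schema
    .filter("amount > 0")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/incremental_clean")  # illustrative path
    .trigger(availableNow=True)         # process all available data incrementally, then stop
    .toTable("clean_transactions"))
```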
Near real-time transformations
Delta Lake excels at providing near real-time access to large amounts of data for all users and applications querying your lakehouse. Because of the overhead of writing files and metadata to cloud object storage, true real-time latency cannot be reached for many workloads that write to Delta Lake sinks.
For lower-latency streaming applications, Databricks recommends choosing source and sink systems designed for real-time workloads, such as Kafka. You can use Databricks to enrich the data, for example with aggregations, joins across streams, and joins between streaming data and slowly changing dimension data stored in the lakehouse.
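For example, a Kafka stream can be enriched with slowly changing dimension data stored in a Delta table using a stream-static join. The broker address, topic, column parsing, and table names below are placeholders for illustration only.

```python
from pyspark.sql import functions as F

# Streaming source: raw events from Kafka (placeholder broker and topic).
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.col("key").cast("string").alias("customer_id"),
            F.col("value").cast("string").alias("payload")))

# Static side: dimension data maintained in the lakehouse.
customers = spark.read.table("clean_customers")

# A stream-static join enriches each event as it arrives.
enriched = events.join(customers, on="customer_id", how="left")

(enriched.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/enriched_events")  # placeholder path
    .toTable("enriched_transactions"))
```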