What is data transformation on Databricks?
Data transformation is the process of converting, cleansing, and structuring data into a usable format. On Databricks, data transformation typically follows the medallion architecture, incrementally refining raw data into a format consumable by the business.
The following diagram shows a data pipeline containing a series of data transformations. In this example, the raw_customers dataset is turned into the clean_customers dataset by dropping customer records with no customer name. The raw_transactions data is turned into clean_transactions by dropping transactions with a zero dollar value. A resulting dataset called sales_report is created by joining clean_customers and clean_transactions. Analysts can use sales_report for analytics and business intelligence.
This article focuses on defining transformations as they relate to the T in ETL or ELT. The Apache Spark processing model also uses the word transformation in a related way. See Spark transformations and actions.
Types of data transformations
Databricks considers two types of data transformations: declarative and procedural. The data pipeline in the preceding example can be expressed using either paradigm.
Declarative transformations focus on the desired outcome rather than how to achieve it. You specify the logic of the transformation using higher-level abstractions, and DLT determines the most efficient way to execute it.
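As a sketch of the declarative approach, the clean_customers step from the preceding example could be written as a DLT table with an expectation. This assumes a raw_customers source and a customer_name column, as in the diagram above; it is illustrative rather than a definitive implementation.

```python
import dlt

@dlt.table(comment="Customers with a non-empty name")
@dlt.expect_or_drop("has_name", "customer_name IS NOT NULL")
def clean_customers():
    # Declare what the table should contain; DLT determines how to execute it.
    return dlt.read("raw_customers")
```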
Procedural data transformations focus on performing computations through explicit instructions. Those computations define the exact sequence of operations to manipulate the data. The procedural approach provides more control over execution but at the cost of greater complexity and higher maintenance.
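The same step written procedurally with Apache Spark code might look like the following sketch, which spells out each read, filter, and write explicitly. Table and column names follow the earlier example and would need to match your catalog and schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already available

# Each step is an explicit instruction: read, filter, then write.
raw_customers = spark.read.table("raw_customers")
clean_customers = raw_customers.filter("customer_name IS NOT NULL")
clean_customers.write.mode("overwrite").saveAsTable("clean_customers")
```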
Choosing between declarative and procedural data transformation
Declarative data transformation using DLT is best when:
- You require rapid development and deployment.
- Your data pipelines have standard patterns that do not require low-level control over execution.
- You need built-in data quality checks.
- Maintenance and readability are top priorities.
Procedural data transformation using Apache Spark code is best when:
- You are migrating an existing Apache Spark codebase to Databricks.
- You need fine-grained control over execution.
- You need access to low-level APIs such as MERGE or foreachBatch, as shown in the sketch after this list.
- You need to write data to Kafka or external Delta tables.
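As a sketch of the low-level control mentioned in the list above, the following example uses foreachBatch to apply a MERGE for each micro-batch. The table names, join key (customer_id), and checkpoint path are hypothetical.

```python
from delta.tables import DeltaTable

def upsert_to_target(micro_batch_df, batch_id):
    # Apply a MERGE against the target Delta table for every micro-batch.
    target = DeltaTable.forName(micro_batch_df.sparkSession, "clean_customers")
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# `spark` is the active SparkSession (preconfigured on Databricks).
(spark.readStream.table("raw_customers")
    .writeStream
    .foreachBatch(upsert_to_target)
    .option("checkpointLocation", "/tmp/checkpoints/clean_customers")  # hypothetical path
    .start())
```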
What are the differences between streaming and batch processing?
While streaming and batch processing share much of the same syntax on Databricks, each has its own specific semantics.
Batch processing allows you to define explicit instructions to process a fixed amount of static, non-changing data as a single operation.
Stream processing allows you to define a query against an unbounded, continuously growing dataset and then process data in small, incremental batches.
Batch operations on Databricks use Spark SQL or DataFrames, while stream processing leverages Structured Streaming.
You can differentiate batch Apache Spark commands from Structured Streaming by looking at read and write operations, as shown in the following table:
| | Apache Spark | Structured Streaming |
|---|---|---|
| Read | `spark.read.load()` | `spark.readStream.load()` |
| Write | `dataframe.write.save()` | `dataframe.writeStream.start()` |
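A minimal sketch contrasting the two APIs against Delta tables follows; the table names and checkpoint path are illustrative, and `spark` is the active SparkSession.

```python
# Batch: read a fixed snapshot, transform it, and write the result once.
batch_df = spark.read.table("raw_transactions")
batch_df.filter("amount > 0").write.mode("append").saveAsTable("clean_transactions")

# Streaming: define the same query against a continuously growing source.
(spark.readStream.table("raw_transactions")
    .filter("amount > 0")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/clean_transactions")  # illustrative path
    .toTable("clean_transactions"))
```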
Materialized views generally conform to batch processing guarantees: the results returned by a materialized view are always equivalent to batch evaluation of the defining logic, but DLT calculates those results incrementally when possible.
Streaming tables always calculate results incrementally. Because many streaming data sources only retain records for a period of hours or days, the processing model used by streaming tables assumes that each batch of records from a data source is only processed once.
Databricks supports using SQL to write streaming queries in the following use cases:
- Defining streaming tables in Unity Catalog using Databricks SQL.
- Defining source code for DLT pipelines.
You can also declare streaming tables in DLT using Python Structured Streaming code.
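For example, a streaming table can be declared in a DLT pipeline with Python Structured Streaming code. This is a minimal sketch that assumes a raw_transactions source table with an amount column:

```python
import dlt

@dlt.table(comment="Incrementally processed transactions with a non-zero value")
def clean_transactions():
    # readStream makes this a streaming table; each source record is processed once.
    return spark.readStream.table("raw_transactions").filter("amount > 0")
```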
Batch transformations
Batch transformations operate on a well-defined set of data assets at a specific point in time. Batch transformations might be one-time operations, but often are part of scheduled jobs or pipelines that run regularly to keep production systems up to date.
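The following sketch shows a batch transformation that could run as a scheduled job. It builds the sales_report dataset from the example at the start of this article, assuming a hypothetical customer_id join key.

```python
# `spark` is the active SparkSession (preconfigured on Databricks).
customers = spark.read.table("clean_customers")
transactions = spark.read.table("clean_transactions")

# Join the cleaned datasets and overwrite the report on each scheduled run.
sales_report = transactions.join(customers, on="customer_id", how="inner")
sales_report.write.mode("overwrite").saveAsTable("sales_report")
```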
Incremental transformations
Incremental patterns generally assume that the data source is append-only and has a stable schema. The following articles provide details on nuances for incremental transformations on tables that experience updates, deletes, or schema changes:
- The APPLY CHANGES APIs: Simplify change data capture with DLT
- Use Delta Lake change data feed on Databricks
- Update Delta Lake table schema
- Delta table streaming reads and writes
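Beyond the references above, a common incremental pattern is a Structured Streaming read against an append-only Delta table, with a checkpoint tracking which records have already been processed. The names, path, and filter below are illustrative.

```python
# `spark` is the active SparkSession (preconfigured on Databricks).
(spark.readStream
    .table("raw_transactions")          # append-only source with a stable schema
    .filter("amount > 0")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/incremental_clean")  # illustrative path
    .trigger(availableNow=True)         # process all available data incrementally, then stop
    .toTable("clean_transactions"))
```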
Near real-time transformations
Delta Lake excels at providing near real-time access to large amounts of data for all users and applications querying your lakehouse. Because of the overhead of writing files and metadata to cloud object storage, true real-time latency cannot be reached for many workloads that write to Delta Lake sinks.
For lower-latency streaming applications, Databricks recommends choosing source and sink systems designed for real-time workloads, such as Kafka. You can use Databricks to enrich the data, for example with aggregations, joins across streams, and joins between streaming data and slowly changing dimension data stored in the lakehouse.
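For example, a Kafka stream can be enriched with slowly changing dimension data stored in a Delta table using a stream-static join. The broker address, topic, column parsing, and table names below are placeholders for illustration only.

```python
from pyspark.sql import functions as F

# Streaming source: raw events from Kafka (placeholder broker and topic).
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.col("key").cast("string").alias("customer_id"),
            F.col("value").cast("string").alias("payload")))

# Static side: dimension data maintained in the lakehouse.
customers = spark.read.table("clean_customers")

# A stream-static join enriches each event as it arrives.
enriched = events.join(customers, on="customer_id", how="left")

(enriched.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/enriched_events")  # placeholder path
    .toTable("enriched_transactions"))
```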