Procedural vs. declarative data processing in Databricks
This article covers the differences between procedural and declarative programming and how each is used in Databricks.
Procedural and declarative programming are two fundamental programming paradigms in computer science. Each represents a different approach to structuring and executing instructions.
- With procedural programming, you specify how tasks should be accomplished by defining explicit sequences of operations.
- Declarative programming focuses on what needs to be achieved, leaving the underlying system to determine the best way to execute the task.
When designing data pipelines, engineers must choose between procedural and declarative data processing models. This decision impacts workflow complexity, maintainability, and efficiency. This page explains the key differences between these models, their respective advantages and challenges, and when to use each approach.
What is procedural data processing?
Procedural data processing follows a structured approach where explicit steps are defined to manipulate data. This model is closely aligned with imperative programming, emphasizing a command sequence that dictates how the data should be processed.
Characteristics of procedural processing
The following are characteristics of procedural processing, illustrated by the sketch after this list:
- Step-by-step execution: The developer explicitly defines the order of operations.
- Use of control structures: Loops, conditionals, and functions manage execution flow.
- Detailed resource control: Enables fine-grained optimizations and manual performance tuning.
- Related concepts: Procedural programming is a subset of imperative programming.
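The following is a minimal sketch in plain Python that illustrates these characteristics. The records and the validity rule are hypothetical; the point is the pattern, in which iteration, branching, and aggregation are all spelled out by the developer:

```python
# Minimal procedural sketch: every step and control structure is explicit.
records = [{"id": 1, "amount": 120.0}, {"id": 2, "amount": -15.0}]

cleaned = []
for record in records:                             # step 1: iterate explicitly
    if record["amount"] < 0:                       # step 2: branch on a validity rule
        continue                                   # drop invalid rows manually
    record["amount"] = round(record["amount"], 2)  # step 3: transform in place
    cleaned.append(record)

total = sum(r["amount"] for r in cleaned)          # step 4: aggregate manually
print(f"{len(cleaned)} valid records, total={total}")
```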
Common use cases for procedural processing
The following are common use cases for procedural processing:
- Custom ETL pipelines requiring procedural logic.
- Low-level performance optimizations in batch and streaming workflows.
- Legacy systems or existing imperative scripts.
Procedural processing with Apache Spark and Databricks Jobs
Apache Spark primarily follows a procedural model for data processing. Use Databricks Jobs to orchestrate explicit execution logic that defines step-by-step transformations and actions on distributed data.
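The following sketch shows this pattern with the PySpark DataFrame API, as it might appear in a Databricks Jobs task. The paths and column names (`amount`, `fx_rate`, `order_date`) are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Step 1: read the source data (the path is a hypothetical example).
orders = spark.read.format("delta").load("/tmp/example/orders")

# Step 2: explicitly filter, derive, and aggregate in a defined order.
valid = orders.filter(F.col("amount") > 0)
enriched = valid.withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))
daily = enriched.groupBy("order_date").agg(F.sum("amount_usd").alias("revenue"))

# Step 3: trigger execution with an action by writing the result.
daily.write.format("delta").mode("overwrite").save("/tmp/example/daily_revenue")
```

Each transformation is declared in an explicit order chosen by the developer, and execution is triggered by the final write action.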
What is declarative data processing?
Declarative data processing abstracts away the how and focuses on defining the desired result. Instead of specifying step-by-step instructions, developers define transformation logic, and the system determines the most efficient execution plan.
Characteristics of declarative processing
The following are characteristics of declarative processing, illustrated by the sketch after this list:
- Abstraction of execution details: Users describe the desired outcome, not the steps to achieve it.
- Automatic optimization: The system applies query planning and execution tuning.
- Reduced complexity: Removes the need for explicit control structures, improving maintainability.
- Related concepts: Declarative programming encompasses paradigms such as functional programming and domain-specific languages like SQL.
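As a contrast to the procedural sketch above, the following example expresses the same result declaratively with Spark SQL. It is a sketch that assumes a table named `orders` is already registered; the column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare the desired result; the engine's optimizer decides predicate
# pushdown, aggregation strategy, and execution order.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount * fx_rate) AS revenue
    FROM orders
    WHERE amount > 0
    GROUP BY order_date
""")
daily_revenue.show()
```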
Common use cases for declarative processing
The following are common use cases for declarative processing:
- SQL-based transformations in batch and streaming workflows.
- High-level data processing frameworks such as Delta Live Tables (DLT).
- Scalable, distributed data workloads requiring automated optimizations.
Declarative processing with DLT
DLT is a declarative framework designed to simplify the creation of reliable and maintainable stream processing pipelines. By specifying what data to ingest and how to transform it, DLT automates key aspects of pipeline management, including orchestration, compute management, monitoring, data quality enforcement, and error handling.
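The following is a minimal sketch of a DLT pipeline definition in Python. The source path, column names, and quality rule are hypothetical, and the code runs inside a DLT pipeline rather than as a standalone script:

```python
import dlt
from pyspark.sql import functions as F

# Declare a raw table; DLT handles orchestration and compute.
# (In a DLT pipeline, `spark` is available as a global.)
@dlt.table(comment="Raw orders ingested from storage (hypothetical path).")
def orders_raw():
    return spark.read.format("delta").load("/tmp/example/orders")

# Declare a cleaned table with a data quality expectation; rows that
# fail the rule are dropped and tracked in pipeline metrics.
@dlt.table(comment="Orders with valid amounts.")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def orders_clean():
    return dlt.read("orders_raw").withColumn("amount", F.round("amount", 2))
```

Each function declares what its table should contain; DLT infers the dependency between `orders_clean` and `orders_raw` and manages orchestration, monitoring, and data quality enforcement.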
Key differences: procedural vs. declarative processing
| Aspect | Procedural processing | Declarative processing |
|---|---|---|
| Control | Full control over execution | Execution handled by the system |
| Complexity | Can be complex and verbose | Generally simpler and more concise |
| Optimization | Requires manual tuning | System handles optimization |
| Flexibility | High, but requires expertise | Lower, but easier to use |
| Use cases | Custom pipelines, performance tuning | SQL queries, managed pipelines |
When to choose procedural or declarative processing
The following table outlines some of the key decision points for procedural and declarative processing:
| Choose procedural processing when... | Choose declarative processing when... |
|---|---|
| Fine-grained control over execution logic is required. | Simplified development and maintenance are priorities. |
| Transformations involve complex business rules that are difficult to express declaratively. | SQL-based transformations or managed workflows eliminate the need for procedural control. |
| Performance optimizations require manual tuning. | Data processing frameworks such as DLT provide built-in optimizations. |