Performance efficiency for the data lakehouse
This article covers architectural principles of the performance efficiency pillar, referring to the ability of a system to adapt to load changes.
Principles of performance efficiency
Use serverless architectures
Serverless architectures do not require customers to operate and maintain computing infrastructure in the cloud. This eliminates the operational overhead of managing that infrastructure and reduces transaction costs, because managed services operate at cloud scale. Serverless services are also available immediately, provide out-of-the-box security, and require minimal configuration or administration.
Design workloads for performance
For repeated workloads, such as data engineering pipelines, performance should never be an afterthought. Data must be:
Efficiently read from object storage.
Efficiently transformed.
Efficiently published for consumption.
In addition, most pipelines and consumption patterns use a chain of systems. To achieve the best possible performance, consider the entire chain end to end and select each component with performance in mind.
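One way to look at a whole chain rather than a single step is to time each stage separately and see where the time goes. The following is a minimal sketch in plain Python, not tied to any particular platform; the function names `time_stages` and `slowest_stage` and the stage names are hypothetical and chosen only for illustration.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# A stage is a (name, function) pair; each function takes the previous
# stage's output and returns its own output.
Stage = Tuple[str, Callable[[Any], Any]]


def time_stages(stages: List[Stage], data: Any) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record wall-clock seconds per stage."""
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def slowest_stage(timings: Dict[str, float]) -> str:
    """Return the name of the stage that took the longest."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    # Hypothetical read / transform / publish chain on in-memory data.
    pipeline = [
        ("read", lambda _: list(range(100_000))),
        ("transform", lambda rows: [r * 2 for r in rows]),
        ("publish", lambda rows: sum(rows)),
    ]
    result, timings = time_stages(pipeline, None)
    print(result, timings, slowest_stage(timings))
```

In a real pipeline the stages would wrap calls to the actual systems in the chain, and the per-stage timings would indicate which component deserves attention first.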
Run performance testing in the scope of development
Every development workload must undergo continuous performance testing. The tests ensure that any change to the code base does not adversely affect the performance of the workload. Establish a regular schedule for running the tests, either as a scheduled event or as part of a continuous integration build pipeline.
Establish performance baselines to determine the current efficiency of the workloads and supporting infrastructure. Measuring performance against these baselines can reveal opportunities for improvement and show whether the application meets business objectives.
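A baseline comparison of this kind can be automated as a simple check in a test run. The sketch below assumes that a workload's duration has been measured several times and that a baseline duration is already recorded; the function name `check_against_baseline` and the 10% tolerance are illustrative assumptions, not recommended values.

```python
import statistics
from typing import Sequence, Tuple


def check_against_baseline(
    samples: Sequence[float],
    baseline_seconds: float,
    tolerance: float = 0.10,  # assumed allowance for normal run-to-run noise
) -> Tuple[bool, float]:
    """Compare the median measured duration with the recorded baseline.

    Returns (passed, median). The check passes when the median is no more
    than `tolerance` (as a fraction) above the baseline.
    """
    if not samples:
        raise ValueError("at least one measurement is required")
    median = statistics.median(samples)
    limit = baseline_seconds * (1 + tolerance)
    return median <= limit, median


if __name__ == "__main__":
    # Hypothetical measurements in seconds from three runs of one workload.
    passed, median = check_against_baseline([10.0, 11.0, 10.5], baseline_seconds=10.0)
    print("within baseline" if passed else "regression", median)
```

Using the median of several runs rather than a single measurement makes the check less sensitive to one unusually slow or fast run.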
Identify bottlenecks that may be affecting performance. These bottlenecks can be caused by code errors or misconfiguration of a service. Typically, bottlenecks get worse as load increases.
Monitor performance
Ensure that resources and services remain accessible, and that performance meets user expectations or workload requirements. Monitoring can help you identify bottlenecks or insufficient resources, optimize configurations, and detect pipeline or workload errors.
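As an illustration of checking performance against a workload requirement, the sketch below computes a nearest-rank percentile over recorded durations and compares it with a threshold. The helper names `percentile` and `breaches_threshold`, the choice of the 95th percentile, and the sample values are all assumptions made for the example; a real monitoring setup would normally rely on the metrics and alerting features of the platform in use.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def breaches_threshold(durations: Sequence[float], threshold_seconds: float, pct: int = 95) -> bool:
    """Return True when the chosen percentile of the durations exceeds the threshold."""
    return percentile(durations, pct) > threshold_seconds


if __name__ == "__main__":
    # Hypothetical query durations in seconds collected over a monitoring window.
    durations = [1.2, 0.9, 1.1, 4.8, 1.0, 1.3, 0.8, 1.1, 1.0, 1.2]
    print(percentile(durations, 95), breaches_threshold(durations, threshold_seconds=3.0))
```

A high percentile is used here instead of the average because a small number of slow requests can go unnoticed in a mean while still affecting users.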