This set of data lakehouse architecture articles provides principles and best practices for the implementation and operation of a lakehouse using Databricks.
The well-architected lakehouse consists of seven pillars that describe different areas of concern for the implementation of a data lakehouse in the cloud:
Data governance
The oversight to ensure that data brings value and supports your business strategy.
Interoperability and usability
The ability of the lakehouse to interact with users and other systems.
Operational excellence
All operations processes that keep the lakehouse running in production.
Security, privacy, compliance
Protect the Databricks application, customer workloads and customer data from threats.
Reliability
The ability of a system to recover from failures and continue to function.
Performance efficiency
The ability of a system to adapt to changes in load.
Cost optimization
Managing costs to maximize the value delivered.
The well-architected lakehouse extends the AWS Well-Architected Framework to the Databricks platform and shares the pillars “Operational Excellence”, “Security” (as “Security, privacy, compliance”), “Reliability”, “Performance Efficiency” and “Cost Optimization”.
For these five pillars, the principles and best practices of the cloud framework still apply to the lakehouse. The well-architected lakehouse extends them with principles and best practices that are specific to the lakehouse and important for building an effective and efficient one.
The pillars “Data Governance” and “Interoperability and Usability” cover concerns specific to the lakehouse.
Data governance encapsulates the policies and practices implemented to securely manage the data assets within an organization. One of the fundamental aspects of a lakehouse is centralized data governance: The lakehouse unifies data warehousing and AI use cases on a single platform. This simplifies the modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science, and machine learning. To simplify data governance, the lakehouse offers a unified governance solution for data, analytics, and AI. By minimizing the copies of your data and moving to a single data processing layer where all your data governance controls can run together, you improve your chances of staying in compliance and detecting a data breach.
Another important tenet of the lakehouse is to provide a great user experience for all the personas that work with it, and to be able to interact with a wide ecosystem of external systems. AWS already has a variety of data tools that perform most tasks a data-driven enterprise might need. However, these tools must be properly assembled to provide all the functionality, with each service offering a different user experience. This approach can lead to high implementation costs and typically does not provide the same user experience as a native lakehouse platform: Users are limited by inconsistencies between tools and a lack of collaboration capabilities, and often have to go through complex processes to gain access to the system and thus to the data.
An integrated lakehouse, on the other hand, provides a consistent user experience across all workloads and therefore increases usability. This reduces training and onboarding costs and improves collaboration between functions. In addition, new features are added automatically over time to further improve the user experience, without the need to invest internal resources and budgets.
A multicloud approach can be a deliberate strategy of a company or the result of mergers and acquisitions or independent business units selecting different cloud providers. In this case, using a multicloud lakehouse results in a unified user experience across all clouds. This reduces the proliferation of systems across the enterprise, which in turn reduces the skill and training requirements of employees involved in data-driven tasks.
Finally, in a networked world with cross-company business processes, systems must work together as seamlessly as possible. The degree of interoperability is a crucial criterion here, and the most recent data, as a core asset of any business, must flow securely between internal systems and those of external partners.