Task 6: Set up your data architecture

Estimated time to complete: 2 hours

The structure, accessibility, and quality of the underlying data stores are key constraints on making useful data-driven decisions. For this reason, it is important to have a well-planned data access strategy for all end users.

Our recommendation is to take advantage of the data storage format provided by Delta Lake. The Delta table format is a widely used standard for enterprise data lakes at massive scale. Built on the foundation of another open source format—Parquet—Delta Lake adds advanced features and capabilities that enable additional robustness, speed, versioning, and data-warehouse-like ACID compliance. This comes on top of the cost benefits of cheap blob storage services.
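The versioning and ACID guarantees mentioned above can be seen in a few lines of PySpark. This is a minimal sketch, assuming a Spark environment with Delta Lake configured (as on Databricks); the table path, schema, and values are illustrative, not prescribed by this guide.

```python
# Sketch: writing and time-traveling a Delta table with PySpark.
# Assumes Delta Lake is configured in the Spark session (true on
# Databricks); the path and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "EMEA", 20.0), (2, "AMER", 5.5)],
    ["order_id", "region", "amount"],
)

# Writes are atomic: concurrent readers see either the old or the
# new table version in full, never a partial write (ACID).
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Delta retains prior versions, so earlier snapshots stay queryable.
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/tmp/delta/orders")
)
v0.show()
```

Because each write creates a new table version, the `versionAsOf` read lets you audit or reproduce results against historical data without maintaining separate copies.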

Databricks has built-in support for Delta Lake, and the latest Databricks Runtimes include performance enhancements for additional speed. See this presentation for a full discussion of Delta Lake and its capabilities:

Making Apache Spark better with Delta Lake

As part of your overall data strategy, data pipelines built on Delta Lake should follow a tiered multi-hop architecture. This is a successive pattern of data cleaning and transformation from raw ingest (bronze level) to semi-processed (silver level) to the most refined, business-ready tables (gold level). You can view a more thorough examination of this approach in this presentation:
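The multi-hop flow above can be sketched in plain Python. A real pipeline would use Spark DataFrames and Delta tables at each tier; dicts and functions stand in here (with hypothetical field names like `order_id` and `region`) purely to make the bronze-to-silver-to-gold progression visible.

```python
# Toy illustration of the bronze -> silver -> gold multi-hop pattern.
# Field names and cleaning rules are invented for the example.

def to_silver(bronze_rows):
    """Silver hop: drop malformed records, normalize types."""
    silver = []
    for row in bronze_rows:
        if row.get("order_id") is None or row.get("amount") is None:
            continue  # discard records that failed raw ingest checks
        silver.append({
            "order_id": int(row["order_id"]),
            "region": str(row.get("region", "unknown")).lower(),
            "amount": float(row["amount"]),
        })
    return silver

def to_gold(silver_rows):
    """Gold hop: aggregate cleaned records into a business-ready summary."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

# Bronze: raw ingest lands as-is, including bad records.
bronze = [
    {"order_id": "1", "region": "EMEA", "amount": "20.0"},
    {"order_id": "2", "region": "AMER", "amount": "5.5"},
    {"order_id": None, "region": "AMER", "amount": "9.9"},  # malformed
]

gold = to_gold(to_silver(bronze))
print(gold)  # {'emea': 20.0, 'amer': 5.5}
```

Keeping each hop as its own table (rather than overwriting in place) means downstream consumers can pick the tier that matches their needs, and bad silver logic can be fixed and replayed from bronze.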

Simplify and Scale Data Engineering Pipelines

When you are done, return to this page and click this button to continue the Getting Started path for data engineers:

Continue onboarding