Skip to main content

Data preparation cheat sheet

High-end BI performance depends on how effectively data is prepared and delivered from the Lakehouse. By adopting architectural patterns, applying a semantic structure, and using targeted optimizations, you can reduce query complexity, improve dashboard responsiveness, and lower compute costs.

The following table summarizes recommended practices, their expected impact, related documentation, and associated action items. This content is intended for data engineers, BI developers, and dashboard authors who design, optimize, and maintain analytics workloads in the Lakehouse.

Data prep

Best practice

Impact

Docs

Action items

Adopt a medallion architecture

Speeds up turning raw data into ready-to-use, reliable data products for easy consumption.

Review and implement medallion layers

Use liquid clustering

Improves query performance with file and data skipping.

Apply to large tables with filter patterns

Use managed tables

Databricks auto-governs and optimizes the storage layer and query performance.

Create managed tables for your data

Use predictive optimization or optimize tables manually

Enables better query performance by optimizing file sizes and layout, deleting old files, and updating statistics.

Enable for production tables or schedule regular optimization and analyze tables after data changes

Model data in a star schema pattern

Makes the data easy to query and consume.

Design fact and dimension tables

Avoid wide data types and high-cardinality columns

Optimizes data model size and memory consumption, and improves query efficiency.

Review data types and cardinality

Declare Primary and Foreign Keys (with RELY)

Optimizes queries by eliminating unnecessary joins and aggregations.

Define keys on fact and dimension tables

Use auto-generated columns

Reduces the need to calculate values at query time.

Identify frequently calculated fields

Use materialized views and persisted tables

Improves performance by pre-aggregating data for the most common and resource-intensive queries.

Create aggregated views for common queries