Data modeling

Data modeling decisions depend on how your organization and workloads use tables, and the model you choose affects query performance, compute costs, and storage costs. This page covers the Databricks behaviors that influence data modeling, for users setting up new tables or authoring ETL workloads.

important

This page applies only to tables backed by Delta Lake, which includes all Unity Catalog managed tables.

You can use Databricks to query other external data sources, including tables registered with Lakehouse Federation. Each external data source has different limitations, semantics, and transactional guarantees. See Query data.

Database management concepts

A lakehouse built with Databricks shares many components and concepts with other enterprise data warehousing systems. Consider the following concepts and features while designing your data model.

Transactions on Databricks

Databricks scopes transactions to individual tables. This means that Databricks does not support multi-table statements (also called multi-statement transactions).

For data modeling workloads, this means that when ingesting a source record requires inserting or updating rows in two or more tables, you must perform multiple independent transactions. Each of these transactions can succeed or fail independently of the others, so downstream queries must tolerate state mismatches caused by failed or delayed transactions.
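As an illustration of that pattern, the following is a minimal sketch in plain Python, not Databricks code. The in-memory dictionaries `orders` and `order_items` and the helpers `upsert` and `run_writes` are hypothetical stand-ins for two tables and their single-table writes. It shows one write committing while the other fails, the mismatch a downstream reader would see, and a retry that is safe because the write is an idempotent upsert.

```python
from typing import Callable, Dict

# Hypothetical in-memory stand-ins for two tables, keyed by primary key.
orders: Dict[int, dict] = {}
order_items: Dict[int, dict] = {}


def upsert(table: Dict[int, dict], key: int, row: dict) -> None:
    """Idempotent write: applying the same row twice leaves the same state."""
    table[key] = row


def run_writes(writes: Dict[str, Callable[[], None]]) -> Dict[str, bool]:
    """Attempt each single-table write on its own and record which succeeded."""
    results = {}
    for name, write in writes.items():
        try:
            write()
            results[name] = True
        except Exception:
            # One table failing does not undo writes already made to others.
            results[name] = False
    return results


def failing_write() -> None:
    """Simulate a write that fails, for example because of a transient error."""
    raise RuntimeError("simulated transient failure")


# First attempt: the write to orders commits, the write to order_items fails.
first = run_writes({
    "orders": lambda: upsert(orders, 1, {"customer": "a"}),
    "order_items": failing_write,
})

# Downstream readers see the mismatch until the failed write is retried.
mismatch_visible = 1 in orders and 1 not in order_items

# Retry only the failed write; the upsert makes a repeat safe.
retry = run_writes({
    "order_items": lambda: upsert(order_items, 1, {"order_id": 1, "sku": "x"}),
})
```

Tracking the outcome per table, rather than per source record, is one way to make it clear which writes still need to be retried.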

Primary and foreign keys on Databricks

Primary and foreign keys are informational and not enforced, so Databricks does not reject rows that violate them. This model is common in many enterprise cloud-based database systems, but differs from many traditional relational database systems. See Constraints on Databricks.
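Because the keys are not enforced, a workload may want to validate them itself. The following is a minimal sketch in plain Python of the two checks involved: duplicate primary key values and foreign key values with no matching parent. The function names and sample values are hypothetical; on Databricks the same checks could be expressed as queries against the tables.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_keys(keys: Iterable) -> List:
    """Return key values that appear more than once (primary key violations)."""
    counts = Counter(keys)
    return sorted(k for k, n in counts.items() if n > 1)


def orphaned_keys(child_keys: Iterable, parent_keys: Iterable) -> List:
    """Return non-null child key values that have no matching parent row."""
    parents = set(parent_keys)
    return sorted({k for k in child_keys if k is not None and k not in parents})


# Hypothetical sample data: customer_id 2 is duplicated, and one order
# references customer_id 9, which does not exist in the parent table.
customer_ids = [1, 2, 2, 3]
order_customer_ids = [1, 3, 9, None]

dupes = duplicate_keys(customer_ids)
orphans = orphaned_keys(order_customer_ids, customer_ids)
```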

Joins on Databricks

Joins can introduce processing bottlenecks in any database design. When processing data on Databricks, the query optimizer tries to optimize the join plan, but it can struggle when a single query must join results from many tables. The optimizer can also fail to skip records in a table when the filter predicate is on a field in another table, which can result in a full table scan.

See Work with joins on Databricks.

note

You can use materialized views to incrementally compute the results for some join operations, but other joins are not compatible with materialized views. See Materialized views.

Working with nested and complex data types

Databricks supports working with semi-structured data sources including JSON, Avro, and Protobuf, and storing complex data as structs, JSON strings, maps, and arrays. See Model semi-structured data.

Normalized data models

Databricks can work well with any data model. If you need to query an existing data model from Databricks or migrate it to Databricks, evaluate performance before rearchitecting your data.

If you are architecting a new lakehouse or adding datasets to an existing environment, Databricks recommends against using a heavily normalized model such as third normal form (3NF).

Models like the star schema or snowflake schema perform well on Databricks, because standard queries require fewer joins and there are fewer keys to keep in sync. In addition, having more data fields in a single table allows the query optimizer to skip large amounts of data using file-level statistics. For more on data skipping, see Data skipping.
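To make the idea of skipping data with file-level statistics concrete, the following is a simplified model in plain Python, not the Delta Lake implementation. It assumes each data file records the minimum and maximum value of one column; the `FileStats` class, the `files_to_scan` function, and the file names are hypothetical. A file can be skipped when its value range cannot overlap the range requested by the filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Hypothetical per-file statistics for a single column."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files: List[FileStats], low: int, high: int) -> List[str]:
    """Keep only files whose [min, max] range can contain values in [low, high]."""
    return [
        f.path
        for f in files
        if f.max_value >= low and f.min_value <= high
    ]


files = [
    FileStats("part-0", 1, 100),
    FileStats("part-1", 101, 200),
    FileStats("part-2", 201, 300),
]

# A filter on the range 150 to 180 only overlaps the statistics of part-1,
# so the other two files do not need to be read.
selected = files_to_scan(files, 150, 180)
```

This kind of pruning only helps when the filtered column is stored in the table being read, which is one reason wider tables can benefit from it.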
