Skip to main content

Use joins in metric views

Joins in metric views support both direct joins from a fact table to dimension tables (star schema) and joins traversing from the fact table to dimension tables, and then to subdimension tables, allowing multi-hop joins across normalized dimension tables (snowflake schemas). This page explains how to define joins in the YAML definition of a metric view.

note

Joined tables cannot include MAP type columns. To learn how to unpack values from MAP type columns, see Explode nested elements from a map or array.

Model star schemas

In a star schema, the source is the fact table and joins with one or more dimension tables using a LEFT OUTER JOIN. Metric views join the fact and dimension tables that are needed for the specific query, based on the selected dimensions and measures.

Specify join columns in a metric view using either an ON clause or a USING clause.

  • ON clause: Uses a boolean expression to define the join condition.
  • USING clause: Lists columns with the same name in both the parent table and the joined table. For first-level joins, the parent is the metric view's source. For nested joins in a snowflake schema, the parent is the immediate upstream join.

The join should follow a many-to-one relationship. In cases of many-to-many, the first matching row from the joined dimension table is selected.

note

YAML 1.1 parsers (such as PyYAML) can misinterpret certain unquoted keys, such as on, off, yes, no, or NO, as boolean values. This can cause join errors.To avoid this issue, wrap these keys in quotes. For example: 'on': source.dim_fk = dim.pk

source: catalog.schema.fact_table

joins:

# The on clause supports a boolean expression
- name: dimension_table_1
source: catalog.schema.dimension_table_1
on: source.dimension_table_1_fk = dimension_table_1.pk

# The using clause supports an array of columns
# found in both of the tables being joined.
- name: dimension_table_2
source: catalog.schema.dimension_table_2
using:
- dimension_table_2_key_a
- dimension_table_2_key_b

dimensions:

# Dimension referencing a join column from dimension_table_1 using dot notation
- name: Dimension table 1 key
expr: dimension_table_1.pk

measures:

# Measure referencing a join column from dimension_table_1
- name: Count of dimension table 1 keys
expr: COUNT(dimension_table_1.pk)

note

The source namespace references columns from the metric view's source, while the join name refers to columns from the joined table. For example, in the join condition source.dimension_table_1_fk = dimension_table_1.pk, source refers to the metric view's source table (fact_table), and dimension_table_1 refers to the joined table. The reference defaults to the join table if no prefix is provided in an on clause.

Model snowflake schema

A snowflake schema extends a star schema by normalizing dimension tables and connecting them to subdimensions. This creates a multi-level join structure that can match the depth of your data model.

note

Snowflake joins require Databricks Runtime compute 17.1 and above.

To define a join that models a snowflake schema:

  1. Create a metric view.
  2. Add first-level (star schema) joins.
  3. Join with other dimension tables.
  4. Expose nested dimensions by adding dimensions in your view.

The following example uses the TPCH dataset to illustrate how to model a snowflake schema. The TPCH dataset can be accessed in the samples catalog in your Databricks workspace.

YAML
source: samples.tpch.orders

joins:
- name: customer
source: samples.tpch.customer
on: source.o_custkey = customer.c_custkey
joins:
- name: nation
source: samples.tpch.nation
on: customer.c_nationkey = nation.n_nationkey
joins:
- name: region
source: samples.tpch.region
on: nation.n_regionkey = region.r_regionkey

dimensions:
- name: clerk
expr: o_clerk
- name: customer
expr: customer # returns the full customer row as a struct
- name: customer_name
expr: customer.c_name
- name: nation
expr: customer.nation
- name: nation_name
expr: customer.nation.n_name