Use joins in metric views
Joins in metric views support both direct joins from a fact table to dimension tables (star schema) and joins traversing from the fact table to dimension tables, and then to subdimension tables, allowing multi-hop joins across normalized dimension tables (snowflake schemas). This page explains how to define joins in the YAML definition of a metric view.
Model star schemas
In a star schema, the source
is the fact table and joins with one or more dimension tables using a LEFT OUTER JOIN
. Metric views join the fact and dimension tables that are needed for the specific query, based on the selected dimensions and measures.
Specify join columns in a metric view using either an ON
clause or a USING
clause.
ON
clause: Uses a boolean expression to define the join condition.USING
clause: Lists columns with the same name in both the parent table and the joined table. For first-level joins, the parent is the metric view's source. For nested joins in a snowflake schema, the parent is the immediate upstream join.
The join should follow a many-to-one relationship. In cases of many-to-many, the first matching row from the joined dimension table is selected.
YAML 1.1 parsers (such as PyYAML) can misinterpret certain unquoted keys, such as on
, off
, yes
, no
, or NO
, as boolean values. This can cause join errors.To avoid this issue, wrap these keys in quotes. For example: 'on': source.dim_fk = dim.pk
source: catalog.schema.fact_table
joins:
# The on clause supports a boolean expression
- name: dimension_table_1
source: catalog.schema.dimension_table_1
on: source.dimension_table_1_fk = dimension_table_1.pk
# The using clause supports an array of columns
# found in both of the tables being joined.
- name: dimension_table_2
source: catalog.schema.dimension_table_2
using:
- dimension_table_2_key_a
- dimension_table_2_key_b
dimensions:
# Dimension referencing a join column from dimension_table_1 using dot notation
- name: Dimension table 1 key
expr: dimension_table_1.pk
measures:
# Measure referencing a join column from dimension_table_1
- name: Count of dimension table 1 keys
expr: COUNT(dimension_table_1.pk)
The source
namespace references columns from the metric view's source, while the join name
refers to columns from the joined table. For example, in the join condition source.dimension_table_1_fk = dimension_table_1.pk
, source
refers to the metric view's source table (fact_table
), and dimension_table_1
refers to the joined table. The reference defaults to the join table if no prefix is provided in an on
clause.
Model snowflake schema
A snowflake schema extends a star schema by normalizing dimension tables and connecting them to subdimensions. This creates a multi-level join structure that can match the depth of your data model.
Snowflake joins require Databricks Runtime compute 17.1 and above.
To define a join that models a snowflake schema:
- Create a metric view.
- Add first-level (star schema) joins.
- Join with other dimension tables.
- Expose nested dimensions by adding dimensions in your view.
The following example uses the TPCH dataset to illustrate how to model a snowflake schema. The TPCH dataset can be accessed in the samples
catalog in your Databricks workspace.
source: samples.tpch.orders
joins:
- name: customer
source: samples.tpch.customer
on: source.o_custkey = customer.c_custkey
joins:
- name: nation
source: samples.tpch.nation
on: customer.c_nationkey = nation.n_nationkey
joins:
- name: region
source: samples.tpch.region
on: nation.n_regionkey = region.r_regionkey
dimensions:
- name: clerk
expr: o_clerk
- name: customer
expr: customer # returns the full customer row as a struct
- name: customer_name
expr: customer.c_name
- name: nation
expr: customer.nation
- name: nation_name
expr: customer.nation.n_name