Skip to main content

Joins in metric views

This page describes how to use joins in metric views to enrich your source data with attributes from related tables.

Joins in metric views support direct joins from a fact table to dimension tables (star schema), multi-hop joins across normalized dimension tables (snowflake schema), and one-to-many joins that aggregate facts from related tables. By default, all joins are many-to-one, meaning each source row matches at most one row in the joined table.

Star schema joins

In a star schema, the source is the fact table and joins with one or more dimension tables using a LEFT OUTER JOIN. Metric views join the fact and dimension tables needed for the specific query, based on the selected fields and measures.

Specify join columns using either an on clause (Boolean expression) or a using clause (shared column names). The join should follow a many-to-one relationship. In cases of many-to-many, the first matching row from the joined dimension table is selected.

The following example joins the orders fact table to the customer dimension table with an on clause, which takes a Boolean expression:

YAML
version: 1.1
source: samples.tpch.orders

joins:
# The on clause supports a Boolean expression
- name: customer
source: samples.tpch.customer
on: source.o_custkey = customer.c_custkey

fields:
# Field referencing a join column using dot notation
- name: Customer name
expr: customer.c_name
- name: Customer market segment
expr: customer.c_mktsegment

measures:
# Measure referencing a join column
- name: Total revenue
expr: SUM(o_totalprice)
- name: Order count
expr: COUNT(1)

When the join columns have the same name in both tables, use a using clause instead of an on clause. The using clause takes an array of column names that exist in both the source and the joined table. No dataset in the samples catalog has tables that share a join column name, so the following example uses placeholder table and column names to illustrate the syntax:

YAML
joins:
- name: customer
source: catalog.schema.customer
using:
- customer_id
note

In an on clause, source refers to the metric view's source table and the join name refers to columns from the joined table. For example, source.o_custkey = customer.c_custkey joins the source table's o_custkey column to the customer table's c_custkey column. If no prefix is provided, the reference defaults to the joined table.

Snowflake schema joins

A snowflake schema extends a star schema by normalizing dimension tables and connecting them to subdimensions. This creates a multi-level join structure.

To define a snowflake schema:

  1. Create a metric view.
  2. Add first-level (star schema) joins.
  3. Join with other dimension tables.
  4. Expose nested attributes by adding fields in your view.

The following example uses the TPC-H dataset to illustrate a snowflake schema showing the geographic hierarchy of orders. The example joins the orders table to customers, then to their nations (countries), and finally to their regions (continents). The TPC-H dataset is available in the samples catalog in your Databricks workspace.

YAML
source: samples.tpch.orders

joins:
- name: customer
source: samples.tpch.customer
on: source.o_custkey = customer.c_custkey
joins:
- name: nation
source: samples.tpch.nation
on: customer.c_nationkey = nation.n_nationkey
joins:
- name: region
source: samples.tpch.region
on: nation.n_regionkey = region.r_regionkey

fields:
- name: clerk
expr: o_clerk
- name: customer
expr: customer
comment: returns the full customer row as a struct
- name: customer_name
expr: customer.c_name
- name: nation
expr: customer.nation
- name: nation_name
expr: customer.nation.n_name

Join cardinality

The cardinality field on a join controls the relationship between the source table and the joined table. This field determines how the engine processes measures that reference columns from the joined table.

The following table compares the two supported cardinalities:

Property

many_to_one (default)

one_to_many

Rows matched per source row

At most one

Zero or more

Typical use

Dimension lookup

Fact expansion

Allowed in fields

Yes

No

Allowed in measures

Yes

Yes

Many-to-one joins

Many-to-one is the default cardinality. Each row in the source matches at most one row in the joined table, so the joined table acts as a dimension lookup. You can omit the cardinality field for many-to-one joins, or state cardinality: many_to_one explicitly.

Both fields and measures can reference columns from a many-to-one join using dot notation (for example, customer.c_name).

Declare join constraints with rely

Setting rely.at_most_one_match: true declares that the join has no fan-out on the "one" side:

  • On a many-to-one join, each source row matches at most one row in the joined table.
  • On a one-to-many join, each joined row matches at most one source row.

This declaration allows the engine to skip unnecessary joins and reduce data scanned, especially for queries that filter on fields from the joined table. Databricks recommends setting rely on both cardinalities when the constraint holds.

warning

Set at_most_one_match: true only when the relationship genuinely holds. This property is not validated at runtime. If the side asserted to be unique produces a fan-out, measures return incorrect results.

The following example joins orders to customer with rely enabled:

YAML
version: 1.1
source: samples.tpch.orders

joins:
- name: customer
source: samples.tpch.customer
on: source.o_custkey = customer.c_custkey
rely:
at_most_one_match: true

fields:
- name: Customer name
expr: customer.c_name
- name: Customer market segment
expr: customer.c_mktsegment

measures:
- name: Total revenue
expr: SUM(o_totalprice)
- name: Order count
expr: COUNT(1)

See Optimize joins with rely for the complete rely field reference.

One-to-many joins

Set cardinality: one_to_many to allow a single source row to match multiple rows in the joined table. This turns that table into a fact source that the engine aggregates independently at the source grain.

note

One-to-many joins require Databricks Runtime 18.1 or later and YAML specification version 1.1. See Metric view feature availability.

A one-to-many join allows a single metric view to measure facts that live at different grains, such as orders per customer or events per account, without duplicating the source rows in query results. The source acts as the dimensional spine: each entity appears exactly once, regardless of how many matching rows exist in the joined table.

One-to-many join example

The following example uses customer as the source and joins orders with cardinality: one_to_many. A many_to_one join to nation supplies the nation_name field. Qualify the source side of each join condition with source. so the reference resolves to the metric view's source table. Both joins set rely.at_most_one_match: true: on the nation join it asserts each customer has at most one nation, and on the orders join it asserts each order belongs to at most one customer. See Declare join constraints with rely.

YAML
version: 1.1
source: samples.tpch.customer

joins:
- name: nation
source: samples.tpch.nation
on: nation.n_nationkey = source.c_nationkey
rely:
at_most_one_match: true
- name: orders
source: samples.tpch.orders
on: orders.o_custkey = source.c_custkey
cardinality: one_to_many
rely:
at_most_one_match: true

fields:
- name: customer_name
expr: c_name
- name: nation_name
expr: nation.n_name

measures:
- name: customer_count
expr: count(*)
- name: order_count
expr: count(orders.o_orderkey)
- name: total_order_revenue
expr: sum(orders.o_totalprice)

In this view, customer_count counts rows in the source customer table, while order_count and total_order_revenue aggregate rows from the orders branch. A customer with two orders returns an order_count of 2 while customer_count stays 1, which confirms that the source rows are not duplicated. A customer with no orders still appears in results, with an order_count of 0 and a NULL total_order_revenue.

Nested one-to-many joins

To measure facts that are two or more levels below the source, nest one-to-many joins. All joins in a one-to-many subtree must share the same cardinality, so a one-to-many parent can't have a many-to-one child. Reference a column in a nested join with its full dot-path through the join names.

The following example nests lineitem under orders so that a single customer-grain view can count both orders and line items:

YAML
version: 1.1
source: samples.tpch.customer

joins:
- name: orders
source: samples.tpch.orders
on: orders.o_custkey = source.c_custkey
cardinality: one_to_many
joins:
- name: lineitem
source: samples.tpch.lineitem
on: lineitem.l_orderkey = orders.o_orderkey
cardinality: one_to_many

fields:
- name: customer_name
expr: c_name

measures:
- name: order_count
expr: count(distinct orders.o_orderkey)
- name: line_item_count
expr: count(orders.lineitem.l_linenumber)
- name: total_line_revenue
expr: sum(orders.lineitem.l_extendedprice)

The measures reference nested columns with their full dot-path through the join names, such as orders.lineitem.l_extendedprice, because lineitem is reachable only through orders. Use count(distinct orders.o_orderkey) rather than a plain count for the order count: each order fans out into multiple line items, so a plain count would count an order once per line item.

Sibling one-to-many joins

Define multiple one-to-many joins at the same level to measure independent fact sources from a single view. Sibling joins are aggregated separately and then blended, so their rows never cross-multiply. Top-level siblings can mix cardinalities freely, so a many_to_one dimension join and a one_to_many fact join can coexist at the same level.

The following example uses nation as the source and adds two independent one-to-many branches, customer and supplier:

YAML
version: 1.1
source: samples.tpch.nation

joins:
- name: customer
source: samples.tpch.customer
on: customer.c_nationkey = source.n_nationkey
cardinality: one_to_many
- name: supplier
source: samples.tpch.supplier
on: supplier.s_nationkey = source.n_nationkey
cardinality: one_to_many

fields:
- name: nation_name
expr: n_name

measures:
- name: customer_count
expr: count(customer.c_custkey)
- name: supplier_count
expr: count(supplier.s_suppkey)
- name: customers_per_supplier
expr: count(customer.c_custkey) / count(supplier.s_suppkey)

The customers_per_supplier measure divides two independent aggregations after the engine blends each one to the query grain. You can combine measures from different sources with arithmetic, but a single aggregation function must reference columns from only one source.

Connect multiple fact tables with a bridge table

A metric view models a single fact table joined to dimension tables. To combine measures from two or more fact tables that sit at different grains, define a bridge that enumerates the valid combinations of the dimensions the facts share, directly in the metric view's source. For example, the samples.tpch shipping fact lineitem (grain: order line) and the supply fact partsupp (grain: part and supplier) both share the part and supplier dimensions.

A bridge makes the set of valid dimension combinations explicit, so query results stay predictable. The metric view returns only the combinations you declare valid rather than inferring them for each query. Set cardinality: one_to_many on each fact join so the engine aggregates each fact independently against the shared bridge without fanning out and double-counting.

To build the bridge, define it as a SQL query in the metric view source, join each fact table to it on its shared columns, then declare fields on the shared dimension columns and measures on each fact. Use a CROSS JOIN when every combination of the shared dimensions is valid:

YAML
version: 1.1
source: SELECT * FROM samples.tpch.part CROSS JOIN samples.tpch.supplier
filter: s_suppkey IN (11315, 42920) AND p_partkey IN (30419, 80418)

joins:
- name: lineitem
source: samples.tpch.lineitem
on: source.p_partkey = lineitem.l_partkey AND source.s_suppkey = lineitem.l_suppkey
cardinality: one_to_many
- name: partsupp
source: samples.tpch.partsupp
on: source.p_partkey = partsupp.ps_partkey AND source.s_suppkey = partsupp.ps_suppkey
cardinality: one_to_many

fields:
- name: part_name
expr: p_name
- name: part_brand
expr: p_brand
- name: part_type
expr: p_type
- name: part_size
expr: p_size
- name: manufacturer
expr: p_mfgr
- name: supplier_name
expr: s_name

measures:
- name: lineitem_count
expr: COUNT(lineitem.*)
- name: total_quantity_sold
expr: SUM(lineitem.l_quantity)
- name: gross_revenue
expr: SUM(lineitem.l_extendedprice)
- name: net_revenue
expr: SUM(lineitem.l_extendedprice * (1 - lineitem.l_discount))
- name: distinct_orders
expr: COUNT(DISTINCT lineitem.l_orderkey)
- name: available_quantity
expr: SUM(partsupp.ps_availqty)
- name: avg_supply_cost
expr: AVG(partsupp.ps_supplycost)
- name: total_supply_value
expr: SUM(partsupp.ps_availqty * partsupp.ps_supplycost)

A measure over a fact table counts only the records whose shared-dimension values appear in the bridge. Combinations that the bridge doesn't include don't contribute to the results.

When you only want the combinations that actually occur, swap the source for a UNION (or FULL OUTER JOIN) of the distinct pairs from each fact so that each fact contributes its unique members. The joins, fields, and measures stay the same:

YAML
source: |
SELECT DISTINCT l_partkey AS p_partkey, l_suppkey AS s_suppkey FROM samples.tpch.lineitem
UNION
SELECT DISTINCT ps_partkey AS p_partkey, ps_suppkey AS s_suppkey FROM samples.tpch.partsupp

One-to-many join restrictions

  • Fields can't reference a one-to-many join: A field must resolve to exactly one value per source row. Because a one-to-many column can have multiple values per source row, you can't use it in a fields definition. To use such a column as a field, make that table the source and join the original source as a many_to_one join instead.
  • A single aggregation can't span sources: Each aggregation function must reference columns from one source. Arithmetic between the results of two aggregations is allowed, such as count(orders.o_orderkey) / count(*), but a single function can't combine columns from two sources.
  • A join subtree can't mix cardinalities: All descendants of a one-to-many join must also be one-to-many, and all descendants of a many-to-one join must be many-to-one. Only top-level siblings can mix cardinalities.

Next steps