メインコンテンツまでスキップ

NEAREST BY clause

Applies to: check marked yes Databricks Runtime 18.3 and above

Extends a JOIN with top-k ranking on a custom distance or similarity expression. For each row in the query (left) table_reference, it finds up to top num_results matching rows from the target (right) table based on the ranking_expression, returning them as concatenated rows.

ranking_expression can be any orderable scalar expression that scores a pair of rows from the two tables — for example vector_cosine_similarity, vector_l2_distance, vector_inner_product, or a composite expression that combines several functions.

Syntax

{ INNER | LEFT [ OUTER ] } JOIN target_table_reference
{ APPROX | EXACT } NEAREST [ num_results ]
BY { DISTANCE | SIMILARITY } ranking_expression

Parameters

  • target_table_reference

    The target table to search. Can be a table, subquery, or CTE.

  • { INNER | LEFT [ OUTER ] }

    Optional. The join type. The default is INNER.

    • INNER drops query rows that have no matching candidates.
    • LEFT OUTER returns every query row. Target-side columns are NULL when no candidates exist — for example, when the target table is empty or every candidate is NULL. If fewer than num_results candidates exist for a query row, only the available candidates are returned.

    Other join types (RIGHT, FULL, SEMI, ANTI, CROSS, NATURAL) raise NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE.

  • { APPROX | EXACT }

    Controls the result-set contract.

    • EXACT returns the exact top-k rows under ranking_expression.
    • APPROX returns a top-k set that approximates the exact ranking. The optimizer may use faster, approximate search strategies instead of evaluating every candidate.
  • NEAREST [ num_results ]

    Optional positive integer literal. Defaults to 1. Must be in the range [1, 100000]. If the target table has fewer matching rows than num_results, only the available rows are returned.

    Values outside the range raise NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE.

  • BY DISTANCE | SIMILARITY

    Sets the ordering of ranking_expression.

    • DISTANCE ranks rows by smallest value first (nearest = lowest distance).
    • SIMILARITY ranks rows by largest value first (nearest = highest similarity).
  • ranking_expression

    A scalar expression that can reference columns from both tables.

    Common choices are:

    If this expression returns a data type that does not support ordering, such as MAP, Databricks raises DATATYPE_MISMATCH.INVALID_ORDERING_TYPE.

Notes

Asymmetry

NEAREST BY is not commutative. The query side anchors the result — each query row produces up to num_results output rows:

  • When 100 rows from table users join with 1,000 rows from table products with NEAREST 5, the join returns up to 500 rows.
  • If you switch the two sides of the join to join products with users, it returns up to 5,000 rows.

Swapping the two sides asks a different question, so the result differs even for INNER JOIN.

Streaming

NEAREST BY is not supported on streaming DataFrames or Datasets. Queries against streaming sources raise NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED.

Embedding inputs

When using vector scoring functions, both vector arguments must be ARRAY<FLOAT> with the same dimensionality. See vector_cosine_similarity function for type and NULL handling rules.

To compute embeddings from string values, use ai_query with a Databricks-hosted embedding model such as databricks-gte-large-en.

Common error conditions

Examples

The following examples use these tables. Embeddings are shown as 3-dimensional vectors for brevity; in practice they are higher-dimensional and computed by an embedding model.

SQL
> CREATE TEMP VIEW users(user_id, name, embedding) AS
VALUES
(1, 'Alice', ARRAY(1.0f, 0.0f, 0.0f)),
(2, 'Bob', ARRAY(0.0f, 1.0f, 0.0f)),
(3, 'Carol', ARRAY(0.0f, 0.0f, 0.0f));

> CREATE TEMP VIEW products(product_id, name, price, country, embedding) AS
VALUES
('P1', 'Trail running shoes', 120, 'EU', ARRAY(0.9f, 0.1f, 0.1f)),
('P2', 'Hiking boots', 180, 'EU', ARRAY(0.8f, 0.2f, 0.0f)),
('P3', 'Office shoes', 95, 'US', ARRAY(0.1f, 0.9f, 0.1f)),
('P4', 'Sandals', 45, 'US', ARRAY(0.0f, 0.8f, 0.2f)),
('P5', 'Running shoes', 110, 'EU', ARRAY(0.5f, 0.5f, 0.0f));
SQL
-- Ad-hoc vector search with an explicit query vector.
> SELECT t.product_id, t.name
FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
INNER JOIN products t
APPROX NEAREST 3 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
product_id name
---------- -------------------
P1 Trail running shoes
P2 Hiking boots
P5 Running shoes

-- Batch recommendations: for every user, return the 2 nearest products.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
FROM users q
INNER JOIN products t
APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
user_id name product_id product
------- ----- ---------- -------------------
1 Alice P1 Trail running shoes
1 Alice P2 Hiking boots
2 Bob P3 Office shoes
2 Bob P4 Sandals

-- Pre-filter the target table via a subquery (EU products only).
> SELECT q.user_id, q.name, t.product_id, t.name AS product, t.price
FROM users q
INNER JOIN (SELECT * FROM products WHERE country = 'EU') AS t
APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
user_id name product_id product price
------- ----- ---------- ------------------- -----
1 Alice P1 Trail running shoes 120
1 Alice P2 Hiking boots 180
2 Bob P5 Running shoes 110
2 Bob P2 Hiking boots 180

-- LEFT OUTER returns every query row. Carol's embedding has zero magnitude,
-- so vector_cosine_similarity returns NULL for all comparisons and her row
-- is preserved with NULL target columns.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
FROM users q
LEFT OUTER JOIN products t
APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
user_id name product_id product
------- ----- ---------- -------------------
1 Alice P1 Trail running shoes
1 Alice P2 Hiking boots
2 Bob P3 Office shoes
2 Bob P4 Sandals
3 Carol NULL NULL

-- EXACT returns the exact top-k under the ranking expression.
> SELECT t.product_id, t.name
FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
INNER JOIN products t
EXACT NEAREST 3 BY DISTANCE vector_l2_distance(q.embedding, t.embedding);
product_id name
---------- -------------------
P1 Trail running shoes
P2 Hiking boots
P5 Running shoes