NEAREST BY clause
Applies to: Databricks Runtime 18.3 and above
Extends a JOIN with top-k ranking on a custom distance or similarity expression. For each row in the query (left) table_reference, it finds up to top num_results matching rows from the target (right) table based on the ranking_expression, returning them as concatenated rows.
ranking_expression can be any orderable scalar expression that scores a pair of rows from the two tables — for example vector_cosine_similarity, vector_l2_distance, vector_inner_product, or a composite expression that combines several functions.
Syntax
{ INNER | LEFT [ OUTER ] } JOIN target_table_reference
{ APPROX | EXACT } NEAREST [ num_results ]
BY { DISTANCE | SIMILARITY } ranking_expression
Parameters
-
The target table to search. Can be a table, subquery, or CTE.
-
{ INNER | LEFT [ OUTER ] }Optional. The join type. The default is
INNER.INNERdrops query rows that have no matching candidates.LEFT OUTERreturns every query row. Target-side columns areNULLwhen no candidates exist — for example, when the target table is empty or every candidate isNULL. If fewer thannum_resultscandidates exist for a query row, only the available candidates are returned.
Other join types (
RIGHT,FULL,SEMI,ANTI,CROSS,NATURAL) raiseNEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE. -
{ APPROX | EXACT }Controls the result-set contract.
EXACTreturns the exact top-k rows underranking_expression.APPROXreturns a top-k set that approximates the exact ranking. The optimizer may use faster, approximate search strategies instead of evaluating every candidate.
-
NEAREST [ num_results ]
Optional positive integer literal. Defaults to
1. Must be in the range[1, 100000]. If the target table has fewer matching rows thannum_results, only the available rows are returned.Values outside the range raise
NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE. -
BY DISTANCE | SIMILARITY
Sets the ordering of
ranking_expression.DISTANCEranks rows by smallest value first (nearest = lowest distance).SIMILARITYranks rows by largest value first (nearest = highest similarity).
-
ranking_expression
A scalar expression that can reference columns from both tables.
Common choices are:
- similarity functions like vector_cosine_similarity and vector_inner_product,
- distance functions like vector_l2_distance,
- numeric distances like the Manhattan distance:
vector_norm(zip_with(a.col, b.col, (x, y) -> x - y), 1.0f).
If this expression returns a data type that does not support ordering, such as
MAP, Databricks raises DATATYPE_MISMATCH.INVALID_ORDERING_TYPE.
Notes
Asymmetry
NEAREST BY is not commutative. The query side anchors the result — each query row produces up to num_results output rows:
- When 100 rows from table
usersjoin with 1,000 rows from tableproductswithNEAREST 5, the join returns up to 500 rows. - If you switch the two sides of the join to join
productswithusers, it returns up to 5,000 rows.
Swapping the two sides asks a different question, so the result differs even for INNER JOIN.
Streaming
NEAREST BY is not supported on streaming DataFrames or Datasets. Queries against streaming sources raise NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED.
Embedding inputs
When using vector scoring functions, both vector arguments must be ARRAY<FLOAT> with the same dimensionality. See vector_cosine_similarity function for type and NULL handling rules.
To compute embeddings from string values, use ai_query with a Databricks-hosted embedding model such as databricks-gte-large-en.
Common error conditions
- DATATYPE_MISMATCH.INVALID_ORDERING_TYPE
NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGENEAREST_BY_JOIN.STREAMING_NOT_SUPPORTEDNEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE
Examples
The following examples use these tables. Embeddings are shown as 3-dimensional vectors for brevity; in practice they are higher-dimensional and computed by an embedding model.
> CREATE TEMP VIEW users(user_id, name, embedding) AS
VALUES
(1, 'Alice', ARRAY(1.0f, 0.0f, 0.0f)),
(2, 'Bob', ARRAY(0.0f, 1.0f, 0.0f)),
(3, 'Carol', ARRAY(0.0f, 0.0f, 0.0f));
> CREATE TEMP VIEW products(product_id, name, price, country, embedding) AS
VALUES
('P1', 'Trail running shoes', 120, 'EU', ARRAY(0.9f, 0.1f, 0.1f)),
('P2', 'Hiking boots', 180, 'EU', ARRAY(0.8f, 0.2f, 0.0f)),
('P3', 'Office shoes', 95, 'US', ARRAY(0.1f, 0.9f, 0.1f)),
('P4', 'Sandals', 45, 'US', ARRAY(0.0f, 0.8f, 0.2f)),
('P5', 'Running shoes', 110, 'EU', ARRAY(0.5f, 0.5f, 0.0f));
-- Ad-hoc vector search with an explicit query vector.
> SELECT t.product_id, t.name
FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
INNER JOIN products t
APPROX NEAREST 3 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
product_id name
---------- -------------------
P1 Trail running shoes
P2 Hiking boots
P5 Running shoes
-- Batch recommendations: for every user, return the 2 nearest products.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
FROM users q
INNER JOIN products t
APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
user_id name product_id product
------- ----- ---------- -------------------
1 Alice P1 Trail running shoes
1 Alice P2 Hiking boots
2 Bob P3 Office shoes
2 Bob P4 Sandals
-- Pre-filter the target table via a subquery (EU products only).
> SELECT q.user_id, q.name, t.product_id, t.name AS product, t.price
FROM users q
INNER JOIN (SELECT * FROM products WHERE country = 'EU') AS t
APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
user_id name product_id product price
------- ----- ---------- ------------------- -----
1 Alice P1 Trail running shoes 120
1 Alice P2 Hiking boots 180
2 Bob P5 Running shoes 110
2 Bob P2 Hiking boots 180
-- LEFT OUTER returns every query row. Carol's embedding has zero magnitude,
-- so vector_cosine_similarity returns NULL for all comparisons and her row
-- is preserved with NULL target columns.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
FROM users q
LEFT OUTER JOIN products t
APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
user_id name product_id product
------- ----- ---------- -------------------
1 Alice P1 Trail running shoes
1 Alice P2 Hiking boots
2 Bob P3 Office shoes
2 Bob P4 Sandals
3 Carol NULL NULL
-- EXACT returns the exact top-k under the ranking expression.
> SELECT t.product_id, t.name
FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
INNER JOIN products t
EXACT NEAREST 3 BY DISTANCE vector_l2_distance(q.embedding, t.embedding);
product_id name
---------- -------------------
P1 Trail running shoes
P2 Hiking boots
P5 Running shoes