NEAREST BY clause

Applies to: Databricks Runtime 18 LTS and above

注記

Databricks Runtime 18 is newer than Databricks Runtime 18.0, 18.1, and 18.2. Features that would previously have shipped as a later numbered version now ship as dated updates to Databricks Runtime 18 instead. For details, see About unified release notes.

Extends a JOIN with top-k ranking on a custom distance or similarity expression. For each row in the query (left) table_reference, it finds up to top num_results matching rows from the target (right) table based on the ranking_expression, returning them as concatenated rows.

ranking_expression can be any orderable scalar expression that scores a pair of rows from the two tables, for example vector_cosine_similarity, vector_l2_distance, vector_inner_product, or a composite expression that combines several functions.

Syntax

{ INNER | LEFT [ OUTER ] } JOIN target_table_reference
  { APPROX | EXACT } NEAREST [ num_results ]
  BY { DISTANCE | SIMILARITY } ranking_expression

Parameters

target_table_reference

The target table to search. Can be a table, subquery, or CTE.
{ INNER | LEFT [ OUTER ] }

Optional. The join type. The default is INNER.
- INNER drops query rows that have no matching candidates.
- LEFT OUTER returns every query row. Target-side columns are NULL when no candidates exist, for example, when the target table is empty or every candidate is NULL. If fewer than num_results candidates exist for a query row, only the available candidates are returned.
Other join types (RIGHT, FULL, SEMI, ANTI, CROSS, NATURAL) raise NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE.
{ APPROX | EXACT }

Controls the result-set contract.
- EXACT returns the exact top-k rows under ranking_expression.
- APPROX returns a top-k set that approximates the exact ranking. The optimizer may use faster, approximate search strategies instead of evaluating every candidate.
NEAREST [ num_results ]

Optional positive integer literal. Defaults to 1. Must be in the range [1, 100000]. If the target table has fewer matching rows than num_results, only the available rows are returned.

Values outside the range raise NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE.
BY DISTANCE | SIMILARITY

Sets the ordering of ranking_expression.
- DISTANCE ranks rows by smallest value first (nearest = lowest distance).
- SIMILARITY ranks rows by largest value first (nearest = highest similarity).
ranking_expression

A scalar expression that can reference columns from both tables.

Common choices are:
- similarity functions like vector_cosine_similarity and vector_inner_product,
- distance functions like vector_l2_distance,
- numeric distances like the Manhattan distance: vector_norm(zip_with(a.col, b.col, (x, y) -> x - y), 1.0f).
If this expression returns a data type that does not support ordering, such as MAP, Databricks raises DATATYPE_MISMATCH.INVALID_ORDERING_TYPE.

Notes

Asymmetry

NEAREST BY is not commutative. The query side anchors the result. Each query row produces up to num_results output rows:

When 100 rows from table users join with 1,000 rows from table products with NEAREST 5, the join returns up to 500 rows.
If you switch the two sides of the join to join products with users, it returns up to 5,000 rows.

Swapping the two sides asks a different question, so the result differs even for INNER JOIN.

Streaming

NEAREST BY is not supported on streaming DataFrames or Datasets. Queries against streaming sources raise NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED.

Embedding inputs

When using vector scoring functions, both vector arguments must be ARRAY<FLOAT> with the same dimensionality. See vector_cosine_similarity function for type and NULL handling rules.

To compute embeddings from string values, use ai_query with a Databricks-hosted embedding model such as databricks-gte-large-en.

Common error conditions

DATATYPE_MISMATCH.INVALID_ORDERING_TYPE
NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE
NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED
NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE

Examples

The following examples use these tables. Embeddings are shown as 3-dimensional vectors for brevity; in practice they are higher-dimensional and computed by an embedding model.

SQL
> CREATE TEMP VIEW users(user_id, name, embedding) AS
    VALUES
      (1, 'Alice', ARRAY(1.0f, 0.0f, 0.0f)),
      (2, 'Bob',   ARRAY(0.0f, 1.0f, 0.0f)),
      (3, 'Carol', ARRAY(0.0f, 0.0f, 0.0f));

> CREATE TEMP VIEW products(product_id, name, price, country, embedding) AS
    VALUES
      ('P1', 'Trail running shoes', 120, 'EU', ARRAY(0.9f, 0.1f, 0.1f)),
      ('P2', 'Hiking boots',        180, 'EU', ARRAY(0.8f, 0.2f, 0.0f)),
      ('P3', 'Office shoes',         95, 'US', ARRAY(0.1f, 0.9f, 0.1f)),
      ('P4', 'Sandals',              45, 'US', ARRAY(0.0f, 0.8f, 0.2f)),
      ('P5', 'Running shoes',       110, 'EU', ARRAY(0.5f, 0.5f, 0.0f));

SQL
-- Ad-hoc vector search with an explicit query vector.
> SELECT t.product_id, t.name
    FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
    INNER JOIN products t
      APPROX NEAREST 3 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 product_id  name
 ----------  -------------------
 P1          Trail running shoes
 P2          Hiking boots
 P5          Running shoes

-- Batch recommendations: for every user, return the 2 nearest products.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
    FROM users q
    INNER JOIN products t
      APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 user_id  name   product_id  product
 -------  -----  ----------  -------------------
 1        Alice  P1          Trail running shoes
 1        Alice  P2          Hiking boots
 2        Bob    P3          Office shoes
 2        Bob    P4          Sandals

-- Pre-filter the target table via a subquery (EU products only).
> SELECT q.user_id, q.name, t.product_id, t.name AS product, t.price
    FROM users q
    INNER JOIN (SELECT * FROM products WHERE country = 'EU') AS t
      APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 user_id  name   product_id  product              price
 -------  -----  ----------  -------------------  -----
 1        Alice  P1          Trail running shoes  120
 1        Alice  P2          Hiking boots         180
 2        Bob    P5          Running shoes        110
 2        Bob    P2          Hiking boots         180

-- LEFT OUTER returns every query row. Carol's embedding has zero magnitude,
-- so vector_cosine_similarity returns NULL for all comparisons and her row
-- is preserved with NULL target columns.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
    FROM users q
    LEFT OUTER JOIN products t
      APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 user_id  name   product_id  product
 -------  -----  ----------  -------------------
 1        Alice  P1          Trail running shoes
 1        Alice  P2          Hiking boots
 2        Bob    P3          Office shoes
 2        Bob    P4          Sandals
 3        Carol  NULL        NULL

-- EXACT returns the exact top-k under the ranking expression.
> SELECT t.product_id, t.name
    FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
    INNER JOIN products t
      EXACT NEAREST 3 BY DISTANCE vector_l2_distance(q.embedding, t.embedding);
 product_id  name
 ----------  -------------------
 P1          Trail running shoes
 P2          Hiking boots
 P5          Running shoes

Syntax​

Parameters​

Notes​

Asymmetry​

Streaming​

Embedding inputs​

Common error conditions​

Examples​

Related articles​