Pular para o conteúdo principal

approx_top_k_accumulate aggregate function

Applies to: check marked yes Databricks Runtime 18.1 and above

Accumulates items into an Apache DataSketches ItemsSketch for approximate frequent items (top-K) estimation. The sketch can later be combined with other sketches using approx_top_k_combine or estimated directly using approx_top_k_estimate.

Syntax

approx_top_k_accumulate ( expr [, maxItemsTracked ] )

Arguments

  • expr: The expression to accumulate. Accepted types are BOOLEAN, BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIMESTAMP, TIMESTAMP_NTZ, STRING, and DECIMAL.
  • maxItemsTracked: An optional positive INTEGER literal specifying the maximum number of distinct items to track in the sketch. Must be between 1 and 1,000,000, inclusive. The default is 10,000. Higher values provide better accuracy for finding frequent items but use more memory.

Returns

A STRUCT value containing the serialized sketch state with the following fields:

  • sketch: BINARY — the serialized ItemsSketch
  • maxItemsTracked: INTEGER — the maximum items tracked setting
  • itemDataType: the data type of the items (for type safety)
  • itemDataTypeDDL: STRING — DDL representation of the item data type

Notes

  • NULL values are tracked separately and included in results when they are among the most frequent.
  • The maxItemsTracked parameter must be a constant value.
  • The sketch uses the ItemsSketch algorithm from the Apache DataSketches library.
  • Use approx_top_k_estimate to obtain the top-K frequent items from the resulting sketch.
  • Use approx_top_k_combine to merge multiple sketches before estimation.

Error messages

Examples

SQL
-- Create sketch with default maxItemsTracked=10000 and get top-5 items
> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr)) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
[{"item":0,"count":2},{"item":4,"count":2},{"item":1,"count":2},{"item":2,"count":1},{"item":3,"count":1}]

-- Create sketch with custom maxItemsTracked=100 and get top-2 items
> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr, 100), 2) FROM VALUES ('a'), ('b'), ('c'), ('c'), ('c'), ('c'), ('d'), ('d') AS tab(expr);
[{"item":"c","count":4},{"item":"d","count":2}]