approx_top_k_accumulate aggregate function
Applies to: Databricks Runtime 18.1 and above
Accumulates items into an Apache DataSketches ItemsSketch for approximate frequent items (top-K) estimation. The sketch can later be combined with other sketches using approx_top_k_combine or estimated directly using approx_top_k_estimate.
Syntax
approx_top_k_accumulate ( expr [, maxItemsTracked ] )
Arguments
- expr: The expression to accumulate. Accepted types are
BOOLEAN,BYTE,SHORT,INTEGER,LONG,FLOAT,DOUBLE,DATE,TIMESTAMP,TIMESTAMP_NTZ,STRING, andDECIMAL. - maxItemsTracked: An optional positive
INTEGERliteral specifying the maximum number of distinct items to track in the sketch. Must be between 1 and 1,000,000, inclusive. The default is 10,000. Higher values provide better accuracy for finding frequent items but use more memory.
Returns
A STRUCT value containing the serialized sketch state with the following fields:
- sketch:
BINARY— the serialized ItemsSketch - maxItemsTracked:
INTEGER— the maximum items tracked setting - itemDataType: the data type of the items (for type safety)
- itemDataTypeDDL:
STRING— DDL representation of the item data type
Notes
NULLvalues are tracked separately and included in results when they are among the most frequent.- The
maxItemsTrackedparameter must be a constant value. - The sketch uses the ItemsSketch algorithm from the Apache DataSketches library.
- Use approx_top_k_estimate to obtain the top-K frequent items from the resulting sketch.
- Use approx_top_k_combine to merge multiple sketches before estimation.
Error messages
Examples
SQL
-- Create sketch with default maxItemsTracked=10000 and get top-5 items
> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr)) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
[{"item":0,"count":2},{"item":4,"count":2},{"item":1,"count":2},{"item":2,"count":1},{"item":3,"count":1}]
-- Create sketch with custom maxItemsTracked=100 and get top-2 items
> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr, 100), 2) FROM VALUES ('a'), ('b'), ('c'), ('c'), ('c'), ('c'), ('d'), ('d') AS tab(expr);
[{"item":"c","count":4},{"item":"d","count":2}]