approx_top_k aggregate function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 10.2 and above

Returns the top k most frequently occurring item values in an expr along with their approximate counts.

Syntax

approx_top_k(expr[, k[, maxItemsTracked]]) [FILTER ( WHERE cond ) ]

Arguments

  • expr: An expression of STRING, BOOLEAN, DATE, TIMESTAMP, or numeric type.

  • k: An optional INTEGER literal greater than 0. If k is not specified, it defaults to 5.

  • maxItemsTracked: An optional INTEGER literal greater than or equal to k. If maxItemsTracked is not specified, it defaults to 10000.

  • cond: An optional boolean expression filtering the rows used for aggregation.

Returns

Results are returned as an ARRAY of type STRUCT, where each STRUCT contains an item field for the value (with its original input type) and a count field (of type LONG) with the approximate number of occurrences. The array is sorted by count descending.

The aggregate function returns the top k most frequently occurring item values in an expression expr along with their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked where numRows is the total number of rows. Higher values of maxItemsTracked provide better accuracy at the cost of increased memory usage. Expressions that have fewer than maxItemsTracked distinct items will yield exact item counts. Results include NULL values as their own item in the results.

Examples

> SELECT approx_top_k(expr) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
 [{'item':4,'count':2},{'item':1,'count':2},{'item':0,'count':2},{'item':3,'count':1},{'item':2,'count':1}]

> SELECT approx_top_k(expr, 2) FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' AS tab(expr);
 [{'item':'c','count',4},{'item':'d','count':2}]

> SELECT approx_top_k(expr, 10, 100) FROM VALUES (0), (1), (1), (2), (2), (2) AS tab(expr);
 [{'item':2,'count':3},{'item':1,'count':2},{'item':0,'count':1}]