count_min_sketch aggregate function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

Returns a count-min sketch of all values in the group in column with the epsilon, confidence and seed.

Since Databricks Runtime 13.3 this function supports named parameter invocation.

Syntax

count_min_sketch ( [ALL | DISTINCT] column, epsilon, confidence, seed ) [FILTER ( WHERE cond ) ]

This function can also be invoked as a window function using the OVER clause.

Arguments

  • column: An expression that evaluates to an integral numeric, STRING, or BINARY.

  • epsilon: A DOUBLE literal greater than 0 describing the relative error.

  • confidence: A DOUBLE literal greater than 0 and less than 1.

  • seed: An INTEGER literal.

  • cond: An optional boolean expression filtering the rows used for aggregation.

Returns

A BINARY.

Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.

If DISTINCT is specified the function operates only on a unique set of expr values.

Examples

-- Named parameter invocation
> SELECT hex(count_min_sketch(column => col, confidence => 0.5d, epsilon => 0.5d, seed => 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000

> SELECT hex(count_min_sketch(DISTINCT col, 0.5d, 0.5d, 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000020000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000010000000000000000