Pular para o conteúdo principal

theta_sketch_agg aggregate function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 18.0 and above

Creates a Datasketches Theta Sketch from input values for approximate distinct count estimation. The Theta Sketch algorithm provides probabilistic counting of unique values with configurable accuracy.

Syntax

theta_sketch_agg ( expr [, lgNomEntries ] )

Arguments

  • expr: The expression to aggregate. Accepted types are INTEGER, LONG, FLOAT, DOUBLE, STRING, BINARY, ARRAY<INTEGER>, and ARRAY<LONG>.
  • lgNomEntries: An optional INTEGER literal specifying the log-base-2 of the nominal entries (number of buckets). Must be between 4 and 26, inclusive. The default is 12 (4,096 buckets). Higher values provide better accuracy but use more memory.

Returns

A BINARY value containing the serialized compact Theta Sketch.

Notes

  • NULL values are ignored during aggregation.
  • Empty strings, empty byte arrays, and empty arrays are ignored.
  • The lgNomEntries parameter must be a constant value.
  • The actual number of buckets in the sketch is 2^lgNomEntries (for example, lgNomEntries=12 means 4,096 buckets).
  • Use theta_sketch_estimate function to obtain the distinct count estimate from the resulting sketch.

Error messages

Examples

SQL
-- Create sketch with default lgNomEntries=12 and get estimate
> SELECT theta_sketch_estimate(theta_sketch_agg(col)) FROM VALUES (1), (1), (2), (2), (3) tab(col);
3

-- Create sketch with custom lgNomEntries=20 for higher accuracy
> SELECT theta_sketch_estimate(theta_sketch_agg(col, 20)) FROM VALUES (1), (2), (3), (4), (5) tab(col);
5