theta_sketch_agg aggregate function
Applies to: Databricks SQL
Databricks Runtime 18.0 and above
Creates a Datasketches Theta Sketch from input values for approximate distinct count estimation. The Theta Sketch algorithm provides probabilistic counting of unique values with configurable accuracy.
Syntax
theta_sketch_agg ( expr [, lgNomEntries ] )
Arguments
- expr: The expression to aggregate. Accepted types are
INTEGER,LONG,FLOAT,DOUBLE,STRING,BINARY,ARRAY<INTEGER>, andARRAY<LONG>. - lgNomEntries: An optional
INTEGERliteral specifying the log-base-2 of the nominal entries (number of buckets). Must be between 4 and 26, inclusive. The default is 12 (4,096 buckets). Higher values provide better accuracy but use more memory.
Returns
A BINARY value containing the serialized compact Theta Sketch.
Notes
NULLvalues are ignored during aggregation.- Empty strings, empty byte arrays, and empty arrays are ignored.
- The
lgNomEntriesparameter must be a constant value. - The actual number of buckets in the sketch is 2^lgNomEntries (for example, lgNomEntries=12 means 4,096 buckets).
- Use
theta_sketch_estimatefunction to obtain the distinct count estimate from the resulting sketch.
Error messages
Examples
SQL
-- Create sketch with default lgNomEntries=12 and get estimate
> SELECT theta_sketch_estimate(theta_sketch_agg(col)) FROM VALUES (1), (1), (2), (2), (3) tab(col);
3
-- Create sketch with custom lgNomEntries=20 for higher accuracy
> SELECT theta_sketch_estimate(theta_sketch_agg(col, 20)) FROM VALUES (1), (2), (3), (4), (5) tab(col);
5