theta_union_agg

Aggregate function: returns the compact binary representation of the Apache DataSketches Theta sketch formed by the union of the Theta sketches in the input column.

Syntax

Python
from pyspark.databricks.sql import functions as dbf

dbf.theta_union_agg(col=<col>, lgNomEntries=<lgNomEntries>)

Parameters

Parameter | Type | Description
col | pyspark.sql.Column or column name | The column containing Theta sketches to union.
lgNomEntries | pyspark.sql.Column or int, optional | The log-base-2 of nominal entries for the union operation (must be between 4 and 26, defaults to 12).
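To make the lgNomEntries range concrete: the nominal entry count is 2 raised to that value, so the default of 12 corresponds to 4096 nominal entries. The helper below is hypothetical (it is not part of pyspark or the Databricks API) and only mirrors the bounds documented above.

```python
def nominal_entries(lg_nom_entries: int = 12) -> int:
    """Return 2**lg_nom_entries after checking the documented 4..26 range.

    Hypothetical helper for illustration only; not a pyspark/Databricks function.
    """
    if not 4 <= lg_nom_entries <= 26:
        raise ValueError("lgNomEntries must be between 4 and 26")
    return 2 ** lg_nom_entries


print(nominal_entries())    # default lgNomEntries=12 -> 4096
print(nominal_entries(4))   # smallest allowed value -> 16
print(nominal_entries(26))  # largest allowed value -> 67108864
```

Larger values let the sketch retain more entries, which generally trades memory for accuracy.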

Returns

pyspark.sql.Column: The binary representation of the merged Theta Sketch.

Examples

Python
from pyspark.databricks.sql import functions as dbf

# Build one Theta sketch per input DataFrame.
df1 = spark.createDataFrame([1, 2, 2, 3], "INT")
df1 = df1.agg(dbf.theta_sketch_agg("value").alias("sketch"))
df2 = spark.createDataFrame([4, 5, 5, 6], "INT")
df2 = df2.agg(dbf.theta_sketch_agg("value").alias("sketch"))

# Stack the two sketch rows, union them, and estimate the distinct count.
df3 = df1.union(df2)
df3.agg(dbf.theta_sketch_estimate(dbf.theta_union_agg("sketch"))).show()
Output
+--------------------------------------------------+
|theta_sketch_estimate(theta_union_agg(sketch, 12))|
+--------------------------------------------------+
|                                                 6|
+--------------------------------------------------+
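For intuition about why the union of the two sketches above estimates 6 distinct values, here is a toy, pure-Python model of a Theta-style sketch and its union. This is a conceptual sketch under stated assumptions, not the DataSketches or Databricks implementation: the class `ToyThetaSketch`, the function `union`, and the SHA-256 based hash are all hypothetical and chosen only for readability.

```python
import hashlib

HASH_SPACE = 2 ** 64  # hashes are integers in [0, HASH_SPACE)


def _hash(value) -> int:
    # Assumption: SHA-256 stands in for the hash a real sketch library would use.
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Keeps the hashes that fall below a threshold (theta)."""

    def __init__(self, lg_nom_entries: int = 12):
        self.k = 2 ** lg_nom_entries  # nominal entries
        self.theta = HASH_SPACE       # threshold starts at "keep everything"
        self.hashes = set()

    def update(self, value) -> None:
        h = _hash(value)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def _trim(self) -> None:
        # Once more than k hashes are held, lower theta to the (k+1)-th
        # smallest hash and keep only the k hashes below it.
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[: self.k])

    def estimate(self) -> float:
        # Retained count scaled by the fraction of hash space that theta covers.
        return len(self.hashes) * HASH_SPACE / self.theta


def union(sketches, lg_nom_entries: int = 12) -> ToyThetaSketch:
    # The union uses the smallest theta of its inputs and keeps every
    # input hash below that threshold; duplicates collapse in the set.
    result = ToyThetaSketch(lg_nom_entries)
    result.theta = min((s.theta for s in sketches), default=HASH_SPACE)
    for s in sketches:
        result.hashes.update(h for h in s.hashes if h < result.theta)
    result._trim()
    return result


a, b = ToyThetaSketch(), ToyThetaSketch()
for v in [1, 2, 2, 3]:
    a.update(v)
for v in [4, 5, 5, 6]:
    b.update(v)
merged = union([a, b])
print(merged.estimate())
```

With only six distinct inputs and 4096 nominal entries, nothing is trimmed, so the toy estimate is exact. Estimates become approximate only once the number of distinct values exceeds the nominal entries.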