count_min_sketch

Returns a count-min sketch of a column with the given eps, confidence, and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before use. A count-min sketch is a probabilistic data structure that estimates how often each value occurs, using sub-linear space.
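To make the data structure concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not Spark's implementation: the class name `ToyCountMinSketch`, its methods, and the SHA-256-based hashing are all assumptions chosen for readability.

```python
import hashlib


class ToyCountMinSketch:
    """Illustrative count-min sketch (hypothetical, not Spark's class)."""

    def __init__(self, depth, width):
        self.depth = depth  # number of hash rows
        self.width = width  # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # Derive one bucket per row by salting the item with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Every row increments exactly one counter for the item.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )


sketch = ToyCountMinSketch(depth=4, width=64)
for word in ["a", "b", "a", "c", "a", "b"]:
    sketch.add(word)
```

Calling `sketch.estimate("a")` returns a value that is at least the true count of 3. Overestimates become less likely as the table gets wider and deeper.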

Syntax

Python
from pyspark.sql import functions as sf

sf.count_min_sketch(col, eps, confidence, seed=None)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| col | pyspark.sql.Column or str | Target column to compute on. |
| eps | pyspark.sql.Column or float | Relative error, must be positive. |
| confidence | pyspark.sql.Column or float | Confidence, must be positive and less than 1.0. |
| seed | pyspark.sql.Column or int, optional | Random seed. |

Returns

pyspark.sql.Column: count-min sketch of the column
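The eps and confidence parameters determine how large the returned sketch is. The snippet below is a sketch of that relationship, assuming the sizing rule used by Spark's `CountMinSketch` (width = ceil(2 / eps), depth = ceil(-ln(1 - confidence) / ln 2)). The helper name `sketch_dimensions` is hypothetical and not part of PySpark.

```python
import math


def sketch_dimensions(eps, confidence):
    """Return (depth, width) under the assumed Spark sizing rule."""
    # A smaller relative error needs more counters per row.
    width = math.ceil(2 / eps)
    # A higher confidence needs more independent hash rows.
    depth = math.ceil(-math.log(1 - confidence) / math.log(2))
    return depth, width


# Parameter pairs taken from the examples on this page.
dims_example_1 = sketch_dimensions(3.0, 0.1)
dims_example_2 = sketch_dimensions(1.0, 0.3)
dims_example_3 = sketch_dimensions(1.5, 0.2)
```

Under this assumption, the very loose settings used in the examples on this page give tiny tables of one row with one or two counters. That is convenient for readable output but not representative of practical settings.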

Examples

Example 1: Using columns as arguments

Python
from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch(sf.col("id"), sf.lit(3.0), sf.lit(0.1), sf.lit(1)))
).show(truncate=False)
Output
+------------------------------------------------------------------------+
|hex(count_min_sketch(id, 3.0, 0.1, 1))                                  |
+------------------------------------------------------------------------+
|0000000100000000000000640000000100000001000000005D8D6AB90000000000000064|
+------------------------------------------------------------------------+

Example 2: Using numbers as arguments

Python
from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch("id", 1.0, 0.3, 2))
).show(truncate=False)
Output
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.0, 0.3, 2))                                                  |
+----------------------------------------------------------------------------------------+
|0000000100000000000000640000000100000002000000005D96391C00000000000000320000000000000032|
+----------------------------------------------------------------------------------------+

Example 3: Using a long seed

Python
from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch("id", sf.lit(1.5), 0.2, 1111111111111111111))
).show(truncate=False)
Output
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.5, 0.2, 1111111111111111111))                                |
+----------------------------------------------------------------------------------------+
|00000001000000000000006400000001000000020000000044078BA100000000000000320000000000000032|
+----------------------------------------------------------------------------------------+
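The hex strings above can be read field by field. The following is a minimal decoder sketch, assuming the version 1 big-endian layout written by Spark's `CountMinSketch` serializer: version (int), total count (long), depth (int), width (int), one hash coefficient (long) per row, then the counters (long) row by row. The function name `decode_count_min_sketch` is hypothetical. In a JVM application, the supported route is to deserialize with `CountMinSketch.readFrom`.

```python
import struct


def decode_count_min_sketch(data: bytes) -> dict:
    """Parse serialized sketch bytes under the assumed V1 layout."""
    header = ">iqii"  # version, total count, depth, width (big-endian)
    version, total_count, depth, width = struct.unpack_from(header, data, 0)
    offset = struct.calcsize(header)

    # One 8-byte hash coefficient per row.
    hash_a = list(struct.unpack_from(f">{depth}q", data, offset))
    offset += 8 * depth

    # depth * width 8-byte counters, stored row by row.
    cells = struct.unpack_from(f">{depth * width}q", data, offset)
    table = [list(cells[r * width:(r + 1) * width]) for r in range(depth)]

    return {
        "version": version,
        "total_count": total_count,
        "depth": depth,
        "width": width,
        "hash_a": hash_a,
        "table": table,
    }


# Hex output of Example 2 (eps=1.0, confidence=0.3, seed=2).
example_2_hex = (
    "0000000100000000000000640000000100000002"
    "000000005D96391C00000000000000320000000000000032"
)
decoded = decode_count_min_sketch(bytes.fromhex(example_2_hex))
```

For Example 2 this yields a total count of 100 and a single row of two counters, `[[50, 50]]`, so the 100 input rows were split evenly across the two buckets.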