approx_count_distinct

Returns a new Column that estimates the number of distinct elements in a specified column or group of columns.

Syntax

Python
from pyspark.sql import functions as sf

sf.approx_count_distinct(col, rsd=None)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| col | pyspark.sql.Column or column name | The label of the column to count distinct values in. |
| rsd | float, optional | The maximum allowed relative standard deviation (default = 0.05). If rsd < 0.01, it would be more efficient to use count_distinct. |

Returns

pyspark.sql.Column: A new Column object representing the approximate unique count.

Examples

Example 1: Counting distinct values in a single-column DataFrame of integers

Python
from pyspark.sql import functions as sf
df = spark.createDataFrame([1, 2, 2, 3], "int")
df.agg(sf.approx_count_distinct("value")).show()
Output
+----------------------------+
|approx_count_distinct(value)|
+----------------------------+
|                           3|
+----------------------------+

Example 2: Counting distinct values in a single-column DataFrame of strings

Python
from pyspark.sql import functions as sf
df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], ['fruit'])
df.agg(sf.approx_count_distinct("fruit")).show()
Output
+----------------------------+
|approx_count_distinct(fruit)|
+----------------------------+
|                           3|
+----------------------------+

Example 3: Counting distinct values in a DataFrame with multiple columns

Python
from pyspark.sql import functions as sf
df = spark.createDataFrame(
    [("Alice", 1), ("Alice", 2), ("Bob", 3), ("Bob", 3)], ["name", "value"])
df = df.withColumn("combined", sf.struct("name", "value"))
df.agg(sf.approx_count_distinct(df.combined)).show()
Output
+-------------------------------+
|approx_count_distinct(combined)|
+-------------------------------+
|                              3|
+-------------------------------+

Example 4: Counting distinct values with a specified relative standard deviation

Python
from pyspark.sql import functions as sf
spark.range(100000).agg(
    sf.approx_count_distinct("id").alias('with_default_rsd'),
    sf.approx_count_distinct("id", 0.1).alias('with_rsd_0.1')
).show()
Output
+----------------+------------+
|with_default_rsd|with_rsd_0.1|
+----------------+------------+
|           95546|      102065|
+----------------+------------+
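
To read the Example 4 output against the rsd parameter, it can help to compute how far each estimate is from the true count. The sketch below is plain Python and does not call Spark. The helper `relative_error` is a hypothetical name introduced for illustration, and the estimates are copied from the output above. Because rsd describes a standard deviation and not a hard limit, an individual estimate could in principle fall outside it.

```python
def relative_error(estimate, true_count):
    """Signed relative error of an estimate against the known true count."""
    return (estimate - true_count) / true_count


true_count = 100_000  # spark.range(100000) holds exactly 100000 distinct ids

# (label, estimate from the Example 4 output, rsd used for that estimate)
results = [
    ("with_default_rsd", 95546, 0.05),
    ("with_rsd_0.1", 102065, 0.1),
]

for label, estimate, rsd in results:
    err = relative_error(estimate, true_count)
    print(f"{label}: relative error {err:+.1%} (rsd {rsd})")
```

Here the default-rsd estimate is off by roughly 4.5% and the rsd=0.1 estimate by roughly 2.1%, so both happen to sit within their respective rsd values.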