percentile_approx
Returns the approximate percentile of the numeric column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of the col values is less than or equal to that value.
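For instance (a minimal sketch, assuming an active SparkSession named spark, as in the examples below), the result is always an actual value drawn from the column rather than an interpolated one:
from pyspark.sql import functions as sf
# Approximate median of the integers 1..10; by the definition above,
# the result is one of the column's own values, not an interpolation.
df = spark.range(1, 11)
df.select(sf.percentile_approx("id", 0.5)).show()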
Syntax
from pyspark.sql import functions as sf
sf.percentile_approx(col, percentage, accuracy=10000)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | Column or str | Input column. |
| percentage | Column, float, or list/tuple of floats | Percentage in decimal (must be between 0.0 and 1.0). When percentage is an array, each value must be between 0.0 and 1.0, and an array of percentiles is returned. |
| accuracy | Column or float | A positive numeric literal which controls approximation accuracy at the cost of memory. A higher value yields better accuracy; 1.0/accuracy is the relative error of the approximation (default: 10000). |
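The accuracy/memory trade-off can be seen directly; a minimal sketch, assuming an active SparkSession named spark: a lower accuracy setting permits a larger relative error than the default.
from pyspark.sql import functions as sf
# accuracy=100 bounds the relative error at 1.0/100 = 1%;
# the default 10000 tightens it to 0.01%.
df = spark.range(0, 10000)
df.select(
    sf.percentile_approx("id", 0.5, 100).alias("coarse"),
    sf.percentile_approx("id", 0.5, 10000).alias("fine"),
).show()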
Returns
pyspark.sql.Column: the approximate percentile of the numeric column, or an array of approximate percentiles when percentage is an array.
Examples
Example 1: Calculate approximate percentiles
from pyspark.sql import functions as sf
# Build a single-partition DataFrame with three normally distributed
# groups centered at 0, 10, and 20.
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
# Quartiles of the whole column; passing a list of percentages
# returns an array column.
df.select(
    sf.percentile_approx("value", [0.25, 0.5, 0.75], 1000000)
).show(truncate=False)
+----------------------------------------------------------+
|percentile_approx(value, array(0.25, 0.5, 0.75), 1000000) |
+----------------------------------------------------------+
|[0.7264430125286..., 9.98975299938..., 19.335304783039...]|
+----------------------------------------------------------+
Example 2: Calculate approximate percentile by group
from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
# Per-group approximate median; percentage and accuracy may also be
# passed as Column literals.
df.groupBy("key").agg(
    sf.percentile_approx("value", sf.lit(0.5), sf.lit(1000000))
).sort("key").show()
+---+--------------------------------------+
|key|percentile_approx(value, 0.5, 1000000)|
+---+--------------------------------------+
| 0| -0.03519435193070...|
| 1| 9.990389751837...|
| 2| 19.967859769284...|
+---+--------------------------------------+
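Example 3: Calculate approximate percentile with Spark SQL
A minimal sketch, assuming percentile_approx is available as a SQL built-in (Spark 2.1+; the accuracy argument is optional there and defaults to 10000), reusing the DataFrame from Example 2 as a temporary view.
from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
spark.range(0, 1000, 1, 1).select(key, value).createOrReplaceTempView("tbl")
# Per-group approximate median via the SQL built-in.
spark.sql(
    "SELECT key, percentile_approx(value, 0.5) AS median "
    "FROM tbl GROUP BY key ORDER BY key"
).show()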