percentile_approx
Returns the approximate percentile of the numeric column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of the col values is less than or equal to that value.
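For instance (a minimal sketch, assuming an active SparkSession named spark, as in the examples below), the result is always an actual value drawn from the column rather than an interpolated one:
from pyspark.sql import functions as sf
# Approximate median of the integers 1..10; by the definition above,
# the result is one of the column's own values, not an interpolation.
df = spark.range(1, 11)
df.select(sf.percentile_approx("id", 0.5)).show()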
Syntax
from pyspark.sql import functions as sf
sf.percentile_approx(col, percentage, accuracy=10000)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | Column or str | Input column. |
| percentage | Column, float, or list/tuple of floats | Percentage in decimal (must be between 0.0 and 1.0). When percentage is an array, each value must be between 0.0 and 1.0, and an array of percentiles is returned. |
| accuracy | Column or float | A positive numeric literal which controls approximation accuracy at the cost of memory. A higher value yields better accuracy; 1.0/accuracy is the relative error of the approximation (default: 10000). |
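The accuracy/memory trade-off can be seen directly; a minimal sketch, assuming an active SparkSession named spark: a lower accuracy setting permits a larger relative error than the default.
from pyspark.sql import functions as sf
# accuracy=100 bounds the relative error at 1.0/100 = 1%;
# the default 10000 tightens it to 0.01%.
df = spark.range(0, 10000)
df.select(
    sf.percentile_approx("id", 0.5, 100).alias("coarse"),
    sf.percentile_approx("id", 0.5, 10000).alias("fine"),
).show()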
Returns
pyspark.sql.Column: the approximate percentile of the numeric column, or an array of approximate percentiles when percentage is an array.
Examples
Example 1: Calculate approximate percentiles
from pyspark.sql import functions as sf
# Build a single-partition DataFrame with three normally distributed
# groups centered at 0, 10, and 20.
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
# Quartiles of the whole column; passing a list of percentages
# returns an array column.
df.select(
    sf.percentile_approx("value", [0.25, 0.5, 0.75], 1000000)
).show(truncate=False)
+----------------------------------------------------------+
|percentile_approx(value, array(0.25, 0.5, 0.75), 1000000) |
+----------------------------------------------------------+
|[0.7264430125286..., 9.98975299938..., 19.335304783039...]|
+----------------------------------------------------------+
Example 2: Calculate approximate percentile by group
from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
# Per-group approximate median; percentage and accuracy may also be
# passed as Column literals.
df.groupBy("key").agg(
    sf.percentile_approx("value", sf.lit(0.5), sf.lit(1000000))
).sort("key").show()
+---+--------------------------------------+
|key|percentile_approx(value, 0.5, 1000000)|
+---+--------------------------------------+
| 0| -0.03519435193070...|
| 1| 9.990389751837...|
| 2| 19.967859769284...|
+---+--------------------------------------+
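Example 3: Calculate approximate percentile with Spark SQL
A minimal sketch, assuming percentile_approx is available as a SQL built-in (Spark 2.1+; the accuracy argument is optional there and defaults to 10000), reusing the DataFrame from Example 2 as a temporary view.
from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
spark.range(0, 1000, 1, 1).select(key, value).createOrReplaceTempView("tbl")
# Per-group approximate median via the SQL built-in.
spark.sql(
    "SELECT key, percentile_approx(value, 0.5) AS median "
    "FROM tbl GROUP BY key ORDER BY key"
).show()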