percentile

Returns the exact percentile(s) of the numeric column col at the given percentage(s). Each percentage value must be between 0.0 and 1.0.

Syntax

Python
from pyspark.sql import functions as sf

sf.percentile(col, percentage, frequency=1)

Parameters

| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or str | The numeric column. |
| percentage | pyspark.sql.Column, float, list of floats, or tuple of floats | Percentage in decimal; each value must be between 0.0 and 1.0. |
| frequency | pyspark.sql.Column or int | A positive numeric literal which controls the frequency (weight) of each row (default: 1). |

Returns

pyspark.sql.Column: the exact percentile of the numeric column.
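
An "exact" percentile is obtained by sorting the values and linearly interpolating at position p * (n - 1). The pure-Python sketch below illustrates that computation for simple unweighted inputs; it is an illustration of the semantics, not Spark's implementation:

```python
def exact_percentile(values, p):
    """Exact percentile via linear interpolation at position p * (n - 1)."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(pos)        # index at or below the target position
    frac = pos - lo      # fractional distance toward the next index
    if frac == 0:
        return float(xs[lo])
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

# The median of [1, 2, 3, 4] interpolates halfway between 2 and 3.
print(exact_percentile([1, 2, 3, 4], 0.5))   # 2.5
print(exact_percentile([1, 2, 3, 4], 0.25))  # 1.75
```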

Examples

Example 1: Calculate multiple percentiles

Python
from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
df.select(
    sf.percentile("value", [0.25, 0.5, 0.75], sf.lit(1))
).show(truncate=False)
Output
+--------------------------------------------------------+
|percentile(value, array(0.25, 0.5, 0.75), 1) |
+--------------------------------------------------------+
|[0.7441991494121..., 9.9900713756..., 19.33740203080...]|
+--------------------------------------------------------+

Example 2: Calculate percentile by group

Python
from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
df.groupBy("key").agg(
    sf.percentile("value", sf.lit(0.5), sf.lit(1))
).sort("key").show()
Output
+---+-------------------------+
|key|percentile(value, 0.5, 1)|
+---+-------------------------+
| 0| -0.03449962216667...|
| 1| 9.990389751837...|
| 2| 19.967859769284...|
+---+-------------------------+
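
The frequency argument weights each row: a row with frequency f contributes as if its value appeared f times. Both examples above pass sf.lit(1), so every row counts once. The pure-Python sketch below illustrates that equivalence (an illustration of the semantics, not Spark code):

```python
def weighted_percentile(rows, p):
    """rows: (value, frequency) pairs. Equivalent to repeating each
    value `frequency` times and taking the exact percentile."""
    expanded = sorted(v for v, f in rows for _ in range(f))
    pos = p * (len(expanded) - 1)
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return float(expanded[lo])
    return expanded[lo] + frac * (expanded[lo + 1] - expanded[lo])

# (1, freq 3) and (10, freq 1) behave like the dataset [1, 1, 1, 10],
# whose median interpolates between the two middle 1s.
print(weighted_percentile([(1, 3), (10, 1)], 0.5))  # 1.0
```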