
approxQuantile (DataFrameStatFunctions)

Calculates the approximate quantiles of numerical columns of a DataFrame.

The result of this algorithm has the following deterministic bound: if the DataFrame has N elements and the quantile at probability p is requested up to error err, then the algorithm returns a sample x from the DataFrame whose exact rank is close to (p * N). More precisely, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
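
To make the bound concrete, the following minimal pure-Python sketch computes the allowed rank window and checks a candidate value against it. The helper names `rank_window` and `within_bound` are hypothetical and not part of Spark. The sketch assumes distinct values and takes rank(x) to be the number of elements less than or equal to x.

```python
import math


def rank_window(p, err, n):
    """Return the (lowest, highest) rank allowed by the documented bound."""
    return math.floor((p - err) * n), math.ceil((p + err) * n)


def within_bound(values, x, p, err):
    """Check whether x satisfies the bound for the given values.

    Assumption: values are distinct and rank(x) counts the elements <= x.
    """
    rank = sum(1 for v in values if v <= x)
    low, high = rank_window(p, err, len(values))
    return low <= rank <= high


values = list(range(1, 1001))  # N = 1000 distinct values
print(rank_window(0.5, 0.125, len(values)))     # (375, 625)
print(within_bound(values, 500, 0.5, 0.125))    # True
print(within_bound(values, 700, 0.5, 0.125))    # False
```

With p = 0.5 and err = 0.125 on 1000 elements, any returned value ranked between 375 and 625 would be an acceptable approximate median under this reading of the bound.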

This method implements a variation of the Greenwald-Khanna algorithm with some speed optimizations.

Syntax

approxQuantile(col, probabilities, relativeError)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| col | str, list, or tuple | A single column name, or a list of names for multiple columns. |
| probabilities | list or tuple of float | A list of quantile probabilities. Each number must be a float in the range [0, 1]. For example, 0.0 is the minimum, 0.5 is the median, and 1.0 is the maximum. |
| relativeError | float | The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Values greater than 1 give the same result as 1. |
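
The documented constraints on probabilities and relativeError can be checked before the call is made. The following is a minimal sketch of such a pre-check. `check_quantile_args` is a hypothetical helper, not a Spark function, and the errors Spark itself raises for invalid input may differ. Accepting integers such as 0 and 1 alongside floats is an assumption of this sketch.

```python
def check_quantile_args(probabilities, relative_error):
    """Validate arguments against the documented constraints.

    Hypothetical helper: it mirrors the parameter descriptions,
    not Spark's own argument checks.
    """
    if not isinstance(probabilities, (list, tuple)):
        raise TypeError("probabilities must be a list or tuple")
    for p in probabilities:
        is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
        if not is_number or not 0.0 <= p <= 1.0:
            raise ValueError(f"probability {p!r} is not a number in [0, 1]")
    if relative_error < 0:
        raise ValueError("relativeError must be >= 0")
    # Documented behavior: values greater than 1 give the same result as 1,
    # so the effective precision is capped at 1.0.
    return min(float(relative_error), 1.0)


print(check_quantile_args([0.0, 0.5, 1.0], 0.05))  # 0.05
print(check_quantile_args((0.25, 0.75), 3))        # 1.0
```

The return value is the effective relative error, which makes the cap at 1 explicit to the caller.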

Returns

list

If col is a string, returns a list of floats. If col is a list or tuple of strings, returns a list of lists of floats.
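
Because the shape of the result depends on the type of col, downstream code often has to branch on it. The sketch below pairs each requested probability with its returned quantile for both shapes. `label_quantiles` is a hypothetical helper that operates on an already returned result, and it assumes the inner lists follow the order of the column names passed in.

```python
def label_quantiles(col, probabilities, result):
    """Pair each requested probability with its returned quantile.

    Hypothetical helper; `result` is what approxQuantile returned for `col`.
    """
    if isinstance(col, str):
        # Single column name: result is a flat list of floats.
        return dict(zip(probabilities, result))
    # List or tuple of names: result holds one inner list per column.
    return {
        name: dict(zip(probabilities, values))
        for name, values in zip(col, result)
    }


probs = [0.0, 0.5, 1.0]
print(label_quantiles("values", probs, [1.0, 3.0, 5.0]))
# {0.0: 1.0, 0.5: 3.0, 1.0: 5.0}
print(label_quantiles(["col1", "col2"], probs,
                      [[1.0, 3.0, 5.0], [10.0, 30.0, 50.0]]))
# {'col1': {0.0: 1.0, 0.5: 3.0, 1.0: 5.0}, 'col2': {0.0: 10.0, 0.5: 30.0, 1.0: 50.0}}
```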

Notes

Null values are ignored in numerical columns before calculation. For columns containing only null values, an empty list is returned.
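
Since a column containing only null values yields an empty list, indexing the result directly can raise an IndexError. A minimal guard might look like the following. `median_or_none` is a hypothetical helper, and it assumes the result came from a single column name with one requested probability.

```python
def median_or_none(result):
    """Return the single requested quantile, or None for an all-null column.

    Hypothetical helper; `result` is the list returned by
    approxQuantile(col, [0.5], relativeError) for one column name.
    """
    # An empty list signals that the column held only null values.
    return result[0] if result else None


print(median_or_none([3.0]))  # 3.0
print(median_or_none([]))     # None
```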

Examples

Calculate quantiles for a single column.

Python
data = [(1,), (2,), (3,), (4,), (5,)]
df = spark.createDataFrame(data, ["values"])
df.stat.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)
# [1.0, 3.0, 5.0]

Calculate quantiles for multiple columns.

Python
data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
df = spark.createDataFrame(data, ["col1", "col2"])
df.stat.approxQuantile(["col1", "col2"], [0.0, 0.5, 1.0], 0.05)
# [[1.0, 3.0, 5.0], [10.0, 30.0, 50.0]]

Handle null values.

Python
data = [(1,), (None,), (3,), (4,), (None,)]
df = spark.createDataFrame(data, ["values"])
df.stat.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)
# [1.0, 3.0, 4.0]