approx_top_k

Returns the top k most frequently occurring item values in a string, boolean, date, timestamp, or numeric column col, along with their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked, where numRows is the total number of rows. k (default: 5) and maxItemsTracked (default: 10000) are both integer parameters. Higher values of maxItemsTracked provide better accuracy at the cost of increased memory usage. Columns with fewer than maxItemsTracked distinct items yield exact item counts. NULL values are counted as a value of their own in the results.
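
To make the error bound concrete, here is a quick worked example with assumed numbers (not taken from this page):

```python
# Assumed illustration of the error bound above.
num_rows = 1_000_000          # numRows: total rows in the column
max_items_tracked = 10_000    # maxItemsTracked (the default)

# Each reported count may deviate from the true count by at most:
max_count_error = 2.0 * num_rows / max_items_tracked  # 200.0
```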

Results are returned as an array of structs, each containing an item value (with its original input type) and its occurrence count (long type), sorted by count in descending order.

Syntax

```python
from pyspark.databricks.sql import functions as dbsf

dbsf.approx_top_k(col, k=5, maxItemsTracked=10000)
```

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| col | pyspark.sql.Column or column name | Column to find the top k items from. |
| k | pyspark.sql.Column or int, optional | Number of top items to return. Default is 5. |
| maxItemsTracked | pyspark.sql.Column or int, optional | Maximum number of distinct items to track. Default is 10000. Higher values provide better accuracy at the cost of increased memory usage. |
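
Both optional parameters can be passed by keyword, as in the syntax above. A minimal sketch, assuming a DataFrame df with a hypothetical country column (names here are illustrative, not from this page):

```python
from pyspark.databricks.sql import functions as dbsf

# Hypothetical example: track up to 50000 distinct values for better
# accuracy, and return the 10 most frequent values of an assumed
# "country" column.
top_countries = df.select(
    dbsf.approx_top_k("country", k=10, maxItemsTracked=50000).alias("top_countries")
)
```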

Examples

```python
from pyspark.sql.functions import col
from pyspark.databricks.sql.functions import approx_top_k

# Build a single-partition DataFrame whose "item" column cycles
# through the three values 0, 1, and 2.
item = (col("id") % 3).alias("item")
df = spark.range(0, 1000, 1, 1).select(item)

df.select(
    approx_top_k("item", 5).alias("top_k")
).printSchema()
```
Output

```
root
 |-- top_k: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item: long (nullable = true)
 |    |    |-- count: long (nullable = false)
```
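
To inspect the values rather than just the schema, the same aggregate can be collected. With only three distinct items, well below the default maxItemsTracked, the counts here are exact; the expected output below is inferred from id % 3 over 1,000 rows, and the order of the two tied items is not guaranteed:

```python
# Collect the single aggregated row; with 3 distinct items (< maxItemsTracked),
# the reported counts are exact.
result = df.select(
    approx_top_k("item", 5).alias("top_k")
).first()

print(result["top_k"])
# Expected: [Row(item=0, count=334), Row(item=1, count=333), Row(item=2, count=333)]
```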