approx_top_k
Returns the top k most frequently occurring item values in a string, boolean, date, timestamp, or numeric column col along with their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked where numRows is the total number of rows. k (default: 5) and maxItemsTracked (default: 10000) are both integer parameters. Higher values of maxItemsTracked provide better accuracy at the cost of increased memory usage. Columns that have fewer than maxItemsTracked distinct items will yield exact item counts. NULL values are included as their own value in the results.
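The error bound is simple arithmetic, and it's worth seeing how it scales with concrete inputs. A minimal sketch in plain Python (no Spark required; the helper name is our own, not part of any API):

```python
def approx_top_k_max_error(num_rows: int, max_items_tracked: int = 10000) -> float:
    """Worst-case error in any single count, per the bound
    2.0 * numRows / maxItemsTracked stated above."""
    return 2.0 * num_rows / max_items_tracked

# With the default maxItemsTracked of 10000, a billion-row column
# can be off by up to 200000 in each reported count.
print(approx_top_k_max_error(1_000_000_000))         # 200000.0

# Raising maxItemsTracked shrinks the bound proportionally,
# at the cost of more memory for the sketch.
print(approx_top_k_max_error(1_000_000_000, 40000))  # 50000.0
```

In other words, the bound depends only on the row count and the number of tracked items, not on the data distribution.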
Results are returned as an array of structs containing item values (with their original input type) and their occurrence count (long type), sorted by count descending.
Syntax
from pyspark.databricks.sql import functions as dbsf
dbsf.approx_top_k(col, k=5, maxItemsTracked=10000)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | Column or str | Column to find top k items from. |
| k | int | Number of top items to return. Default is 5. |
| maxItemsTracked | int | Maximum number of distinct items to track. Default is 10000. Higher values provide better accuracy at the cost of increased memory usage. |
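The documentation does not specify which sketch approx_top_k uses internally. The classic Space-Saving algorithm is one common choice for this kind of bounded-memory top-k counting, and a plain-Python sketch of it (an illustration, not Databricks' implementation) shows why a larger maxItemsTracked improves accuracy and why counts become exact when the distinct-item count fits within it:

```python
def space_saving_top_k(items, k=5, max_items_tracked=10000):
    """Illustrative Space-Saving sketch: keep at most max_items_tracked
    counters; when full, evict the minimum counter and let the newcomer
    inherit its count, which is where the bounded overestimate comes from."""
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_items_tracked:
            counters[item] = 1
        else:
            # Evict the smallest counter; the newcomer starts at
            # min + 1, overestimating its true count by at most min.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return sorted(counters.items(), key=lambda kv: -kv[1])[:k]

# Skewed data with more distinct items than the sketch can track:
data = ["hot"] * 500 + [f"cold{i}" for i in range(100)] * 5
top = space_saving_top_k(data, k=3, max_items_tracked=10)
# The heavy hitter keeps its exact count; rare-item counts may be
# inflated by inherited eviction counts.
print(top[0])  # ('hot', 500)
```

When the number of distinct items never exceeds max_items_tracked, no eviction happens and every count is exact, mirroring the exactness guarantee stated in the description above.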
Examples
from pyspark.sql.functions import col
from pyspark.databricks.sql.functions import approx_top_k
item = (col("id") % 3).alias("item")
df = spark.range(0, 1000, 1, 1).select(item)
df.select(
approx_top_k("item", 5).alias("top_k")
).printSchema()
root
|-- top_k: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- item: long (nullable = true)
| | |-- count: long (nullable = false)
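Because the example column has only three distinct items (0, 1, 2), far fewer than the default maxItemsTracked, the counts in top_k are exact. Outside a Databricks cluster, the expected values can be reproduced with a plain-Python count (a sketch of the expected result, not the Spark execution itself):

```python
from collections import Counter

# Reproduce the example's data: 1000 rows of id % 3.
items = [i % 3 for i in range(1000)]

# With only 3 distinct items (< maxItemsTracked), approx_top_k's
# counts are exact, so an exact counter gives the same item/count
# pairs, sorted by count descending.
top_k = Counter(items).most_common(5)
print(top_k)  # [(0, 334), (1, 333), (2, 333)]
```

Each (item, count) pair here corresponds to one struct element in the array schema shown above.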