Skip to main content

freqItems (DataFrame)

Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in "https://doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou". DataFrame.freqItems and DataFrameStatFunctions.freqItems are aliases.

Syntax

freqItems(cols: Union[List[str], Tuple[str]], support: Optional[float] = None)

Parameters

Parameter

Type

Description

cols

list or tuple

Names of the columns to calculate frequent items for as a list or tuple of strings.

support

float, optional

The frequency with which to consider an item 'frequent'. Default is 1%. The support must be greater than 1e-4.

Returns

DataFrame: DataFrame with frequent items.

Notes

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

Examples

Python
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
df = df.freqItems(["c1", "c2"])
df.select([sf.sort_array(c).alias(c) for c in df.columns]).show()
# +------------+------------+
# |c1_freqItems|c2_freqItems|
# +------------+------------+
# | [1, 3, 4]| [8, 10, 11]|
# +------------+------------+