groupBy

Groups the DataFrame by the specified columns so that aggregation can be performed on them. See GroupedData for all the available aggregate functions.

Syntax

groupBy(*cols: "ColumnOrNameOrOrdinal")

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| cols | list, str, int or Column | The columns to group by. Each element can be a column name (str), an expression (Column), or a 1-based column ordinal (int). A single list of these is also accepted. |

Returns

GroupedData: The rows of the DataFrame grouped by the specified columns, ready for aggregation.
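
As a rough mental model of what grouping followed by aggregation computes, the following plain-Python sketch buckets rows by a key and then reduces each bucket. It is not Spark code and does not reflect how Spark executes the operation (Spark is distributed and lazy). The helper `group_rows` and the variable names are hypothetical and exist only for illustration. The rows match the ones used in the Examples section.

```python
from collections import defaultdict

# Same rows as the Examples section: (name, age)
rows = [("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)]


def group_rows(rows, key_indexes):
    """Hypothetical helper: bucket rows by the values at key_indexes."""
    groups = defaultdict(list)
    for row in rows:
        # With no key indexes the key is (), so every row lands in one group.
        key = tuple(row[i] for i in key_indexes)
        groups[key].append(row)
    return dict(groups)


# Group by "name" (position 0), then sum "age" (position 1) within each group.
by_name = group_rows(rows, (0,))
sum_age = {key[0]: sum(r[1] for r in members) for key, members in by_name.items()}
print(sum_age)  # {'Alice': 2, 'Bob': 9}

# No grouping columns: a single global group, so the average covers all rows.
everything = group_rows(rows, ())
ages = [r[1] for r in everything[()]]
avg_age = sum(ages) / len(ages)
print(avg_age)  # 2.75
```

The two results correspond to the `sum(age)` and `avg(age)` outputs shown under Examples.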

Notes

Column ordinals start at 1, unlike DataFrame.__getitem__, which is 0-based. For example, ordinal 1 and df[0] both refer to the first column.
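
To make the off-by-one distinction concrete, here is a minimal plain-Python sketch of how groupBy-style arguments could be mapped to column names. The function `resolve_group_keys` is a hypothetical illustration, not part of the PySpark API, and the error types it raises are assumptions chosen for the sketch. It shows two things: a single list argument is treated like separate arguments, and an ordinal `n` refers to position `n - 1` in 0-based terms.

```python
from typing import List, Sequence, Union

Key = Union[str, int]


def resolve_group_keys(columns: Sequence[str], *cols: Union[Key, List[Key]]) -> List[str]:
    """Hypothetical helper: map names and 1-based ordinals to column names."""
    # A single list argument is handled the same as passing its items directly.
    if len(cols) == 1 and isinstance(cols[0], list):
        cols = tuple(cols[0])

    resolved = []
    for key in cols:
        if isinstance(key, int) and not isinstance(key, bool):
            if key < 1:
                raise IndexError(f"column ordinal must be positive, got {key}")
            # Ordinals are 1-based, so ordinal 1 is columns[0].
            resolved.append(columns[key - 1])
        elif isinstance(key, str):
            resolved.append(key)
        else:
            raise TypeError(f"unsupported grouping key: {key!r}")
    return resolved


columns = ["name", "age"]
print(resolve_group_keys(columns, 1))            # ['name']
print(resolve_group_keys(columns, ["name", 2]))  # ['name', 'age']
print(columns[0])                                # name (0-based index 0 == ordinal 1)
```

Under this model, grouping by ordinal 1 on the example DataFrame is equivalent to grouping by "name".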

Examples

Python
# Assumes an active SparkSession bound to `spark`.
df = spark.createDataFrame(
    [("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)],
    schema=["name", "age"],
)

# No grouping columns: aggregate over the whole DataFrame.
df.groupBy().avg().show()
# +--------+
# |avg(age)|
# +--------+
# |    2.75|
# +--------+

# Group by "name" and sum "age" using a dictionary of column -> aggregate.
df.groupBy("name").agg({"age": "sum"}).sort("name").show()
# +-----+--------+
# | name|sum(age)|
# +-----+--------+
# |Alice|       2|
# |  Bob|       9|
# +-----+--------+

# Group by a Column expression and take the maximum of each numeric column.
df.groupBy(df.name).max().sort("name").show()
# +-----+--------+
# | name|max(age)|
# +-----+--------+
# |Alice|       2|
# |  Bob|       5|
# +-----+--------+

# Group by a list mixing a column name and a Column, then count rows per group.
df.groupBy(["name", df.age]).count().sort("name", "age").show()
# +-----+---+-----+
# | name|age|count|
# +-----+---+-----+
# |Alice|  2|    1|
# |  Bob|  2|    2|
# |  Bob|  5|    1|
# +-----+---+-----+