Window class

Utility functions for defining windows in DataFrames.

Supports Spark Connect

Class attributes

unboundedPreceding
    Boundary value representing the start of an unbounded window frame.

unboundedFollowing
    Boundary value representing the end of an unbounded window frame.

currentRow
    Boundary value representing the current row in a window frame.

Methods

orderBy(*cols)
    Creates a WindowSpec with the ordering defined.

partitionBy(*cols)
    Creates a WindowSpec with the partitioning defined.

rangeBetween(start, end)
    Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive); both boundaries are offsets relative to the current row's ORDER BY value.

rowsBetween(start, end)
    Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive); both boundaries are row offsets relative to the current row's position.

Notes

When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
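The two defaults above behave quite differently for tied ORDER BY values. As a plain-Python sketch of the semantics (an illustration only, not Spark code):

```python
# Plain-Python sketch of the two default frames (illustration, not Spark code).
ids = [1, 1, 2]  # a single partition, ordered by id

# No ordering -> unbounded frame: every row aggregates the whole partition.
unbounded = [sum(ids) for _ in ids]

# Ordering defined -> growing RANGE frame: each row aggregates all rows whose
# ORDER BY value is <= its own, so tied values share the same result.
growing = [sum(v for v in ids if v <= cur) for cur in ids]

print(unbounded)  # [4, 4, 4]
print(growing)    # [2, 2, 4]
```

Note that with the growing range frame, the two tied rows (id = 1) both receive 2, because each row's frame is defined by its ORDER BY value rather than its physical position.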

Examples

Basic window with ordering and row frame

Python
from pyspark.sql import Window

# ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
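This frame makes any aggregate cumulative: after sorting by the ORDER BY key, each row aggregates itself and every earlier row. A plain-Python sketch of that behavior (sample dates and values are made up for illustration):

```python
# Plain-Python sketch of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
# (illustration only, not Spark code).
rows = [("2024-01-01", 10), ("2024-01-02", 5), ("2024-01-03", 7)]  # (date, value)
rows.sort(key=lambda r: r[0])  # emulate ORDER BY date

running = []
total = 0
for date, value in rows:
    total += value          # each row adds itself to everything before it
    running.append((date, total))

print(running)  # [('2024-01-01', 10), ('2024-01-02', 15), ('2024-01-03', 22)]
```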

Partitioned window with range frame

Python
from pyspark.sql import Window

# PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
window = Window.orderBy("date").partitionBy("country").rangeBetween(-3, 3)

Row number within partition

Python
from pyspark.sql import Window, functions as sf

# Assumes an active SparkSession bound to `spark`
df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")], ["id", "category"]
)

# Show row number ordered by id within each category partition
window = Window.partitionBy("category").orderBy("id")
df.withColumn("row_number", sf.row_number().over(window)).show()

Running sum with row-based frame

Python
from pyspark.sql import Window, functions as sf

df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")], ["id", "category"]
)

# Sum id values from the current row to the next row within each partition
window = Window.partitionBy("category").orderBy("id").rowsBetween(Window.currentRow, 1)
df.withColumn("sum", sf.sum("id").over(window)).sort("id", "category", "sum").show()
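Because this is a row-based frame, each row's frame is determined by physical position after sorting, regardless of ties. A plain-Python sketch of `rowsBetween(Window.currentRow, 1)` on the same per-partition id lists (illustration only, not Spark code):

```python
# Plain-Python sketch of rowsBetween(Window.currentRow, 1): after sorting the
# partition, each row sums itself and the next physical row (if any).
def rows_current_to_next(ids):
    ids = sorted(ids)
    return [sum(ids[i:i + 2]) for i in range(len(ids))]

print(rows_current_to_next([1, 1, 2]))  # partition "a" -> [2, 3, 2]
print(rows_current_to_next([1, 2, 3]))  # partition "b" -> [3, 5, 3]
```

The last row of each partition has no following row, so its frame contains only itself.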

Running sum with range-based frame

Python
from pyspark.sql import Window, functions as sf

df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")], ["id", "category"]
)

# Sum id values from the current id value to id + 1 within each partition
window = Window.partitionBy("category").orderBy("id").rangeBetween(Window.currentRow, 1)
df.withColumn("sum", sf.sum("id").over(window)).sort("id", "category").show()
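The range-based frame differs from the row-based one whenever ids repeat: the frame is every row whose ORDER BY value falls in [current value, current value + 1], so duplicate values land in each other's frames. A plain-Python sketch (illustration only, not Spark code):

```python
# Plain-Python sketch of rangeBetween(Window.currentRow, 1): each row sums
# every row whose ORDER BY value lies in [cur, cur + 1].
def range_current_to_plus_one(ids):
    return [sum(v for v in ids if cur <= v <= cur + 1) for cur in sorted(ids)]

print(range_current_to_plus_one([1, 1, 2]))  # partition "a" -> [4, 4, 2]
print(range_current_to_plus_one([1, 2, 3]))  # partition "b" -> [3, 5, 3]
```

Compare partition "a" with the row-based example above: both id = 1 rows now produce 4 (1 + 1 + 2), because each one's frame covers all rows with id in [1, 2], not just the next physical row.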