window
Bucketize rows into one or more time windows given a timestamp-specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
The time column must be of pyspark.sql.types.TimestampType or pyspark.sql.types.TimestampNTZType.
Durations are provided as strings, e.g. '1 second', '1 day 12 hours', '2 minutes'. Valid
interval strings are 'week', 'day', 'hour', 'minute', 'second', 'millisecond', 'microsecond'.
If the slideDuration is not provided, the windows will be tumbling windows.
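For instance, a minimal sketch contrasting the two modes (assuming a SparkSession named spark and a DataFrame df with a timestamp column dt; the names here are illustrative):

from pyspark.databricks.sql import functions as dbf

# Tumbling: contiguous, non-overlapping 10-minute windows.
tumbling = df.groupBy(dbf.window('dt', '10 minutes')).agg(dbf.count('*').alias('n'))

# Sliding: 10-minute windows starting every 5 minutes, so consecutive windows overlap.
sliding = df.groupBy(dbf.window('dt', '10 minutes', '5 minutes')).agg(dbf.count('*').alias('n'))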
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start
window intervals. For example, in order to have hourly tumbling windows that start 15 minutes
past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
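A minimal sketch of that offset, again assuming a DataFrame df with a timestamp column dt:

from pyspark.databricks.sql import functions as dbf

# Hourly tumbling windows shifted to start 15 minutes past each hour,
# e.g. [12:15, 13:15), [13:15, 14:15), ...
shifted = df.groupBy(dbf.window('dt', '1 hour', startTime='15 minutes')).agg(dbf.count('*').alias('n'))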
The output column will be a struct called 'window' by default with the nested columns 'start'
and 'end', where 'start' and 'end' will be of pyspark.sql.types.TimestampType.
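Because 'window' is only the default name, the struct can be renamed with an alias when grouping; a small sketch, assuming a DataFrame df with a timestamp column dt and a numeric column v:

from pyspark.databricks.sql import functions as dbf

# Alias the window struct as 'w'; its nested fields are then addressable as w.start and w.end.
agg = df.groupBy(dbf.window('dt', '5 seconds').alias('w')).agg(dbf.sum('v').alias('total'))
agg.select('w.start', 'w.end', 'total').show(truncate=False)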
For the corresponding Databricks SQL function, see window grouping expression.
Syntax
from pyspark.databricks.sql import functions as dbf
dbf.window(timeColumn=<timeColumn>, windowDuration=<windowDuration>, slideDuration=<slideDuration>, startTime=<startTime>)
Parameters
| Parameter | Type | Description |
|---|---|---|
| `timeColumn` | `Column` or `str` | The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. |
| `windowDuration` | `str` | A string specifying the width of the window, e.g. `'1 second'`, `'2 minutes'`. |
| `slideDuration` | `str` (optional) | A new window will be generated every `slideDuration`. If not provided, the windows will be tumbling windows. |
| `startTime` | `str` (optional) | The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15..., provide `startTime` as `'15 minutes'`. |
Returns
pyspark.sql.Column: the column for computed results.
Examples
import datetime
from pyspark.databricks.sql import functions as dbf

# One row with a timestamp column 'dt' and a value column 'v'.
df = spark.createDataFrame([(datetime.datetime(2016, 3, 11, 9, 0, 7), 1)], ['dt', 'v'])

# Group rows into 5-second tumbling windows and sum 'v' within each window.
df2 = df.groupBy(dbf.window('dt', '5 seconds')).agg(dbf.sum('v'))

df2.show(truncate=False)
df2.printSchema()
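The nested start and end fields can then be flattened out of the default window struct; a short sketch continuing from df2 above (the unaliased aggregate column is assumed to be named sum(v)):

# Flatten the 'window' struct into plain timestamp columns.
df2.select(
    df2.window.start.alias('window_start'),
    df2.window.end.alias('window_end'),
    df2['sum(v)'],
).show(truncate=False)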