window
Bucketize rows into one or more time windows given a timestamp-specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
The time column must be of pyspark.sql.types.TimestampType or pyspark.sql.types.TimestampNTZType.
Durations are provided as strings, e.g. '1 second', '1 day 12 hours', '2 minutes'. Valid
interval strings are 'week', 'day', 'hour', 'minute', 'second', 'millisecond', 'microsecond'.
If the slideDuration is not provided, the windows will be tumbling windows.
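For instance, a minimal sketch contrasting the two modes (assuming a SparkSession named spark and a DataFrame df with a timestamp column dt; the names here are illustrative):

from pyspark.databricks.sql import functions as dbf

# Tumbling: contiguous, non-overlapping 10-minute windows.
tumbling = df.groupBy(dbf.window('dt', '10 minutes')).agg(dbf.count('*').alias('n'))

# Sliding: 10-minute windows starting every 5 minutes, so consecutive windows overlap.
sliding = df.groupBy(dbf.window('dt', '10 minutes', '5 minutes')).agg(dbf.count('*').alias('n'))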
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start
window intervals. For example, in order to have hourly tumbling windows that start 15 minutes
past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
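A minimal sketch of that offset, again assuming a DataFrame df with a timestamp column dt:

from pyspark.databricks.sql import functions as dbf

# Hourly tumbling windows shifted to start 15 minutes past each hour,
# e.g. [12:15, 13:15), [13:15, 14:15), ...
shifted = df.groupBy(dbf.window('dt', '1 hour', startTime='15 minutes')).agg(dbf.count('*').alias('n'))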
The output column will be a struct called 'window' by default with the nested columns 'start'
and 'end', where 'start' and 'end' will be of pyspark.sql.types.TimestampType.
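Because 'window' is only the default name, the struct can be renamed with an alias when grouping; a small sketch, assuming a DataFrame df with a timestamp column dt and a numeric column v:

from pyspark.databricks.sql import functions as dbf

# Alias the window struct as 'w'; its nested fields are then addressable as w.start and w.end.
agg = df.groupBy(dbf.window('dt', '5 seconds').alias('w')).agg(dbf.sum('v').alias('total'))
agg.select('w.start', 'w.end', 'total').show(truncate=False)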
For the corresponding Databricks SQL function, see window grouping expression.
Syntax
from pyspark.databricks.sql import functions as dbf
dbf.window(timeColumn=<timeColumn>, windowDuration=<windowDuration>, slideDuration=<slideDuration>, startTime=<startTime>)
Parameters
| Parameter | Type | Description |
|---|---|---|
| `timeColumn` | `Column` or `str` | The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. |
| `windowDuration` | `str` | A string specifying the width of the window, e.g. `'1 second'`, `'2 minutes'`. |
| `slideDuration` | `str` (optional) | A new window will be generated every `slideDuration`. If not provided, the windows will be tumbling windows. |
| `startTime` | `str` (optional) | The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15..., provide `startTime` as `'15 minutes'`. |
Returns
pyspark.sql.Column: the column for computed results.
Examples
import datetime
from pyspark.databricks.sql import functions as dbf

# One row with a timestamp column 'dt' and a value column 'v'.
df = spark.createDataFrame([(datetime.datetime(2016, 3, 11, 9, 0, 7), 1)], ['dt', 'v'])

# Group rows into 5-second tumbling windows and sum 'v' within each window.
df2 = df.groupBy(dbf.window('dt', '5 seconds')).agg(dbf.sum('v'))

df2.show(truncate=False)
df2.printSchema()
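The nested start and end fields can then be flattened out of the default window struct; a short sketch continuing from df2 above (the unaliased aggregate column is assumed to be named sum(v)):

# Flatten the 'window' struct into plain timestamp columns.
df2.select(
    df2.window.start.alias('window_start'),
    df2.window.end.alias('window_end'),
    df2['sum(v)'],
).show(truncate=False)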