
partitionBy (DataStreamWriter)

Partitions the output by the given columns on the file system. The output is laid out on the file system similarly to Hive's partitioning scheme, with one subdirectory per partition value.
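For intuition, Hive-style partitioning encodes each partition column as a `column=value` directory segment under the output path. The following pure-Python sketch (not part of the Spark API; the function name is hypothetical) builds such a path for illustration:

```python
from pathlib import PurePosixPath

def hive_partition_path(base, partitions):
    """Illustrative only: build a Hive-style partition path of the form
    base/col1=val1/col2=val2/... Spark constructs these directory names
    internally when writing partitioned output."""
    segments = [f"{col}={val}" for col, val in partitions]
    return str(PurePosixPath(base, *segments))

print(hive_partition_path("/data/events", [("year", 2024), ("month", 1)]))
# /data/events/year=2024/month=1
```

A file sink partitioned by `year` and `month` would therefore produce one such subdirectory per distinct `(year, month)` pair.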

Syntax

partitionBy(*cols)

Parameters

*cols : str or list
    Names of the columns to partition by.

Returns

DataStreamWriter

Examples

Python
df = spark.readStream.format("rate").load()
df.writeStream.partitionBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>

Partition a Rate source stream by timestamp and write to Parquet:

Python
import tempfile
import time
with tempfile.TemporaryDirectory(prefix="partitionBy1") as d:
    with tempfile.TemporaryDirectory(prefix="partitionBy2") as cp:
        df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        q = df.writeStream.partitionBy("timestamp").format("parquet").option(
            "checkpointLocation", cp).start(d)
        time.sleep(5)
        q.stop()
        spark.read.schema(df.schema).parquet(d).show()