clusterBy (DataStreamWriter)

Clusters the output by the given columns. Records with similar values on the clustering columns are grouped together in the same file. Clustering improves query efficiency by allowing queries with predicates on the clustering columns to skip unnecessary data. Unlike partitioning, clustering can be used on high-cardinality columns.
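The data-skipping benefit can be illustrated with a toy model that is not Spark's implementation, just a sketch of the idea: records are sorted by the clustering column into files, each file records min/max statistics for that column, and a scan with an equality predicate only opens files whose range can match. All names below (`write_clustered`, `scan`, `user_id`) are hypothetical.

```python
# Toy model of clustering + data skipping (NOT Spark internals).
def write_clustered(records, key, n_files=3):
    # Sort by the clustering column so similar values land in the same file,
    # then split into files and keep per-file min/max stats for the column.
    ordered = sorted(records, key=lambda r: r[key])
    size = -(-len(ordered) // n_files)  # ceiling division
    files = []
    for i in range(0, len(ordered), size):
        chunk = ordered[i:i + size]
        files.append({"rows": chunk, "min": chunk[0][key], "max": chunk[-1][key]})
    return files

def scan(files, key, value):
    # Data skipping: open only files whose [min, max] range can contain value.
    hits, opened = [], 0
    for f in files:
        if f["min"] <= value <= f["max"]:
            opened += 1
            hits.extend(r for r in f["rows"] if r[key] == value)
    return hits, opened

rows = [{"user_id": i, "v": i * i} for i in range(90)]
files = write_clustered(rows, "user_id")
hits, opened = scan(files, "user_id", 7)
# Because user_id values are grouped, only 1 of the 3 files is opened.
```

Without clustering, matching values could appear in every file and none could be skipped; because clustering works on the sorted value range rather than on one file per distinct value, it remains practical for high-cardinality columns like `user_id`.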

Syntax

clusterBy(*cols)

Parameters

Parameter   Type           Description
*cols       str or list    Names of the columns to cluster by.

Returns

DataStreamWriter
    The same writer, so further configuration calls can be chained.

Examples

Python
df = spark.readStream.format("rate").load()
df.writeStream.clusterBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>

Cluster a Rate source stream by timestamp and write to Parquet:

Python
import tempfile
import time
with tempfile.TemporaryDirectory(prefix="clusterBy1") as d:
    with tempfile.TemporaryDirectory(prefix="clusterBy2") as cp:
        df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        q = df.writeStream.clusterBy("timestamp").format("parquet").option(
            "checkpointLocation", cp).start(d)
        time.sleep(5)
        q.stop()
        spark.read.schema(df.schema).parquet(d).show()