clusterBy (DataStreamWriter)

Clusters the output by the given columns. Records with similar values on the clustering columns are grouped together in the same file. Clustering improves query efficiency by allowing queries with predicates on the clustering columns to skip unnecessary data. Unlike partitioning, clustering can be used on high-cardinality columns.
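The data-skipping benefit can be illustrated with a toy model that is not Spark's implementation, just a sketch of the idea: records are sorted by the clustering column into files, each file records min/max statistics for that column, and a scan with an equality predicate only opens files whose range can match. All names below (`write_clustered`, `scan`, `user_id`) are hypothetical.

```python
# Toy model of clustering + data skipping (NOT Spark internals).
def write_clustered(records, key, n_files=3):
    # Sort by the clustering column so similar values land in the same file,
    # then split into files and keep per-file min/max stats for the column.
    ordered = sorted(records, key=lambda r: r[key])
    size = -(-len(ordered) // n_files)  # ceiling division
    files = []
    for i in range(0, len(ordered), size):
        chunk = ordered[i:i + size]
        files.append({"rows": chunk, "min": chunk[0][key], "max": chunk[-1][key]})
    return files

def scan(files, key, value):
    # Data skipping: open only files whose [min, max] range can contain value.
    hits, opened = [], 0
    for f in files:
        if f["min"] <= value <= f["max"]:
            opened += 1
            hits.extend(r for r in f["rows"] if r[key] == value)
    return hits, opened

rows = [{"user_id": i, "v": i * i} for i in range(90)]
files = write_clustered(rows, "user_id")
hits, opened = scan(files, "user_id", 7)
# Because user_id values are grouped, only 1 of the 3 files is opened.
```

Without clustering, matching values could appear in every file and none could be skipped; because clustering works on the sorted value range rather than on one file per distinct value, it remains practical for high-cardinality columns like `user_id`.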

Syntax

clusterBy(*cols)

Parameters

Parameter   Type           Description
*cols       str or list    Names of the columns to cluster by.

Returns

DataStreamWriter
    The same writer, so further configuration calls can be chained.

Examples

Python
df = spark.readStream.format("rate").load()
df.writeStream.clusterBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>

Cluster a Rate source stream by timestamp and write to Parquet:

Python
import tempfile
import time
with tempfile.TemporaryDirectory(prefix="clusterBy1") as d:
    with tempfile.TemporaryDirectory(prefix="clusterBy2") as cp:
        df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        q = df.writeStream.clusterBy("timestamp").format("parquet").option(
            "checkpointLocation", cp).start(d)
        time.sleep(5)
        q.stop()
        spark.read.schema(df.schema).parquet(d).show()