bucketBy

Buckets the output by the given columns. If specified, the output is laid out on the file system in a way similar to Hive's bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive's bucketing.
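As a rough illustration of what bucketing means, the sketch below assigns each row to one of `numBuckets` groups by hashing the bucketing column and taking the result modulo the bucket count. It is plain Python and does not use Spark. The helper `assign_bucket` is hypothetical, and `zlib.crc32` is only a stand-in hash: Spark's actual bucket hash function is different, so the bucket ids here will not match what Spark writes.

```python
import zlib
from collections import defaultdict


def assign_bucket(value: str, num_buckets: int) -> int:
    """Map a value to a bucket id in [0, num_buckets).

    Hypothetical helper: crc32 is a stand-in, not Spark's hash function.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


rows = [(100, "Alice"), (120, "Alice"), (140, "Bob")]
num_buckets = 2

# Group rows by the bucket id derived from the "name" column.
buckets = defaultdict(list)
for age, name in rows:
    buckets[assign_bucket(name, num_buckets)].append((age, name))

for bucket_id in sorted(buckets):
    print(bucket_id, buckets[bucket_id])
```

The property that matters is that rows with equal values in the bucketing column always land in the same bucket, so both "Alice" rows end up together regardless of which hash function is used.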

Syntax

bucketBy(numBuckets, col, *cols)

Parameters

Parameter	Type	Description
`numBuckets`	int	The number of buckets to save.
`col`	str, list, or tuple	A column name, or a list of names.
`*cols`	str, optional	Additional column names. Must be empty if `col` is a list.

Returns

DataFrameWriter

Notes

Applies to file-based data sources, and only when used together with `DataFrameWriter.saveAsTable`.

Examples

Write a DataFrame into a bucketed table, and read it back.

Python
spark.sql("DROP TABLE IF EXISTS bucketed_table")
spark.createDataFrame([
    (100, "Alice"), (120, "Alice"), (140, "Bob")],
    schema=["age", "name"]
).write.bucketBy(2, "name").mode("overwrite").saveAsTable("bucketed_table")

spark.read.table("bucketed_table").sort("age").show()
# +---+-----+
# |age| name|
# +---+-----+
# |100|Alice|
# |120|Alice|
# |140|  Bob|
# +---+-----+

spark.sql("DROP TABLE bucketed_table")
