bucketBy

Buckets the output by the given columns. If specified, the output is laid out on the file system in a way similar to Hive's bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive's bucketing.
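
As a rough mental model of bucketing, each row is assigned to one of numBuckets buckets by hashing its bucket column value, so rows with the same value always land in the same bucket. The sketch below is plain Python, not Spark: `assign_bucket` is a hypothetical helper, and `zlib.crc32` is a stand-in hash chosen only because it is deterministic. It is not the hash function Spark or Hive uses, so the bucket ids it prints will not match what Spark writes.

```python
import zlib
from collections import defaultdict


def assign_bucket(key: str, num_buckets: int) -> int:
    """Map a key to a bucket id in [0, num_buckets).

    zlib.crc32 is an assumed stand-in hash for illustration only;
    Spark and Hive each use their own, different hash functions.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


rows = [(100, "Alice"), (120, "Alice"), (140, "Bob")]
num_buckets = 2

# Group rows by the bucket id derived from the "name" column.
buckets = defaultdict(list)
for age, name in rows:
    buckets[assign_bucket(name, num_buckets)].append((age, name))

# Rows that share a name always end up in the same bucket.
for bucket_id in sorted(buckets):
    print(bucket_id, buckets[bucket_id])
```

Because the bucket id depends only on the hash function and the bucket count, two systems that use different hash functions place the same row in different buckets, which is why the layouts are not interchangeable.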

Syntax

bucketBy(numBuckets, col, *cols)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| numBuckets | int | The number of buckets to save. |
| col | str, list, or tuple | A column name, or a list of names. |
| *cols | str, optional | Additional column names. Must be empty if col is a list. |
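
The rules for col and *cols can be sketched as a small validation function. This is an illustrative approximation in plain Python, not the library's actual implementation: `normalize_bucket_columns` is a hypothetical name, and the exception types and messages are assumptions.

```python
def normalize_bucket_columns(num_buckets, col, *cols):
    """Hypothetical helper: return the bucket column names as a flat list.

    Mirrors the documented rules: col is a name or a list/tuple of names,
    and cols must be empty when col is a list/tuple.
    """
    if not isinstance(num_buckets, int):
        raise TypeError("numBuckets must be an int")
    if isinstance(col, (list, tuple)):
        if cols:
            raise ValueError("cols must be empty when col is a list or tuple")
        names = list(col)
    else:
        names = [col, *cols]
    if not names or not all(isinstance(name, str) for name in names):
        raise TypeError("at least one column is required and every name must be a str")
    return names


print(normalize_bucket_columns(2, "name"))           # ['name']
print(normalize_bucket_columns(2, "name", "age"))    # ['name', 'age']
print(normalize_bucket_columns(2, ["name", "age"]))  # ['name', 'age']
```

Under these rules, bucketBy(2, "name", "age") and bucketBy(2, ["name", "age"]) name the same columns, while passing a list together with extra names is rejected.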

Returns

DataFrameWriter

Notes

Applies to file-based data sources when used in combination with DataFrameWriter.saveAsTable.

Examples

Write a DataFrame into a bucketed table, and read it back.

Python
spark.sql("DROP TABLE IF EXISTS bucketed_table")
spark.createDataFrame(
    [(100, "Alice"), (120, "Alice"), (140, "Bob")],
    schema=["age", "name"],
).write.bucketBy(2, "name").mode("overwrite").saveAsTable("bucketed_table")

spark.read.table("bucketed_table").sort("age").show()
# +---+-----+
# |age| name|
# +---+-----+
# |100|Alice|
# |120|Alice|
# |140|  Bob|
# +---+-----+

spark.sql("DROP TABLE bucketed_table")