shuffle

Returns a random permutation of the given array. The shuffle function is non-deterministic: the order of the elements in the output array can differ from one execution to the next.
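
To make "random permutation" concrete, here is a minimal pure-Python analogy. It is not Spark's implementation: `shuffled_copy` is a hypothetical helper, it uses Python's `random` module rather than Spark's generator, and the orders it produces will not match the Spark outputs shown in the examples. It only illustrates that the result has the same elements (including nulls and duplicates) in a different order, and that within this analogy a fixed seed gives a repeatable order.

```python
import random


def shuffled_copy(values, seed=None):
    """Return a new list holding the same elements in random order.

    Hypothetical helper for illustration only; this is not how Spark
    implements shuffle, and its results will not match Spark's.
    """
    rng = random.Random(seed)  # seed=None lets Python pick its own seed
    result = list(values)      # copy, so the input list is left untouched
    rng.shuffle(result)        # in-place random permutation of the copy
    return result


data = [1, 20, None, 5]
first = shuffled_copy(data, seed=123)
second = shuffled_copy(data, seed=123)

# Same seed, same order (in this Python analogy).
print(first == second)
# The input is unchanged and no element is lost or duplicated.
print(data, len(first) == len(data))
```

Calling `shuffled_copy(data)` without a seed will generally give a different order on each call, which mirrors the non-deterministic behavior described above.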

Syntax

Python
from pyspark.sql import functions as sf

sf.shuffle(col, seed=None)

Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| col | pyspark.sql.Column or str | The name of the column or expression to be shuffled. |
| seed | pyspark.sql.Column or int, optional | Seed value for the random generator. |

Returns

pyspark.sql.Column: A new column that contains an array of elements in random order.

Examples

Example 1: Shuffling a simple array

Python
import pyspark.sql.functions as sf
df = spark.sql("SELECT ARRAY(1, 20, 3, 5) AS data")
df.select("*", sf.shuffle(df.data, sf.lit(123))).show()
Output
+-------------+-------------+
|         data|shuffle(data)|
+-------------+-------------+
|[1, 20, 3, 5]|[5, 1, 20, 3]|
+-------------+-------------+

Example 2: Shuffling an array with null values

Python
import pyspark.sql.functions as sf
df = spark.sql("SELECT ARRAY(1, 20, NULL, 5) AS data")
df.select("*", sf.shuffle(sf.col("data"), 234)).show()
Output
+----------------+----------------+
|            data|   shuffle(data)|
+----------------+----------------+
|[1, 20, NULL, 5]|[NULL, 5, 20, 1]|
+----------------+----------------+

Example 3: Shuffling an array with duplicate values

Python
import pyspark.sql.functions as sf
df = spark.sql("SELECT ARRAY(1, 2, 2, 3, 3, 3) AS data")
df.select("*", sf.shuffle("data", 345)).show()
Output
+------------------+------------------+
|              data|     shuffle(data)|
+------------------+------------------+
|[1, 2, 2, 3, 3, 3]|[2, 3, 3, 1, 2, 3]|
+------------------+------------------+