
listagg_distinct

Aggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter.
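The semantics can be sketched in plain Python (a hypothetical helper for illustration only, not Spark's implementation): nulls are dropped, duplicates removed, and the survivors joined with the delimiter. Note that Spark does not guarantee the order in which the distinct values are concatenated.

```python
def listagg_distinct_sketch(values, delimiter=None):
    """Hypothetical helper illustrating listagg_distinct semantics only."""
    seen = []
    for v in values:
        if v is not None and v not in seen:
            seen.append(v)  # keep first occurrence; Spark's order is not guaranteed
    if not seen:
        return None  # an all-null input yields NULL
    # A None delimiter concatenates the values with no separator
    return (delimiter or "").join(seen)

listagg_distinct_sketch(['a', 'b', None, 'c', 'b'])        # 'abc'
listagg_distinct_sketch(['a', 'b', None, 'c', 'b'], ', ')  # 'a, b, c'
listagg_distinct_sketch([None, None])                      # None
```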

Syntax

Python
import pyspark.sql.functions as sf

sf.listagg_distinct(col=<col>)

# With delimiter
sf.listagg_distinct(col=<col>, delimiter=<delimiter>)

Parameters

Parameter   Type                               Description
---------   ---------------------------------  -----------------------------------
col         pyspark.sql.Column or str          Target column to compute on.
delimiter   pyspark.sql.Column, str, or bytes  Optional. The delimiter used to
                                               separate the values. Defaults to
                                               None, which concatenates the values
                                               with no separator.

Returns

pyspark.sql.Column: a column containing the concatenated result.

Examples

Example 1: Using the listagg_distinct function.

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
df.select(sf.listagg_distinct('strings')).show()
Output
+-------------------------------+
|listagg(DISTINCT strings, NULL)|
+-------------------------------+
|                            abc|
+-------------------------------+

Example 2: Using the listagg_distinct function with a delimiter.

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
df.select(sf.listagg_distinct('strings', ', ')).show()
Output
+-----------------------------+
|listagg(DISTINCT strings, , )|
+-----------------------------+
|                      a, b, c|
+-----------------------------+

Example 3: Using the listagg_distinct function with a binary column and delimiter.

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([(b'\x01',), (b'\x02',), (None,), (b'\x03',), (b'\x02',)],
                           ['bytes'])
df.select(sf.listagg_distinct('bytes', b'\x42')).show()
Output
+------------------------------+
|listagg(DISTINCT bytes, X'42')|
+------------------------------+
|              [01 42 02 42 03]|
+------------------------------+
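The byte-level result in Example 3 can be reproduced in plain Python (a semantic sketch, not Spark's implementation): the None row is dropped, the duplicate b'\x02' is removed, and the remaining byte strings are joined with the delimiter byte 0x42.

```python
# Semantic sketch only, not Spark's implementation: drop None values,
# remove duplicates, then join the remaining byte strings with the delimiter.
rows = [b'\x01', b'\x02', None, b'\x03', b'\x02']
distinct = []
for v in rows:
    if v is not None and v not in distinct:
        distinct.append(v)

result = b'\x42'.join(distinct)
# result holds the byte sequence 01 42 02 42 03, matching [01 42 02 42 03] above
```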

Example 4: Using the listagg_distinct function on a column with all None values.

Python
import pyspark.sql.functions as sf
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("strings", StringType(), True)])
df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema)
df.select(sf.listagg_distinct('strings')).show()
Output
+-------------------------------+
|listagg(DISTINCT strings, NULL)|
+-------------------------------+
|                           NULL|
+-------------------------------+