string_agg_distinct

Aggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter. An alias of listagg_distinct.

Syntax

Python
import pyspark.sql.functions as sf

sf.string_agg_distinct(col=<col>)

# With delimiter
sf.string_agg_distinct(col=<col>, delimiter=<delimiter>)

Parameters

| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or str | Target column to compute on. |
| delimiter | pyspark.sql.Column, str, or bytes | Optional. The delimiter to separate the values. The default value is None. |

Returns

pyspark.sql.Column: the column for computed results.
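
The behavior described above can be modeled in plain Python, without Spark. The following is a minimal sketch, not the PySpark implementation: the helper name string_agg_distinct_sketch is hypothetical, and keeping values in first-seen order is an assumption of the sketch rather than something this page states about Spark.

```python
from typing import Iterable, Optional


def string_agg_distinct_sketch(
    values: Iterable[Optional[str]],
    delimiter: Optional[str] = None,
) -> Optional[str]:
    """Plain-Python model of the documented semantics (hypothetical helper)."""
    # Drop None values and duplicates. dict.fromkeys keeps first-seen order,
    # which is an assumption of this sketch.
    distinct = list(dict.fromkeys(v for v in values if v is not None))
    if not distinct:
        # Mirrors Example 4: a column with only None values yields NULL.
        return None
    # A delimiter of None is modeled as an empty separator, as in Example 1.
    separator = "" if delimiter is None else delimiter
    return separator.join(distinct)


print(string_agg_distinct_sketch(["a", "b", None, "c", "b"]))        # abc
print(string_agg_distinct_sketch(["a", "b", None, "c", "b"], ", "))  # a, b, c
print(string_agg_distinct_sketch([None, None, None, None]))          # None
```

The three calls correspond to the inputs of Examples 1, 2, and 4 below.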

Examples

Example 1: Using the string_agg_distinct function.

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
df.select(sf.string_agg_distinct('strings')).show()
Output
+----------------------------------+
|string_agg(DISTINCT strings, NULL)|
+----------------------------------+
|                               abc|
+----------------------------------+

Example 2: Using the string_agg_distinct function with a delimiter.

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
df.select(sf.string_agg_distinct('strings', ', ')).show()
Output
+--------------------------------+
|string_agg(DISTINCT strings, , )|
+--------------------------------+
|                         a, b, c|
+--------------------------------+

Example 3: Using the string_agg_distinct function with a binary column and delimiter.

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([(b'\x01',), (b'\x02',), (None,), (b'\x03',), (b'\x02',)],
                           ['bytes'])
df.select(sf.string_agg_distinct('bytes', b'\x42')).show()
Output
+---------------------------------+
|string_agg(DISTINCT bytes, X'42')|
+---------------------------------+
|                 [01 42 02 42 03]|
+---------------------------------+
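
To see where the bytes in this output come from, the same join can be reproduced with Python's built-in bytes type. This is only an illustrative sketch: it does not call Spark, the variable names are made up for the example, and the bracketed hex rendering is imitated by hand.

```python
# Input values from Example 3, including the None and the duplicate.
values = [b"\x01", b"\x02", None, b"\x03", b"\x02"]
delimiter = b"\x42"

# Drop None values and duplicates (first-seen order is an assumption of this sketch).
distinct = list(dict.fromkeys(v for v in values if v is not None))

# Join the remaining byte strings with the delimiter byte between them.
joined = delimiter.join(distinct)

# Imitate the bracketed, space-separated hex rendering shown in the output table.
rendered = "[" + joined.hex(" ").upper() + "]"
print(rendered)  # [01 42 02 42 03]
```

The delimiter byte 0x42 appears between each pair of distinct values, which is why 42 shows up twice in the result.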

Example 4: Using the string_agg_distinct function on a column with all None values.

Python
import pyspark.sql.functions as sf
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("strings", StringType(), True)])
df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema)
df.select(sf.string_agg_distinct('strings')).show()
Output
+----------------------------------+
|string_agg(DISTINCT strings, NULL)|
+----------------------------------+
|                              NULL|
+----------------------------------+