min_by

Returns the value of col from the row in which ord is at its minimum. It is most often used with groupBy() to find, within each group, the col value that corresponds to the smallest ord value. The function is non-deterministic: when several rows share the same minimum ord value, any of their col values may be returned.
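As a rough mental model only, the per-group behavior can be sketched in plain Python. The helper name min_by_per_group and the dictionary-based rows are hypothetical and are not part of PySpark; the sketch says nothing about how Spark actually evaluates the aggregate.

```python
def min_by_per_group(rows, group_key, col, ord_col):
    """Hypothetical helper: for each group, return row[col] from the
    row whose row[ord_col] is smallest."""
    best = {}
    for row in rows:
        group = row[group_key]
        # Keep the row with the smallest ordering value seen so far.
        # The strict "<" keeps the first row on ties; Spark gives no
        # such guarantee.
        if group not in best or row[ord_col] < best[group][ord_col]:
            best[group] = row
    return {group: row[col] for group, row in best.items()}


rows = [
    {"course": "Java", "year": 2012, "earnings": 20000},
    {"course": "dotNET", "year": 2012, "earnings": 5000},
    {"course": "dotNET", "year": 2013, "earnings": 48000},
    {"course": "Java", "year": 2013, "earnings": 30000},
]

result = min_by_per_group(rows, "course", "year", "earnings")
print(result)
```

Here each course maps to the year in which its earnings were lowest.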

Syntax

Python
from pyspark.sql import functions as sf

sf.min_by(col, ord)

Parameters

Parameter	Type	Description
`col`	`pyspark.sql.Column` or column name	The column whose value is returned. Accepts either a Column instance or the column name as a string.
`ord`	`pyspark.sql.Column` or column name	The column whose minimum determines which row is selected. Accepts either a Column instance or the column name as a string.

Returns

pyspark.sql.Column: Column object that represents the value from col associated with the minimum value from ord.

Examples

Example 1: Using min_by with groupBy

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
    schema=("course", "year", "earnings"))
df.groupby("course").agg(sf.min_by("year", "earnings")).sort("course").show()

Output
+------+----------------------+
|course|min_by(year, earnings)|
+------+----------------------+
|  Java|                  2012|
|dotNET|                  2012|
+------+----------------------+

Example 2: Using min_by with different data types

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([
    ("Marketing", "Anna", 4), ("IT", "Bob", 2),
    ("IT", "Charlie", 3), ("Marketing", "David", 1)],
    schema=("department", "name", "years_in_dept"))
df.groupby("department").agg(
    sf.min_by("name", "years_in_dept")
).sort("department").show()

Output
+----------+---------------------------+
|department|min_by(name, years_in_dept)|
+----------+---------------------------+
|        IT|                        Bob|
| Marketing|                      David|
+----------+---------------------------+

Example 3: Using min_by with string values and a distinct minimum per group

Python
import pyspark.sql.functions as sf
df = spark.createDataFrame([
    ("Consult", "Eva", 6), ("Finance", "Frank", 5),
    ("Finance", "George", 9), ("Consult", "Henry", 7)],
    schema=("department", "name", "years_in_dept"))
df.groupby("department").agg(
    sf.min_by("name", "years_in_dept")
).sort("department").show()

Output
+----------+---------------------------+
|department|min_by(name, years_in_dept)|
+----------+---------------------------+
|   Consult|                        Eva|
|   Finance|                      Frank|
+----------+---------------------------+
