scalar
Return a Column object for a SCALAR Subquery containing exactly one row and one column.
Syntax
scalar()
Returns
Column: A Column object representing a SCALAR subquery.
Notes
Use scalar() to turn a DataFrame that yields a single value, typically the result of an aggregation or other single-value computation, into a Column. The returned Column can be used directly in select clauses or as a predicate in filters on the outer DataFrame, so the outer query can filter or compute against a value derived from the data itself.
Examples
Python
from pyspark.sql import functions as sf

# Assumes an active SparkSession bound to `spark`.
data = [
    (1, "Alice", 45000, 101), (2, "Bob", 54000, 101), (3, "Charlie", 29000, 102),
    (4, "David", 61000, 102), (5, "Eve", 48000, 101),
]
employees = spark.createDataFrame(data, ["id", "name", "salary", "department_id"])

# Uncorrelated scalar subquery: employees earning more than the overall average salary.
employees.where(
    sf.col("salary") > employees.select(sf.avg("salary")).scalar()
).select("name", "salary", "department_id").orderBy("name").show()
# +-----+------+-------------+
# | name|salary|department_id|
# +-----+------+-------------+
# |  Bob| 54000|          101|
# |David| 61000|          102|
# |  Eve| 48000|          101|
# +-----+------+-------------+
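# Correlated scalar subquery: employees earning more than their own department's average.
# outer() marks e1.department_id as a reference to the outer DataFrame.
employees.alias("e1").where(
    sf.col("salary")
    > employees.alias("e2").where(
        sf.col("e2.department_id") == sf.col("e1.department_id").outer()
    ).select(sf.avg("salary")).scalar()
).select("name", "salary", "department_id").orderBy("name").show()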
# +-----+------+-------------+
# | name|salary|department_id|
# +-----+------+-------------+
# |  Bob| 54000|          101|
# |David| 61000|          102|
# +-----+------+-------------+
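To make the difference between the two examples above concrete, the following plain-Python sketch reproduces their results without Spark. It only illustrates the semantics: the uncorrelated subquery yields one value shared by every outer row, while the correlated one is conceptually re-evaluated for each outer row. It says nothing about how Spark actually plans or executes the query, and the helper `dept_avg` is a hypothetical name introduced here for illustration.

```python
from statistics import mean

# Same rows as the examples above: (id, name, salary, department_id).
data = [
    (1, "Alice", 45000, 101), (2, "Bob", 54000, 101), (3, "Charlie", 29000, 102),
    (4, "David", 61000, 102), (5, "Eve", 48000, 101),
]

# Uncorrelated case: the subquery produces a single value used for every outer row.
overall_avg = mean(salary for _, _, salary, _ in data)
above_overall = sorted(name for _, name, salary, _ in data if salary > overall_avg)


def dept_avg(dept_id):
    """Hypothetical helper: average salary of the rows in one department."""
    return mean(salary for _, _, salary, dept in data if dept == dept_id)


# Correlated case: the subquery depends on the outer row's department_id.
above_dept = sorted(
    name for _, name, salary, dept in data if salary > dept_avg(dept)
)

print(above_overall)  # ['Bob', 'David', 'Eve']
print(above_dept)     # ['Bob', 'David']
```

Eve appears only in the first result: her salary of 48000 exceeds the overall average of 47400 but not the department 101 average of 49000.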