select

The select function selects the specified columns and returns them as a new DataFrame. This is similar to the SELECT statement in SQL.

Syntax:

  • select(df, "col", ...)
  • select(df, df$col)

Parameters:

  • df: Any SparkR DataFrame
  • col: Column to select from the SparkR DataFrame, given either as a column name string ("col") or as a Column object (df$col)

Output:

  • SparkR DataFrame

# Create SparkR DataFrame
df <- createDataFrame(sqlContext, airquality)
head(df)

Because select returns a SparkR DataFrame rather than a local R data.frame, we need a function such as head, take, or collect to view the result.

head(select(df, "Ozone"))
# Alternative R-like syntax for indicating df columns
head(select(df, df$Ozone))
# We're also able to select multiple columns
head(select(df, "Ozone", "Wind"))
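
The examples above only use head. As a minimal sketch, the same selection can also be viewed with take or collect. It assumes that a sqlContext is already available, as in the earlier example; the variable names ozone_wind, first_rows and local_df are only illustrative.

```r
library(SparkR)

# Assumption: sqlContext already exists, as in the example above
df <- createDataFrame(sqlContext, airquality)

# Still a distributed SparkR DataFrame at this point
ozone_wind <- select(df, "Ozone", "Wind")

# take() returns the first n rows as a local R data.frame
first_rows <- take(ozone_wind, 3)
first_rows

# collect() brings every row back to the driver as a local R data.frame,
# so it is best kept for results that are known to be small
local_df <- collect(ozone_wind)
nrow(local_df)
```

For a quick look at a large DataFrame, head or take is usually the safer choice, because collect pulls the whole result into local memory.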

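select is also useful for evaluating the Column objects returned by other functions. Called on its own, a function such as countDistinct returns a Column expression rather than a computed value; passing that Column to select computes the result.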

countDistinct(df$Ozone)
# Use select() to read Column COUNT(DISTINCT Ozone)
head(select(df, countDistinct(df$Ozone)))
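
The column produced this way gets a generated name such as COUNT(DISTINCT Ozone). The sketch below shows one possible way to give it a more readable name by wrapping the expression in SparkR's alias function before passing it to select. It again assumes an existing sqlContext, and the names distinct_ozone and result are chosen here purely for illustration.

```r
library(SparkR)

# Assumption: sqlContext already exists, as in the examples above
df <- createDataFrame(sqlContext, airquality)

# countDistinct() builds a Column expression; alias() attaches a readable name to it
distinct_ozone <- alias(countDistinct(df$Ozone), "distinct_ozone")

# select() evaluates the expression; collect() returns a one-row local data.frame
result <- collect(select(df, distinct_ozone))
names(result)
result
```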