join

The join function joins any two SparkDataFrames based on the given join expression. If no join expression is given, it performs a Cartesian join (https://en.wikipedia.org/wiki/Cartesian_product).

Syntax:

  • join(df1, df2, joinExpr, joinType)

Parameters:

  • df1: Any SparkDataFrame
  • df2: Any SparkDataFrame
  • joinExpr: Join Expression, Optional
  • joinType: Type of Join: “inner”, “outer”, “left_outer”, “right_outer”, “semijoin”. Default joinType: “inner”

Output:

  • SparkDataFrame

Example:

# Load the SparkR package
require(SparkR)

authors <- data.frame(surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
                      nationality = c("US", "Australia", "US", "UK", "Australia"),
                      deceased = c("yes", rep("no", 4)))

books <- data.frame(name = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"),
                    title = c("Exploratory Data Analysis", "Modern Applied Statistics ...", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
                              "Interactive Data Analysis", "An Introduction to R"))

# Create SparkDataFrame
authorsDF <- createDataFrame(authors)
booksDF <- createDataFrame(books)

head(authorsDF)
head(booksDF)

# Join authorsDF and booksDF with joinExpr: authorsDF$surname == booksDF$name
# No joinType is specified, so it defaults to an "inner" join
joinDF <- join(authorsDF, booksDF, authorsDF$surname == booksDF$name)
head(joinDF)

# Join authorsDF and booksDF with no joinExpr, which results in a Cartesian join
cartesianDF <- join(authorsDF, booksDF)
head(cartesianDF)

# Count the number of rows in cartesianDF
# Since this is a Cartesian join, this should return nrow(authorsDF) * nrow(booksDF), here 5 * 7 = 35
count(cartesianDF)
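
The examples above never pass the fourth argument, joinType. The following is a minimal sketch of how it might be used, assuming a SparkR 2.x environment in which sparkR.session() is available; the variable names innerDF and rightDF are illustrative and not part of the API. It rebuilds the same author and book data (without the deceased column) so that it can run on its own, then compares an "inner" join with a "right_outer" join on the same join expression.

```r
library(SparkR)

# Assumption: SparkR 2.x. sparkR.session() gets the existing Spark session
# or starts a new one; hosted notebooks usually provide one already.
sparkR.session()

authors <- data.frame(surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
                      nationality = c("US", "Australia", "US", "UK", "Australia"),
                      stringsAsFactors = FALSE)

books <- data.frame(name = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"),
                    title = c("Exploratory Data Analysis", "Modern Applied Statistics ...", "LISP-STAT",
                              "Spatial Statistics", "Stochastic Simulation",
                              "Interactive Data Analysis", "An Introduction to R"),
                    stringsAsFactors = FALSE)

authorsDF <- createDataFrame(authors)
booksDF <- createDataFrame(books)

# Inner join: keeps only the books whose name matches an author surname
innerDF <- join(authorsDF, booksDF, authorsDF$surname == booksDF$name, "inner")

# Right outer join: keeps every row of booksDF, matched or not
rightDF <- join(authorsDF, booksDF, authorsDF$surname == booksDF$name, "right_outer")

count(innerDF)
count(rightDF)
head(rightDF)
```

With this data, the inner join drops the "R Core" book because no author has that surname, while the right outer join keeps it and leaves the author columns empty for that row. The right outer result therefore has one more row than the inner result.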