merge

merge joins two SparkR DataFrames. It is similar to base R’s merge <https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/merge>__ function.

Note:

This is only available on Spark 1.6+

Syntax:

  • merge(x, y, ...)

Parameters:

  • x: Any Spark R DataFrame
  • y: Any Spark R DataFrame
  • by: Character vector specifying the join columns. If by is not specified, the common column names in x and y will be used
  • by.x: Character vector specifying the joining columns for x
  • by.y: Character vector specifying the joining columns for y
  • all.x: Boolean value indicating whether all rows in x should be included in the join
  • all.y: Boolean value indicating whether all rows in y should be included in the join
  • sort: Boolean value indicating whether the resulting columns should be sorted

Output:

  • SparkR DataFrame

Note:

  • Natural Join: set all.x and all.y to FALSE
  • Left Outer Join: set all.x to TRUE and all.y to FALSE
  • Right Outer Join: set all.x to FALSE and all.y to TRUE
  • Full Outer Join: set all.x and all.y to TRUE
authors <- data.frame(surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
                      nationality = c("US", "Australia", "US", "UK", "Australia"),
                      deceased = c("yes", rep("no", 4)))

books <- data.frame(name = c("Tukey", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"),
                    title = c("Exploratory Data Analysis", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
                              "Interactive Data Analysis", "An Introduction to R"))

# Create SparkR DataFrame
authorsDF <- createDataFrame(sqlContext, authors)
booksDF <- createDataFrame(sqlContext, books)

head(authorsDF)
head(booksDF)
# merge() performs a Cartesian Join if all.x and all.y are not set
cartesianDF <- merge(authorsDF, booksDF)

# Since this is a Cartesian Join, this should return nrow(authorsDF) * nrow(booksDF)
count(cartesianDF)
collect(cartesianDF)
# Perform Left Outer Join
leftJoinDF <- merge(authorsDF, booksDF, by.x = "surname", by.y = "name", all.x = TRUE, all.y = FALSE)
collect(leftJoinDF)