mergeΒΆ

merge joins two SparkDataFrames. It is similar to base R’s merge <http://stat.ethz.ch/R-manual/R-patched/library/base/html/merge.html>__ function.

Note:

This is only available on Spark 1.6+

Syntax:

  • merge(x, y, ...)

Parameters:

  • x: Any SparkDataFrame
  • y: Any SparkDataFrame
  • by: Character vector specifying the join columns. If by is not specified, the common column names in x and y will be used
  • by.x: Character vector specifying the joining columns for x
  • by.y: Character vector specifying the joining columns for y
  • all.x: Boolean value indicating whether all rows in x should be included in the join
  • all.y: Boolean value indicating whether all rows in y should be included in the join
  • sort: Boolean value indicating whether the resulting columns should be sorted

Output:

  • SparkDataFrame

Note:

  • Natural Join: set all.x and all.y to FALSE
  • Left Outer Join: set all.x to TRUE and all.y to FALSE
  • Right Outer Join: set all.x to FALSE and all.y to TRUE
  • Full Outer Join: set all.x and all.y to TRUE
require(SparkR)

authors <- data.frame(surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
                      nationality = c("US", "Australia", "US", "UK", "Australia"),
                      deceased = c("yes", rep("no", 4)))

books <- data.frame(name = c("Tukey", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"),
                    title = c("Exploratory Data Analysis", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
                              "Interactive Data Analysis", "An Introduction to R"))

# Create SparkDataFrame
authorsDF <- createDataFrame(authors)
booksDF <- createDataFrame(books)

head(authorsDF)
head(booksDF)
# merge() performs a Cartesian Join if all.x and all.y are not set
cartesianDF <- merge(authorsDF, booksDF)

# Since this is a Cartesian Join, this should return nrow(authorsDF) * nrow(booksDF)
count(cartesianDF)
collect(cartesianDF)
# Perform Left Outer Join
leftJoinDF <- merge(authorsDF, booksDF, by.x = "surname", by.y = "name", all.x = TRUE, all.y = FALSE)
collect(leftJoinDF)