except

except returns a SparkDataFrame containing the rows that are in the first SparkDataFrame but not in the second. It is equivalent to EXCEPT DISTINCT in SQL, so a row that occurs several times in the first SparkDataFrame appears only once in the result.

This is useful for splitting a dataset into training and test sets: sample the training set first, then take the remaining rows as the test set.
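To make the set-difference behavior concrete, here is a minimal sketch on two small SparkDataFrames. The data, the names df1, df2, t1 and t2, and the explicit sparkR.session() call are assumptions made for illustration; a Databricks notebook already provides a session.

```r
library(SparkR)
sparkR.session()  # assumption: no Spark session is active yet

# Two small, hypothetical SparkDataFrames with the same columns
df1 <- createDataFrame(data.frame(id = c(1, 2, 2, 3),
                                  label = c("a", "b", "b", "c"),
                                  stringsAsFactors = FALSE))
df2 <- createDataFrame(data.frame(id = 3, label = "c",
                                  stringsAsFactors = FALSE))

# Rows of df1 that do not appear in df2.
# The duplicated (2, "b") row is returned only once.
result <- except(df1, df2)
collect(arrange(result, "id"))

# The same query expressed in SQL on temporary views
createOrReplaceTempView(df1, "t1")
createOrReplaceTempView(df2, "t2")
collect(sql("SELECT * FROM t1 EXCEPT SELECT * FROM t2"))
```

Both calls should return the (1, "a") and (2, "b") rows; only the except() result is sorted here, so the row order of the SQL result may differ.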

Syntax:

  • except(df1, df2)

Parameters:

  • df1: A SparkDataFrame. The result is drawn from the rows of this SparkDataFrame
  • df2: A SparkDataFrame. Rows that also appear in this SparkDataFrame are excluded from the result

Output:

  • SparkDataFrame

Example:

require(SparkR)

# Read diamonds.csv dataset as SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "csv", header = "true", inferSchema = "true")
head(diamonds)

# Count number of rows in the diamonds dataset
nrow(diamonds)

# Sample roughly 70% of the rows, without replacement, as the training set
trainingData <- sample(diamonds, FALSE, 0.7)

# Count number of rows in trainingData
nrow(trainingData)
# Use except() to create a new SparkDataFrame consisting of the rows
# in the diamonds dataset that are not in trainingData
testData <- except(diamonds, trainingData)

# Count number of rows in testData
nrow(testData)
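
A split built this way can be sanity-checked by confirming that the two parts share no rows and together cover the original data. The sketch below does this on a small synthetic SparkDataFrame (the name df and its columns are hypothetical stand-ins for the diamonds data) and fixes the sampling seed so that repeated evaluations pick the same rows. The second check holds only because every row of df is unique; with duplicate rows, except() would return fewer rows.

```r
library(SparkR)
sparkR.session()  # assumption: no Spark session is active yet

# Hypothetical stand-in dataset: 1000 unique rows
df <- createDataFrame(data.frame(id = 1:1000, value = (1:1000) * 0.5))

# Sample roughly 70% of the rows with a fixed seed, then take the rest
trainingData <- sample(df, FALSE, 0.7, seed = 42)
testData <- except(df, trainingData)

# No row should be in both parts
stopifnot(nrow(intersect(trainingData, testData)) == 0)

# Every row should be in exactly one part (requires unique rows in df)
stopifnot(nrow(trainingData) + nrow(testData) == nrow(df))
```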