except

except returns a DataFrame containing the rows that appear in the first DataFrame but not in the second. Rows are compared across all columns, so both DataFrames should have the same schema. This is equivalent to the EXCEPT query in SQL.

A common use is splitting a dataset into training and test sets: sample the training rows first, then use except to keep the remaining rows as the test set.

Syntax:

  • except(df1, df2)

Parameters:

  • df1: Any SparkR DataFrame. The result is drawn from the rows of this DataFrame.
  • df2: Any SparkR DataFrame. Rows that also appear in this DataFrame are excluded from the result.

Output:

  • SparkR DataFrame
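
The behavior described above can be sketched on toy data before moving to a real dataset. This is a minimal illustration, not part of the original example: it assumes a SparkR 1.x session in which `sqlContext` is already defined (as in the diamonds example that follows), and the names `left`, `right`, and `onlyInLeft` are hypothetical.

```r
library(SparkR)

# Assumption: `sqlContext` already exists, as it does in a Databricks notebook.
# Two small DataFrames with the same schema (a single integer column `id`).
left  <- createDataFrame(sqlContext, data.frame(id = c(1L, 2L, 3L, 4L)))
right <- createDataFrame(sqlContext, data.frame(id = c(3L, 4L, 5L)))

# Keep the rows of `left` that do not appear in `right`.
onlyInLeft <- except(left, right)

# Bring the result back to a local R data.frame for inspection.
# It should hold the ids 1 and 2; row order is not guaranteed.
collect(onlyInLeft)
```

Note that id 5, which is only in `right`, never appears in the result: except is not symmetric, so swapping the arguments gives a different DataFrame.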

Example:

# Read the diamonds.csv dataset as a SparkR DataFrame
diamonds <- read.df(sqlContext, "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
head(diamonds)

# Count the number of rows in the diamonds dataset
nrow(diamonds)

# Sample roughly 70% of the rows, without replacement, as the training set
trainingData <- sample(diamonds, FALSE, 0.7)

# Count the number of rows in trainingData
nrow(trainingData)

# Use except() to create a new DataFrame of the rows in diamonds that are not in trainingData
testData <- except(diamonds, trainingData)

# Count the number of rows in testData
nrow(testData)
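
As a follow-up, the three counts can be compared to check that the split looks reasonable. This is a sketch only: it continues the example above, so it assumes `diamonds`, `trainingData`, and `testData` already exist, and the variable names introduced here are hypothetical. The sampling fraction is approximate, so the training share will be close to 0.7 rather than exactly 0.7. If except treats rows as a set and the data contains exact duplicate rows, the last figure may also be nonzero.

```r
# Assumption: diamonds, trainingData, and testData come from the example above.
totalRows <- nrow(diamonds)
trainRows <- nrow(trainingData)
testRows  <- nrow(testData)

# Share of the original rows that landed in each split
cat("training share:", trainRows / totalRows, "\n")
cat("test share:    ", testRows / totalRows, "\n")

# Rows of the original dataset that are in neither split
cat("rows unaccounted for:", totalRows - trainRows - testRows, "\n")
```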