except returns a SparkDataFrame consisting of the rows that are in the first specified SparkDataFrame but not in the second. This is equivalent to EXCEPT DISTINCT in SQL, so duplicate rows are removed from the result.
This is useful for splitting datasets into training and test sets.
- except(df1, df2)
- df1: Any SparkDataFrame. The result contains rows taken from this SparkDataFrame.
- df2: Any SparkDataFrame. Rows that also appear in this SparkDataFrame are excluded from the result.
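To make the semantics concrete, here is a minimal sketch on toy data. The column name id and the values are illustrative assumptions, not part of the diamonds example; sparkR.session() is called so the snippet also works outside a notebook that already provides a session.

```r
library(SparkR)
sparkR.session()

# Toy inputs (hypothetical data): df1 contains a duplicated row (id = 2)
df1 <- createDataFrame(data.frame(id = c(1, 2, 2, 3)))
df2 <- createDataFrame(data.frame(id = c(3)))

# Rows of df1 that do not appear in df2; duplicates are dropped
result <- except(df1, df2)

# Expect two rows (id 1 and id 2), since except() has DISTINCT semantics
nrow(result)
```

If duplicates need to be preserved, SparkR 2.4.0 and later also provide exceptAll(), which corresponds to EXCEPT ALL in SQL.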
require(SparkR)
# Read diamonds.csv dataset as SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")
head(diamonds)
# Count number of rows in the diamonds dataset
nrow(diamonds)
# Sample roughly 70% of the original dataset without replacement (the fraction is approximate)
trainingData <- sample(diamonds, FALSE, 0.7)
# Count number of rows in trainingData
nrow(trainingData)
# Use except() to create a new SparkDataFrame consisting of rows in the diamonds dataset that are not in trainingData
testData <- except(diamonds, trainingData)
# Count number of rows in testData
nrow(testData)
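As a sanity check on a split like the one above, the training and test counts should add up to the original row count, provided every row in the source is unique. The following is a rough sketch on toy data rather than the diamonds file; the 100-row data frame, the seed value 42, and the variable names are assumptions chosen for illustration.

```r
library(SparkR)
sparkR.session()

# Toy dataset (hypothetical): 100 unique rows
df <- createDataFrame(data.frame(id = 1:100))

# Fixing the seed makes the sample reproducible;
# caching avoids recomputing it for each action
train <- sample(df, withReplacement = FALSE, fraction = 0.7, seed = 42)
cache(train)

# Everything not sampled into train becomes the test set
test <- except(df, train)

# With unique rows, the two parts should partition the original data
nrow(train) + nrow(test) == nrow(df)
```

When the source contains duplicate rows, this equality can fail, because except() removes duplicates and also drops every copy of a row that appears in the training set.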