except returns a DataFrame consisting of rows that are in the first specified DataFrame, but not in the second. This is equivalent to the EXCEPT query in SQL.
This is useful for splitting a dataset into training and test sets: sample the training rows first, then use except to take the remaining rows as the test set.
- except(df1, df2)
- df1: Any SparkR DataFrame. The rows you want to keep come from this DataFrame.
- df2: Any SparkR DataFrame. Rows that also appear in this DataFrame are excluded from the result.
- Returns: SparkR DataFrame
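A minimal sketch of the behavior described above, using two small hypothetical DataFrames (`left` and `right` are illustrative names, not part of the API). It assumes a SparkR 1.x session in which `sqlContext` already exists, as in a Databricks notebook, and that both DataFrames share the same schema.

```r
library(SparkR)

# Hypothetical toy data; sqlContext is assumed to be provided by the notebook.
left <- createDataFrame(sqlContext,
                        data.frame(id = c(1, 2, 3),
                                   label = c("a", "b", "c"),
                                   stringsAsFactors = FALSE))
right <- createDataFrame(sqlContext,
                         data.frame(id = 2,
                                    label = "b",
                                    stringsAsFactors = FALSE))

# Keep the rows of left that do not appear in right
result <- except(left, right)

# Bring the result back to a local R data.frame, sorted by id for readability
collect(arrange(result, result$id))

# Two rows remain: (1, "a") and (3, "c")
nrow(result)
```

A row is removed only when it matches a row in the second DataFrame on every column, so a row with id 2 and a different label would be kept.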
# Read diamonds.csv dataset as SparkR DataFrame
diamonds <- read.df(sqlContext, "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
head(diamonds)

# Count number of rows in the diamonds dataset
nrow(diamonds)

# Create a 0.7 sample of the original dataset
trainingData <- sample(diamonds, FALSE, 0.7)
# Count number of rows in trainingData
nrow(trainingData)

# Use except() to create a new DataFrame consisting of rows in the diamonds dataset that are not in trainingData
testData <- except(diamonds, trainingData)
# Count number of rows in testData
nrow(testData)
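The split above can be made repeatable and sanity-checked, as in the sketch below. It assumes `diamonds` has been loaded as in the example above; the seed value and the variable names `nTotal`, `nTrain`, and `nTest` are arbitrary choices for illustration.

```r
# Assumes `diamonds` was created with read.df() as shown above.
seed <- 42  # arbitrary illustrative value; any fixed integer works

# Passing a seed as the fourth argument makes the sample repeatable across runs
trainingData <- sample(diamonds, FALSE, 0.7, seed)
testData <- except(diamonds, trainingData)

# Compare the sizes of the two parts with the size of the original dataset
nTotal <- nrow(diamonds)
nTrain <- nrow(trainingData)
nTest <- nrow(testData)
nTrain + nTest == nTotal
```

The final comparison is expected to be TRUE only when every row of the dataset is distinct. Because except matches on whole rows, a dataset with duplicate rows can yield a test set smaller than the number of rows left out of the sample. Note also that the fraction is approximate, so `nTrain` will generally be close to, but not exactly, 70% of `nTotal`.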