glm fits a Generalized Linear Model, similar to R’s glm().
Note: If you are planning to use a string column as your label, ensure that you are running Spark 1.6+ or glm might throw an error.
Syntax: - glm(formula, data, family...)
Parameters: - formula: Symbolic description of model to be fitted, for eg: ResponseVariable ~ Predictor1 + Predictor2. Currently supported operators: ‘~’, ‘+’, ‘-‘, and ‘.’ - data: Any SparkDataFrame - family: String, “gaussian” for Linear Regression, or “binomial” for Logistic Regression - lambda: Numeric, Regularization parameter - alpha: Numeric, Elastic-net mixing parameter
Output: - MLlib PipelineModel
require(SparkR) # Read diamonds.csv dataset as SparkDataFrame diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "com.databricks.spark.csv", header="true", inferSchema = "true") diamonds <- withColumnRenamed(diamonds, "", "rowID") # Split data into Training set and Test set trainingData <- sample(diamonds, FALSE, 0.7) testData <- except(diamonds, trainingData) # Exclude rowIDs trainingData <- trainingData[, -1] testData <- testData[, -1] print(count(diamonds)) print(count(trainingData)) print(count(testData))
Training a Linear Regression model using glm()¶
We will try to predict a diamond’s price from its features. We will do this by training a Linear Regression model using the training data.
Note that we have a mix of categorical features (for eg: cut - Ideal, Premium, Very Good...) and continuous features (for eg: depth, carat). Under the hood, SparkR automatically performs one-hot encoding of such features so that it does not have to be done manually.
# Indicate family = "gaussian" to train a linear regression model lrModel <- glm(price ~ ., data = trainingData, family = "gaussian") # Print a summary of trained linear regression model summary(lrModel)
We will use predict() on the test data to see how well our model works on new data.
Syntax for predict(): - predict(model, newData)
Parameters: - model: MLlib model - newData: SparkDataFrame, typically your test set
Output: - SparkDataFrame
# Generate predictions using the trained Linear Regression model predictions <- predict(lrModel, newData = testData) # View predictions against mpg column display(select(predictions, "price", "prediction"))
Let’s evaluate our models to see how it performed.
errors <- select(predictions, predictions$price, predictions$prediction, alias(predictions$price - predictions$prediction, "error")) display(errors)
# Calculate RMSE head(select(errors, alias(sqrt(sum(errors$error^2 , na.rm = TRUE) / nrow(errors)), "RMSE")))
Training a Logistic Regression model using glm()¶
We can create a Logistic Regression on the same dataset. Let’s see if we can predict a diamond’s cut based on some of its features.
As of Spark 1.6, Logistic Regression in MLlib only supports binary classification. To test out the algorithm with our dataset in this example, we will subset our data such that we are able to work with only 2 labels.
# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good" trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good")) testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))
# Indicate family = "binomial" to train a logistic regression model logrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = "binomial") # Print summary of Logistic Regression model # Note: This only works in Spark 1.6+ summary(logrModel)
# Generate predictions using the trained Linear Regression model predictionsLogR <- predict(logrModel, newData = testDataSub) # View predictions against label column display(select(predictionsLogR, "label", "prediction"))
# Evaluate Logistic Regression model errorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction, alias(abs(predictionsLogR$label - predictionsLogR$prediction), "error")) display(errorsLogR)