Using glm

glm fits a Generalized Linear Model, similar to R’s glm().

Note: If you are planning to use a string column as your label, ensure that you are running Spark 1.6+ or glm might throw an error.

Syntax:
- glm(formula, data, family, ...)

Parameters:
- formula: Symbolic description of the model to be fitted, e.g. ResponseVariable ~ Predictor1 + Predictor2. Currently supported operators: ‘~’, ‘+’, ‘-’, and ‘.’
- data: Any SparkDataFrame
- family: String, “gaussian” for Linear Regression or “binomial” for Logistic Regression
- lambda: Numeric, regularization parameter
- alpha: Numeric, elastic-net mixing parameter

Output:
- MLlib PipelineModel
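
As a quick illustration of the full signature, a regularized fit might look like the sketch below; df, y, x1, and x2 are hypothetical placeholders for your own SparkDataFrame and columns, and the lambda and alpha values are arbitrary.

# Sketch only: elastic-net regularized linear regression on a hypothetical SparkDataFrame df
model <- glm(y ~ x1 + x2, data = df, family = "gaussian", lambda = 0.3, alpha = 0.5)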

require(SparkR)

# Read diamonds.csv dataset as SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                  source = "com.databricks.spark.csv", header="true", inferSchema = "true")
# The first CSV column (the unnamed row index) is renamed to "rowID"
diamonds <- withColumnRenamed(diamonds, "", "rowID")

# Split data into Training set and Test set
trainingData <- sample(diamonds, FALSE, 0.7)
testData <- except(diamonds, trainingData)

# Exclude rowIDs
trainingData <- trainingData[, -1]
testData <- testData[, -1]

print(count(diamonds))
print(count(trainingData))
print(count(testData))
head(trainingData)
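
To confirm the column types that glm() will see (strings for categorical features, doubles for continuous ones), you can also inspect the schema; printSchema() is a standard SparkR function.

# Inspect the inferred schema of the training set
printSchema(trainingData)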

Training a Linear Regression model using glm()

We will try to predict a diamond’s price from its features by training a Linear Regression model on the training data.

Note that we have a mix of categorical features (e.g. cut: Ideal, Premium, Very Good, ...) and continuous features (e.g. depth, carat). Under the hood, SparkR automatically performs one-hot encoding of categorical features, so it does not have to be done manually.

# Indicate family = "gaussian" to train a linear regression model
lrModel <- glm(price ~ ., data = trainingData, family = "gaussian")

# Print a summary of trained linear regression model
summary(lrModel)
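
The summary contains one coefficient per one-hot encoded level of each categorical feature (cut, color, clarity). If you want to work with the estimates programmatically, the coefficient table can be pulled out of the summary object; this is a sketch assuming the Spark 1.6 behavior where summary() on a SparkR glm model returns a list with a $coefficients element.

# Extract the coefficient table from the model summary (Spark 1.6 behavior assumed)
coefs <- summary(lrModel)$coefficients
print(coefs)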

We will use predict() on the test data to see how well our model works on new data.

Syntax for predict():
- predict(model, newData)

Parameters:
- model: MLlib model
- newData: SparkDataFrame, typically your test set

Output:
- SparkDataFrame

# Generate predictions using the trained Linear Regression model
predictions <- predict(lrModel, newData = testData)

# View predictions against the price column
display(select(predictions, "price", "prediction"))

Let’s evaluate our model to see how it performed.

errors <- select(predictions, predictions$price, predictions$prediction, alias(predictions$price - predictions$prediction, "error"))
display(errors)
# Calculate RMSE
head(select(errors, alias(sqrt(sum(errors$error^2, na.rm = TRUE) / nrow(errors)), "RMSE")))
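
If you want a metric that is less sensitive to large individual errors, a mean absolute error can be computed in the same style; this is a sketch that reuses the errors SparkDataFrame above together with the SparkR column functions avg() and abs().

# Sketch: Mean Absolute Error over the same errors SparkDataFrame
head(select(errors, alias(avg(abs(errors$error)), "MAE")))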

Training a Logistic Regression model using glm()

We can train a Logistic Regression model on the same dataset. Let’s see if we can predict a diamond’s cut based on some of its features.

As of Spark 1.6, Logistic Regression in MLlib only supports binary classification. To try out the algorithm on our dataset, we will subset the data so that we work with only 2 labels.

# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good"
trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good"))
testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))
# Indicate family = "binomial" to train a logistic regression model
logrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = "binomial")

# Print summary of Logistic Regression model
# Note: This only works in Spark 1.6+
summary(logrModel)
# Generate predictions using the trained Logistic Regression model
predictionsLogR <- predict(logrModel, newData = testDataSub)

# View predictions against label column
display(select(predictionsLogR, "label", "prediction"))
# Evaluate Logistic Regression model
errorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction, alias(abs(predictionsLogR$label - predictionsLogR$prediction), "error"))
display(errorsLogR)
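
Because error is 0 for a correct prediction and 1 for an incorrect one, averaging it gives the misclassification rate on the test subset; this is a sketch using the SparkR column function avg().

# Sketch: fraction of misclassified rows in the test subset
head(select(errorsLogR, alias(avg(errorsLogR$error), "error_rate")))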