glm fits a Generalized Linear Model, similar to R’s glm().
Syntax: glm(formula, data, family, ...)

Parameters:
- formula: Symbolic description of the model to be fitted, for example: ResponseVariable ~ Predictor1 + Predictor2. Supported operators: ~, +, -, and .
- data: Any SparkDataFrame
- family: String, "gaussian" for linear regression or "binomial" for logistic regression
- lambda: Numeric, regularization parameter
- alpha: Numeric, elastic-net mixing parameter

Output: MLlib PipelineModel
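For instance, a minimal call looks like the following (a sketch: df stands for any SparkDataFrame with numeric price and carat columns, as in the dataset used below):

# Hypothetical minimal example: linear regression of price on carat
model <- glm(price ~ carat, data = df, family = "gaussian")
summary(model)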
This tutorial shows how to perform linear and logistic regression on the diamonds dataset.
Load diamonds data and split into training and test sets
require(SparkR)

# Read diamonds.csv dataset as SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
diamonds <- withColumnRenamed(diamonds, "", "rowID")

# Split data into Training set and Test set
trainingData <- sample(diamonds, FALSE, 0.7)
testData <- except(diamonds, trainingData)

# Exclude rowIDs
trainingData <- trainingData[, -1]
testData <- testData[, -1]

print(count(diamonds))
print(count(trainingData))
print(count(testData))
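As an aside, the sample()/except() split above can also be done in a single call with randomSplit() (available in SparkR 2.0 and later); a minimal sketch, with the seed value chosen arbitrarily:

# Alternative split using randomSplit(); returns a list of SparkDataFrames
splits <- randomSplit(diamonds, weights = c(0.7, 0.3), seed = 42)
trainingData <- splits[[1]]
testData <- splits[[2]]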
Train a linear regression model using glm()
This section shows how to predict a diamond’s price from its features by training a linear regression model using the training data.
The dataset contains a mix of categorical features (cut: Ideal, Premium, Very Good, …) and continuous features (depth, carat). Under the hood, SparkR automatically performs one-hot encoding of categorical features, so it does not have to be done manually.
# Family = "gaussian" to train a linear regression model
lrModel <- glm(price ~ ., data = trainingData, family = "gaussian")

# Print a summary of the trained model
summary(lrModel)
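One way to see the automatic one-hot encoding mentioned above is to inspect the coefficient names in the summary: each level of a categorical column appears as its own dummy feature. A sketch, assuming summary() returns a list whose coefficients matrix carries feature names as row names (as in recent SparkR versions):

# Each level of a categorical feature (cut, color, clarity, ...) gets its own coefficient
coefs <- summary(lrModel)$coefficients
print(rownames(coefs))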
Use predict() on the test data to see how well the model works on new data.

Syntax: predict(model, newData)

Parameters:
- model: MLlib model
- newData: SparkDataFrame, typically your test set
# Generate predictions using the trained model
predictions <- predict(lrModel, newData = testData)

# View predictions against the price column
display(select(predictions, "price", "prediction"))
Evaluate the model.
errors <- select(predictions, predictions$price, predictions$prediction,
                 alias(predictions$price - predictions$prediction, "error"))
display(errors)

# Calculate RMSE
head(select(errors, alias(sqrt(sum(errors$error^2, na.rm = TRUE) / nrow(errors)), "RMSE")))
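The last line computes RMSE = sqrt(mean(error^2)). Other aggregate metrics follow the same pattern; for example, a sketch of mean absolute error over the same errors DataFrame:

# Mean absolute error on the test set
head(select(errors, alias(avg(abs(errors$error)), "MAE")))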
Train a logistic regression model using glm()
This section shows how to build a logistic regression model on the same dataset to predict a diamond’s cut based on some of its features.
Logistic regression in MLlib supports only binary classification. To test the algorithm in this example, subset the data to work with only two labels.
# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good"
trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good"))
testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))
# Family = "binomial" to train a logistic regression model
logrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = "binomial")

# Print summary of the trained model
summary(logrModel)
# Generate predictions using the trained model
predictionsLogR <- predict(logrModel, newData = testDataSub)

# View predictions against label column
display(select(predictionsLogR, "label", "prediction"))
Evaluate the model.
errorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction,
                     alias(abs(predictionsLogR$label - predictionsLogR$prediction), "error"))
display(errorsLogR)
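Because label and prediction are both 0/1 here, the error column is 1 exactly when a row is misclassified, so the misclassification rate (and hence accuracy, as 1 minus that rate) can be read off directly; a sketch:

# Misclassification rate = mean(error); accuracy is 1 minus this value
head(select(errorsLogR, alias(avg(errorsLogR$error), "errorRate")))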