The Spark MLlib Pipelines API is a higher-level API built on top of DataFrames for constructing ML pipelines. You can read more about the Pipelines API in the MLlib programming guide.
Binary Classification is the task of predicting a binary label. For example, is an email spam or not spam? Should I show this ad to this user or not? Will it rain tomorrow or not? This notebook illustrates algorithms for making these types of predictions.
The Adult dataset is publicly available at the UCI Machine Learning Repository. It derives from census data and consists of information about 48,842 individuals and their annual income. You can use this information to predict whether an individual earns <=50K or >50K a year. The dataset contains both numeric and categorical variables.
Attribute Information:
- age: continuous
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
- fnlwgt: continuous
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc...
- education-num: continuous
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent...
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners...
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany...
Target/Label: <=50K, >50K
Preprocess Data
To use algorithms like Logistic Regression, you must first convert the categorical variables in the dataset into numeric variables. There are two ways to do this.
Category Indexing
This assigns each category a numeric value from {0, 1, 2, ..., numCategories - 1}. This introduces an implicit ordering among your categories and is more suitable for ordinal variables (for example, Poor: 0, Average: 1, Good: 2).
One-Hot Encoding
This converts categories into binary vectors with at most one nonzero value (for example, Blue: [1, 0], Green: [0, 1], Red: [0, 0]).
This notebook uses a combination of StringIndexer and, depending on your Spark version, either OneHotEncoder or OneHotEncoderEstimator to convert the categorical variables. Both encoders return a SparseVector.
Since there is more than one stage of feature transformations, use a Pipeline to tie the stages together. This simplifies the code.
Fit and Evaluate Models
Now, try out some of the Binary Classification algorithms available in the Pipelines API.
Of these algorithms, the following also support multiclass classification with the Python API:
- Decision Tree Classifier
- Random Forest Classifier
These are the general steps to build the models:
- Create an initial model using the training set
- Tune parameters with a ParamGrid and 5-fold cross-validation
- Evaluate the best model obtained from cross-validation using the test set
Use the BinaryClassificationEvaluator to evaluate the models; it uses areaUnderROC as its default metric.
Logistic Regression
You can read more about Logistic Regression from the classification and regression section of MLlib Programming Guide. In the Pipelines API, you can now perform Elastic-Net Regularization with Logistic Regression, as well as other linear methods.
Decision Trees
You can read more about Decision Trees in the Spark MLlib Programming Guide. The Decision Tree algorithm is popular because it handles categorical data and works out of the box with multiclass classification tasks.
Random Forest
Random Forests use an ensemble of decision trees to improve model accuracy. You can read more about Random Forests in the classification and regression section of the MLlib Programming Guide.