Binary Classification Example

The Spark MLlib Pipelines API provides a higher-level API built on top of DataFrames for constructing ML pipelines. You can read more about the Pipelines API in the MLlib programming guide.

Binary Classification is the task of predicting a binary label. For example, is an email spam or not spam? Should I show this ad to this user or not? Will it rain tomorrow or not? This notebook illustrates algorithms for making these types of predictions.

Dataset Review

The Adult dataset is publicly available at the UCI Machine Learning Repository. It is derived from census data and contains information about 48,842 individuals and their annual income. You can use this information to predict whether an individual earns <=50K or >50K a year. The dataset consists of both numeric and categorical variables.

Attribute Information:

  • age: continuous
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
  • fnlwgt: continuous
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc...
  • education-num: continuous
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent...
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners...
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
  • sex: Female, Male
  • capital-gain: continuous
  • capital-loss: continuous
  • hours-per-week: continuous
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany...

Target/Label: <=50K, >50K

Load Data

The Adult dataset is available in databricks-datasets. Read in the data using the CSV data source for Spark and rename the columns appropriately.
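
For example, a minimal sketch of the load step, assuming a Databricks session (where a SparkSession named spark is predefined) and the dataset path /databricks-datasets/adult/adult.data:

    cols = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week", "native_country",
            "income"]

    # Read the raw CSV and rename the 15 columns; inferSchema types the numeric fields.
    dataset = (spark.read
               .format("csv")
               .option("inferSchema", True)
               .load("/databricks-datasets/adult/adult.data")  # assumed path
               .toDF(*cols))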

Preprocess Data

To use algorithms like Logistic Regression, you must first convert the categorical variables in the dataset into numeric variables. There are two ways to do this.

  • Category Indexing

    This assigns a numeric value from {0, 1, 2, ..., numCategories - 1} to each category. This introduces an implicit ordering among the categories and is more suitable for ordinal variables (e.g., Poor: 0, Average: 1, Good: 2).

  • One-Hot Encoding

    This converts categories into binary vectors with at most one nonzero value (e.g., Blue: [1, 0], Green: [0, 1], Red: [0, 0]).

This notebook uses a combination of StringIndexer and, depending on your Spark version, either OneHotEncoder (Spark 3.0 and later) or OneHotEncoderEstimator (Spark 2.3 and 2.4) to convert the categorical variables. Both return a SparseVector.

Since there is more than one stage of feature transformations, use a Pipeline to tie the stages together. This simplifies the code.
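
A sketch of these stages, assuming Spark 3.x (where OneHotEncoder replaced OneHotEncoderEstimator) and the column names used in the load step above:

    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    categorical_cols = ["workclass", "education", "marital_status", "occupation",
                        "relationship", "race", "sex", "native_country"]

    stages = []  # Pipeline stages, filled in as we go
    for col in categorical_cols:
        # Index each categorical column, then one-hot encode the index.
        indexer = StringIndexer(inputCol=col, outputCol=col + "_index")
        encoder = OneHotEncoder(inputCols=[indexer.getOutputCol()],
                                outputCols=[col + "_vec"])
        stages += [indexer, encoder]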

The code above indexes each categorical column using StringIndexer, then converts the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row.

Use the StringIndexer again to encode labels to label indices.
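
Continuing the sketch, assuming the label column is named income:

    label_indexer = StringIndexer(inputCol="income", outputCol="label")
    stages += [label_indexer]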

Use a VectorAssembler to combine all the feature columns into a single vector column. This includes both the numeric columns and the one-hot encoded binary vector columns in the dataset.
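
For example, continuing with the column names assumed above:

    from pyspark.ml.feature import VectorAssembler

    numeric_cols = ["age", "fnlwgt", "education_num",
                    "capital_gain", "capital_loss", "hours_per_week"]
    assembler = VectorAssembler(
        inputCols=[c + "_vec" for c in categorical_cols] + numeric_cols,
        outputCol="features")
    stages += [assembler]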

Run the stages as a Pipeline. This puts the data through all of the feature transformations in a single call.
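
A sketch of running the pipeline; the 70/30 train/test split here is an assumption, added for the model evaluations below:

    from pyspark.ml import Pipeline

    pipeline = Pipeline(stages=stages)
    pipeline_model = pipeline.fit(dataset)
    prepped = pipeline_model.transform(dataset)

    # Hold out a test set for evaluating the models below (split ratio assumed).
    train, test = prepped.randomSplit([0.7, 0.3], seed=42)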

Fit and Evaluate Models

Now, try out some of the binary classification algorithms available in the Pipelines API.

Of these algorithms, the following also support multiclass classification with the Python API:

  • Decision Tree Classifier
  • Random Forest Classifier

These are the general steps to build the models:

  • Create an initial model using the training set
  • Tune parameters with a ParamGrid and 5-fold cross-validation
  • Evaluate the best model obtained from the cross-validation using the test set

Use the BinaryClassificationEvaluator, which uses areaUnderROC as its default metric, to evaluate the models.

Logistic Regression

You can read more about logistic regression in the classification and regression section of the MLlib Programming Guide. In the Pipelines API, you can perform Elastic-Net regularization with logistic regression, as well as with other linear methods.
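
A minimal sketch of fitting a model, assuming the train/test split and column names from above:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
    lr_model = lr.fit(train)
    predictions = lr_model.transform(test)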

Use BinaryClassificationEvaluator to evaluate the model.

Note that the default metric for the BinaryClassificationEvaluator is areaUnderROC.

The evaluator accepts two metrics: areaUnderROC and areaUnderPR. Switch to areaUnderPR with evaluator.setMetricName("areaUnderPR").
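
For example, computing both metrics on the predictions above:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
    print(evaluator.evaluate(predictions))   # areaUnderROC (the default)

    evaluator.setMetricName("areaUnderPR")
    print(evaluator.evaluate(predictions))   # areaUnderPR
    evaluator.setMetricName("areaUnderROC")  # switch back for the tuning below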

Now, tune the model using ParamGridBuilder and CrossValidator.

You can use explainParams() to print a list of all parameters and their definitions.

With three values for regParam, three values for maxIter, and two values for elasticNetParam, the grid includes 3 x 3 x 2 = 18 parameter settings for CrossValidator.
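
A sketch of the tuning step; the grid values here are assumptions chosen to match those counts:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    print(lr.explainParams())  # list every parameter and its documentation

    param_grid = (ParamGridBuilder()
                  .addGrid(lr.regParam, [0.01, 0.5, 2.0])      # assumed values
                  .addGrid(lr.maxIter, [1, 5, 10])             # assumed values
                  .addGrid(lr.elasticNetParam, [0.0, 0.5])     # assumed values
                  .build())

    cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                        evaluator=evaluator, numFolds=5)
    cv_model = cv.fit(train)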

You can also access the model's feature weights (coefficients) and intercept.
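
For example, on the best model from the cross-validation:

    best_lr = cv_model.bestModel
    print(best_lr.coefficients)  # one weight per entry in the features vector
    print(best_lr.intercept)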

Decision Trees

You can read more about decision trees in the Spark MLlib Programming Guide. The decision tree algorithm is popular because it handles categorical data and works out of the box with multiclass classification tasks.
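
A minimal sketch, reusing the train split from above (maxDepth=3 is an assumed starting value):

    from pyspark.ml.classification import DecisionTreeClassifier

    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                                maxDepth=3)  # assumed initial depth
    dt_model = dt.fit(train)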

You can extract the number of nodes in the decision tree as well as the tree depth of the model.
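
For example:

    print(dt_model.numNodes)
    print(dt_model.depth)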

Evaluate the Decision Tree model with BinaryClassificationEvaluator.
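
Reusing the evaluator defined above (still set to areaUnderROC):

    dt_predictions = dt_model.transform(test)
    print(evaluator.evaluate(dt_predictions))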

Entropy and Gini impurity are the supported measures of impurity for decision trees; Gini is the default. Changing this value is simple, e.g., dt.setImpurity("entropy").

Now tune the model using ParamGridBuilder and CrossValidator.

With four values for maxDepth and three values for maxBins, the grid has 4 x 3 = 12 parameter settings for CrossValidator.
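
A sketch with assumed grid values matching those counts:

    dt_param_grid = (ParamGridBuilder()
                     .addGrid(dt.maxDepth, [1, 2, 6, 10])   # assumed values
                     .addGrid(dt.maxBins, [20, 40, 80])     # assumed values
                     .build())

    dt_cv = CrossValidator(estimator=dt, estimatorParamMaps=dt_param_grid,
                           evaluator=evaluator, numFolds=5)
    dt_cv_model = dt_cv.fit(train)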

Random Forest

Random forests use an ensemble of trees to improve model accuracy. You can read more about random forests in the classification and regression section of the MLlib Programming Guide.
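
A minimal sketch with the default parameters:

    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(labelCol="label", featuresCol="features")
    rf_model = rf.fit(train)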

Evaluate the Random Forest model with BinaryClassificationEvaluator.
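
As before:

    rf_predictions = rf_model.transform(test)
    print(evaluator.evaluate(rf_predictions))  # areaUnderROC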

Now tune the model with ParamGridBuilder and CrossValidator.

With three values for maxDepth, two values for maxBins, and two values for numTrees, the grid has 3 x 2 x 2 = 12 parameter settings for CrossValidator.
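
A sketch with assumed grid values matching those counts:

    rf_param_grid = (ParamGridBuilder()
                     .addGrid(rf.maxDepth, [2, 4, 6])   # assumed values
                     .addGrid(rf.maxBins, [20, 60])     # assumed values
                     .addGrid(rf.numTrees, [5, 20])     # assumed values
                     .build())

    rf_cv = CrossValidator(estimator=rf, estimatorParamMaps=rf_param_grid,
                           evaluator=evaluator, numFolds=5)
    rf_cv_model = rf_cv.fit(train)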

Make Predictions

As Random Forest gives the best areaUnderROC value, use the bestModel obtained from Random Forest for deployment, and use it to generate predictions on new data. Instead of new data, this example generates predictions on the entire dataset.
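
For example:

    best_model = rf_cv_model.bestModel
    final_predictions = best_model.transform(prepped)  # whole dataset in place of new data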

This example shows predictions grouped by age and occupation.
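
A sketch of one such grouping; the averaged prediction column is an illustrative choice, and display() is the Databricks notebook helper (use .show() elsewhere):

    from pyspark.sql.functions import avg

    display(final_predictions.groupBy("age", "occupation")
            .agg(avg("prediction").alias("avg_prediction"))
            .orderBy("age"))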

In an operational environment, analysts may use a similar machine learning pipeline to obtain predictions on new data, organize them into a table, and use them for analysis or lead targeting.