XGBoost Classification with Spark DataFrames

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
schema: org.apache.spark.sql.types.StructType = StructType(StructField(item,StringType,true),StructField(sepal length,DoubleType,true),StructField(sepal width,DoubleType,true),StructField(petal length,DoubleType,true),StructField(petal width,DoubleType,true),StructField(class,StringType,true))
rawInput: org.apache.spark.sql.DataFrame = [item: string, sepal length: double ... 4 more fields]

xgbParam: scala.collection.immutable.Map[String,Any] = Map(num_workers -> 2, max_depth -> 2, num_class -> 3, objective -> multi:softprob, num_round -> 100, eta -> 0.1)
xgbClassifier: ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier = xgbc_1b698523821f

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.168.5.140, DMLC_TRACKER_PORT=51375, DMLC_NUM_WORKER=2}
xgbClassificationModel: ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel = xgbc_1b698523821f

results: org.apache.spark.sql.DataFrame = [features: vector, classIndex: double ... 3 more fields]

command-1850871038733709:6: warning: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
  .setLabels(stringIndexer.labels)
  ^
import org.apache.spark.ml.feature._
labelConverter: org.apache.spark.ml.feature.IndexToString = idxToStr_ac19b6339e06

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.168.5.140, DMLC_TRACKER_PORT=39903, DMLC_NUM_WORKER=2}
import org.apache.spark.ml.Pipeline
pipeline: org.apache.spark.ml.Pipeline = pipeline_9659fd98e93b
model: org.apache.spark.ml.PipelineModel = pipeline_9659fd98e93b

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
evaluator: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = MulticlassClassificationEvaluator: uid=mcEval_9941644e264f, metricName=f1, metricLabel=0.0, beta=1.0, eps=1.0E-15
prediction: org.apache.spark.sql.DataFrame = [sepal length: double, sepal width: double ... 9 more fields]
accuracy: Double = 0.9650984224486947

The model accuracy is : 0.9650984224486947

xgboost-classification(Scala)

Prepare Data
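
The code for this step is not part of the export, so what follows is a minimal sketch reconstructed from the recorded output (the schema and the `features` / `classIndex` columns). The CSV path `irisCsvPath` and the header option are assumptions, not values taken from the notebook.

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// In a notebook a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().getOrCreate()

// Schema as it appears in the recorded cell output.
val schema = new StructType(Array(
  StructField("item", StringType, true),
  StructField("sepal length", DoubleType, true),
  StructField("sepal width", DoubleType, true),
  StructField("petal length", DoubleType, true),
  StructField("petal width", DoubleType, true),
  StructField("class", StringType, true)))

// Hypothetical location of the Iris CSV file; replace with a real path.
val irisCsvPath = "/path/to/iris.csv"

// Assumption: the file has a header row. Drop the option if it does not.
val rawInput: DataFrame = spark.read
  .schema(schema)
  .option("header", "true")
  .csv(irisCsvPath)

// XGBoost needs numeric labels, so index the string column "class".
val stringIndexer = new StringIndexer()
  .setInputCol("class")
  .setOutputCol("classIndex")
  .fit(rawInput)
val labelTransformed = stringIndexer.transform(rawInput).drop("class")

// Pack the four measurement columns into a single vector column.
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
  .setOutputCol("features")
val xgbInput = vectorAssembler
  .transform(labelTransformed)
  .select("features", "classIndex")
```

The resulting `xgbInput` DataFrame has exactly the two columns the classifier reads: a feature vector and a numeric label.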

Train XGBoost Model with Spark DataFrames
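
A sketch of the training step, assuming the XGBoost4J-Spark `XGBoostClassifier` constructor that accepts a parameter map. The parameter values are the ones shown in the recorded `xgbParam` output; the helper name `trainAndScore` is hypothetical and only serves to keep the block self-contained.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import org.apache.spark.sql.DataFrame

// Parameter values copied from the recorded xgbParam output.
val xgbParam: Map[String, Any] = Map(
  "eta" -> 0.1,
  "max_depth" -> 2,
  "objective" -> "multi:softprob",
  "num_class" -> 3,
  "num_round" -> 100,
  "num_workers" -> 2)

// Hypothetical helper: fits the classifier and scores the same DataFrame.
// Expects a vector column "features" and a numeric label column "classIndex".
def trainAndScore(xgbInput: DataFrame): (XGBoostClassificationModel, DataFrame) = {
  val xgbClassifier = new XGBoostClassifier(xgbParam)
    .setFeaturesCol("features")
    .setLabelCol("classIndex")
  val xgbClassificationModel = xgbClassifier.fit(xgbInput)
  // transform() appends the prediction columns to the input columns.
  val results = xgbClassificationModel.transform(xgbInput)
  (xgbClassificationModel, results)
}
```

With `multi:softprob`, `num_class` has to match the number of distinct labels, which is 3 for the Iris classes. `num_workers` sets how many distributed XGBoost workers are launched, which matches `DMLC_NUM_WORKER=2` in the tracker log.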

Embed XGBoost in ML Pipeline
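
A sketch of the pipeline step, under the same assumptions as the recorded output: a `class` string column, four measurement columns, and an `IndexToString` stage that maps predictions back to class names. It uses `labelsArray(0)` instead of the deprecated `labels` that triggered the compiler warning. The helper names `fitPipeline` and `accuracyOf` are hypothetical, and how the original notebook split training and test data is not shown.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: builds and fits the four-stage pipeline.
def fitPipeline(training: DataFrame, xgbParam: Map[String, Any]): PipelineModel = {
  // Fit the indexer first so its labels are available to IndexToString.
  val stringIndexer = new StringIndexer()
    .setInputCol("class")
    .setOutputCol("classIndex")
    .fit(training)

  val vectorAssembler = new VectorAssembler()
    .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
    .setOutputCol("features")

  val xgbClassifier = new XGBoostClassifier(xgbParam)
    .setFeaturesCol("features")
    .setLabelCol("classIndex")

  // labelsArray(0) holds the labels of the single indexed column (Spark 3.0+).
  val labelConverter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("realLabel")
    .setLabels(stringIndexer.labelsArray(0))

  new Pipeline()
    .setStages(Array[PipelineStage](
      stringIndexer, vectorAssembler, xgbClassifier, labelConverter))
    .fit(training)
}

// Hypothetical helper: scores predictions produced by the fitted pipeline.
def accuracyOf(prediction: DataFrame): Double =
  new MulticlassClassificationEvaluator()
    .setLabelCol("classIndex")
    .setPredictionCol("prediction")
    .setMetricName("accuracy") // the evaluator's default metric is "f1"
    .evaluate(prediction)
```

The recorded evaluator output reports `metricName=f1`, so the value printed as "accuracy" in that run is the F1 score under the evaluator's default. Setting the metric name explicitly, as in the sketch, makes the reported number match its label.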