gbt-regression(Python)

Regression with gradient-boosted trees and MLlib pipelines

This notebook uses a bike-sharing dataset to illustrate MLlib pipelines and the gradient-boosted trees machine learning algorithm. The challenge is to predict the number of bicycle rentals per hour based on the features available in the dataset such as day of the week, weather, season, and so on. Demand prediction is a common problem across businesses; good predictions allow a business or service to optimize inventory and to match supply and demand to make customers happy and maximize profitability.

Load the dataset

The dataset is from the UCI Machine Learning Repository and is provided with Databricks Runtime. The dataset includes information about bicycle rentals from the Capital bikeshare system in 2011 and 2012.

Load the data using the CSV datasource for Spark, which creates a Spark DataFrame.

df = spark.read.csv("/databricks-datasets/bikeSharing/data-001/hour.csv", header="true", inferSchema="true")
# The following command caches the DataFrame in memory. This improves performance since subsequent calls to the DataFrame can read from memory instead of re-reading the data from disk.
df.cache()
Out[1]: DataFrame[instant: int, dteday: string, season: int, yr: int, mnth: int, hr: int, holiday: int, weekday: int, workingday: int, weathersit: int, temp: double, atemp: double, hum: double, windspeed: double, casual: int, registered: int, cnt: int]

Data description

The following columns are included in the dataset:

Index column:

  • instant: record index

Feature columns:

  • dteday: date
  • season: season (1:spring, 2:summer, 3:fall, 4:winter)
  • yr: year (0:2011, 1:2012)
  • mnth: month (1 to 12)
  • hr: hour (0 to 23)
  • holiday: 1 if holiday, 0 otherwise
  • weekday: day of the week (0 to 6)
  • workingday: 0 if weekend or holiday, 1 otherwise
  • weathersit: (1:clear, 2:mist or clouds, 3:light rain or snow, 4:heavy rain or snow)
  • temp: normalized temperature in Celsius
  • atemp: normalized feeling temperature in Celsius
  • hum: normalized humidity
  • windspeed: normalized wind speed

Label columns:

  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes including both casual and registered

Call display() on a DataFrame to see a sample of the data. The first row shows that 16 people rented bikes between midnight and 1am on January 1, 2011.

display(df)
 
#   instant  dteday      season  yr  mnth  hr  holiday  weekday  workingday  weathersit  temp  atemp   hum   windspeed  casual  registered  cnt
1   1        2011-01-01  1       0   1     0   0        6        0           1           0.24  0.2879  0.81  0          3       13          16
2   2        2011-01-01  1       0   1     1   0        6        0           1           0.22  0.2727  0.8   0          8       32          40
3   3        2011-01-01  1       0   1     2   0        6        0           1           0.22  0.2727  0.8   0          5       27          32
4   4        2011-01-01  1       0   1     3   0        6        0           1           0.24  0.2879  0.75  0          3       10          13
5   5        2011-01-01  1       0   1     4   0        6        0           1           0.24  0.2879  0.75  0          0       1           1
6   6        2011-01-01  1       0   1     5   0        6        0           2           0.24  0.2576  0.75  0.0896     0       1           1
7   7        2011-01-01  1       0   1     6   0        6        0           1           0.22  0.2727  0.8   0          2       0           2
8   8        2011-01-01  1       0   1     7   0        6        0           1           0.2   0.2576  0.86  0          1       2           3
9   9        2011-01-01  1       0   1     8   0        6        0           1           0.24  0.2879  0.75  0          1       7           8
10  10       2011-01-01  1       0   1     9   0        6        0           1           0.32  0.3485  0.76  0          8       6           14
11  11       2011-01-01  1       0   1     10  0        6        0           1           0.38  0.3939  0.76  0.2537     12      24          36
12  12       2011-01-01  1       0   1     11  0        6        0           1           0.36  0.3333  0.81  0.2836     26      30          56
13  13       2011-01-01  1       0   1     12  0        6        0           1           0.42  0.4242  0.77  0.2836     29      55          84
14  14       2011-01-01  1       0   1     13  0        6        0           2           0.46  0.4545  0.72  0.2985     47      47          94
15  15       2011-01-01  1       0   1     14  0        6        0           2           0.46  0.4545  0.72  0.2836     35      71          106
16  16       2011-01-01  1       0   1     15  0        6        0           2           0.44  0.4394  0.77  0.2985     40      70          110
17  17       2011-01-01  1       0   1     16  0        6        0           2           0.42  0.4242  0.82  0.2985     41      52          93
18  18       2011-01-01  1       0   1     17  0        6        0           2           0.44  0.4394  0.82  0.2836     15      52          67

Showing the first 1000 rows.

print("The dataset has %d rows." % df.count())
The dataset has 17379 rows.

Preprocess data

This dataset is well prepared for machine learning algorithms. The numeric input columns (temp, atemp, hum, and windspeed) are normalized, categorical values (season, yr, mnth, hr, holiday, weekday, workingday, weathersit) are converted to indices, and all of the columns except for the date (dteday) are numeric.

The goal is to predict the count of bike rentals (the cnt column). Reviewing the dataset, you can see that some columns contain duplicate information. For example, the cnt column equals the sum of the casual and registered columns. You should remove the casual and registered columns from the dataset. The index column instant is also not useful as a predictor.

You can also delete the column dteday, as this information is already included in the other date-related columns yr, mnth, and weekday.
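As a quick sanity check on the claim that cnt duplicates information, the following plain-Python sketch tests whether cnt equals casual plus registered on the first six rows displayed earlier. The helper name label_is_redundant is hypothetical, and checking a handful of rows does not prove the relationship for the whole dataset.

```python
# (casual, registered, cnt) copied from the first six rows of the display() output.
sample_rows = [
    (3, 13, 16),
    (8, 32, 40),
    (5, 27, 32),
    (3, 10, 13),
    (0, 1, 1),
    (0, 1, 1),
]


def label_is_redundant(rows):
    """Return True when cnt == casual + registered in every row."""
    return all(casual + registered == cnt for casual, registered, cnt in rows)


print(label_is_redundant(sample_rows))  # True for these sample rows
```

Because the label can be reconstructed exactly from casual and registered, leaving them in the feature set would leak the answer to the model.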

df = df.drop("instant").drop("dteday").drop("casual").drop("registered")
display(df)
 
#   season  yr  mnth  hr  holiday  weekday  workingday  weathersit  temp  atemp   hum   windspeed  cnt
1   1       0   1     0   0        6        0           1           0.24  0.2879  0.81  0          16
2   1       0   1     1   0        6        0           1           0.22  0.2727  0.8   0          40
3   1       0   1     2   0        6        0           1           0.22  0.2727  0.8   0          32
4   1       0   1     3   0        6        0           1           0.24  0.2879  0.75  0          13
5   1       0   1     4   0        6        0           1           0.24  0.2879  0.75  0          1
6   1       0   1     5   0        6        0           2           0.24  0.2576  0.75  0.0896     1
7   1       0   1     6   0        6        0           1           0.22  0.2727  0.8   0          2
8   1       0   1     7   0        6        0           1           0.2   0.2576  0.86  0          3
9   1       0   1     8   0        6        0           1           0.24  0.2879  0.75  0          8
10  1       0   1     9   0        6        0           1           0.32  0.3485  0.76  0          14
11  1       0   1     10  0        6        0           1           0.38  0.3939  0.76  0.2537     36
12  1       0   1     11  0        6        0           1           0.36  0.3333  0.81  0.2836     56
13  1       0   1     12  0        6        0           1           0.42  0.4242  0.77  0.2836     84
14  1       0   1     13  0        6        0           2           0.46  0.4545  0.72  0.2985     94
15  1       0   1     14  0        6        0           2           0.46  0.4545  0.72  0.2836     106
16  1       0   1     15  0        6        0           2           0.44  0.4394  0.77  0.2985     110
17  1       0   1     16  0        6        0           2           0.42  0.4242  0.82  0.2985     93
18  1       0   1     17  0        6        0           2           0.44  0.4394  0.82  0.2836     67

Showing the first 1000 rows.

Print the dataset schema to see the type of each column.

df.printSchema()
root
 |-- season: integer (nullable = true)
 |-- yr: integer (nullable = true)
 |-- mnth: integer (nullable = true)
 |-- hr: integer (nullable = true)
 |-- holiday: integer (nullable = true)
 |-- weekday: integer (nullable = true)
 |-- workingday: integer (nullable = true)
 |-- weathersit: integer (nullable = true)
 |-- temp: double (nullable = true)
 |-- atemp: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- windspeed: double (nullable = true)
 |-- cnt: integer (nullable = true)

Split data into training and test sets

Randomly split data into training and test sets. By doing this, you can train and tune the model using only the training subset, and then evaluate the model's performance on the test set to get a sense of how the model will perform on new data.

# Split the dataset randomly into roughly 70% for training and 30% for testing. Pass a seed so that the split is reproducible.
train, test = df.randomSplit([0.7, 0.3], seed = 0)
print("There are %d training examples and %d test examples." % (train.count(), test.count()))
There are 12081 training examples and 5298 test examples.
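The counts above are close to, but not exactly, 70% and 30% of the 17,379 rows. One way to see why is the following plain-Python sketch of a weighted random split that draws one random number per row. This is an illustration of the idea only, not Spark's implementation; the function name random_split and the use of Python's random module are assumptions made for the example.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative boundaries, for example [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed by an earlier one.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(17379))  # stand-in for the 17,379 rows of the dataset
train_a, test_a = random_split(rows, [0.7, 0.3], seed=0)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=0)

print(len(train_a), len(test_a))  # roughly a 70/30 split
print(train_a == train_b)         # True: the same seed gives the same split
```

Because each row is assigned independently, the split sizes vary slightly around the requested proportions, and fixing the seed is what makes the result repeatable.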

Visualize the data

You can plot the data to explore it visually. The following plot shows the number of bicycle rentals during each hour of the day. As you might expect, rentals are low during the night, and peak at commute hours.

To create plots, call display() on a DataFrame in Databricks and click the plot icon below the table.

To create the plot shown, run the command in the following cell. The results appear in a table. From the drop-down menu below the table, select "Line". Click Plot Options.... In the dialog, drag hr to the Keys field, and drag cnt to the Values field. Also in the Keys field, click the "x" next to <id> to remove it. In the Aggregation drop down, select "AVG".

display(train.select("hr", "cnt"))

[Line plot: average cnt (y-axis, 0 to 400) by hr (x-axis, 0 to 22). Aggregated (by avg) in the backend.]
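The plot relies on an AVG aggregation keyed by hr. The following plain-Python sketch shows what that aggregation computes, using a tiny illustrative sample of (hr, cnt) pairs rather than the real training set. The helper name average_by_key is hypothetical. In Spark, an expression along the lines of train.groupBy("hr").avg("cnt").orderBy("hr") should produce the equivalent table.

```python
from collections import defaultdict


def average_by_key(pairs):
    """Average the values for each key and return the result sorted by key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sorted(sums)}


# Illustrative (hr, cnt) pairs only; not the actual training data.
sample = [(0, 16), (0, 22), (1, 40), (1, 12), (2, 32)]
hourly_avg = average_by_key(sample)
print(hourly_avg)  # {0: 19.0, 1: 26.0, 2: 32.0}
```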

Train the machine learning pipeline

Now that you have reviewed the data and prepared it as a DataFrame with numeric values, you're ready to train a model to predict future bike sharing rentals.

Most MLlib algorithms require a single input column containing a vector of features and a single target column. The DataFrame currently has one column for each feature. MLlib provides functions to help you prepare the dataset in the required format.

MLlib pipelines combine multiple steps into a single workflow, making it easier to iterate as you develop the model.

In this example, you create a pipeline using the following functions:

  • VectorAssembler: Assembles the feature columns into a feature vector.
  • VectorIndexer: Identifies columns that should be treated as categorical. This is done heuristically: any column with no more than maxCategories distinct values (4 in this notebook) is treated as categorical. In this example, the following columns are considered categorical: yr (2 values), season (4 values), holiday (2 values), workingday (2 values), and weathersit (4 values).
  • GBTRegressor: Uses the Gradient-Boosted Trees (GBT) algorithm to learn how to predict rental counts from the feature vectors.
  • CrossValidator: The GBT algorithm has several hyperparameters. This notebook illustrates how to use hyperparameter tuning in Spark. This capability automatically tests a grid of hyperparameters and chooses the best resulting model.
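To make the VectorIndexer heuristic concrete, here is a plain-Python sketch that applies a distinct-value threshold to the integer-coded columns. The cardinalities are taken from the data description above (for example, hr runs from 0 to 23, so 24 values), the threshold of 4 is an assumption that matches the five columns listed, and the helper name categorical_columns is hypothetical. It mimics the decision rule only, not the indexing that VectorIndexer performs.

```python
# Distinct values per integer-coded column, from the data description.
distinct_values = {
    "season": 4,
    "yr": 2,
    "mnth": 12,
    "hr": 24,
    "holiday": 2,
    "weekday": 7,
    "workingday": 2,
    "weathersit": 4,
}


def categorical_columns(cardinalities, max_categories):
    """Return the columns with at most max_categories distinct values."""
    return sorted(
        name for name, count in cardinalities.items() if count <= max_categories
    )


print(categorical_columns(distinct_values, max_categories=4))
# ['holiday', 'season', 'weathersit', 'workingday', 'yr']
```

With this rule, mnth, hr, and weekday are treated as continuous even though they are conceptually categorical; a larger threshold would change that.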

For more information:
VectorAssembler
VectorIndexer

The first step is to create the VectorAssembler and VectorIndexer steps.

from pyspark.ml.feature import VectorAssembler, VectorIndexer
 
# Remove the target column from the input feature set.
featuresCols = df.columns
featuresCols.remove('cnt')
 
# vectorAssembler combines all feature columns into a single feature vector column, "rawFeatures".
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")
 
# vectorIndexer identifies categorical features and indexes them, and creates a new column "features". 
vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

Next, define the model.

from pyspark.ml.regression import GBTRegressor
 
# The next step is to define the model training stage of the pipeline. 
# The following command defines a GBTRegressor model that takes an input column "features" by default and learns to predict the labels in the "cnt" column. 
gbt = GBTRegressor(labelCol="cnt")

The third step is to wrap the model you just defined in a CrossValidator stage. CrossValidator calls the GBT algorithm with different hyperparameter settings. It trains multiple models and selects the best one, based on minimizing a specified metric. In this example, the metric is root mean squared error (RMSE).

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
 
# Define a grid of hyperparameters to test:
#  - maxDepth: maximum depth of each decision tree 
#  - maxIter: iterations, or the total number of trees 
paramGrid = ParamGridBuilder()\
  .addGrid(gbt.maxDepth, [2, 5])\
  .addGrid(gbt.maxIter, [10, 100])\
  .build()
 
# Define an evaluation metric.  The CrossValidator compares the true labels with predicted values for each combination of parameters, and calculates this value to determine the best model.
evaluator = RegressionEvaluator(metricName="rmse", labelCol=gbt.getLabelCol(), predictionCol=gbt.getPredictionCol())
 
# Declare the CrossValidator, which performs the model tuning.
cv = CrossValidator(estimator=gbt, evaluator=evaluator, estimatorParamMaps=paramGrid)
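To get a feel for how much work the tuning step involves, the following plain-Python sketch expands the same grid into its individual combinations. It assumes 3 folds, which is the CrossValidator default when numFolds is not set; the helper name expand_grid is hypothetical.

```python
from itertools import product

grid = {"maxDepth": [2, 5], "maxIter": [10, 100]}
num_folds = 3  # assumption: CrossValidator's default when numFolds is not set


def expand_grid(param_grid):
    """Return one dict per combination of hyperparameter values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combos = expand_grid(grid)
for combo in combos:
    print(combo)

# Each combination is trained once per fold during cross-validation.
print("fits during cross-validation:", len(combos) * num_folds)  # 12
```

Under these assumptions, the grid produces 4 combinations and 12 model fits, plus one more fit of the best combination on the full training set. This is why adding values to the grid quickly increases training time.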

Create the pipeline.

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])

Train the pipeline.

Now that you have set up the workflow, you can train the pipeline with a single call.
When you call fit(), the pipeline runs feature processing, model tuning, and training and returns a fitted pipeline with the best model it found. This step takes several minutes.

pipelineModel = pipeline.fit(train)
MLlib will automatically track trials in MLflow. After your tuning fit() call has completed, view the MLflow UI to see logged runs.

Make predictions and evaluate results

The final step is to use the fitted model to make predictions on the test dataset and evaluate the model's performance. The model's performance on the test dataset provides an approximation of how it is likely to perform on new data. For example, if you had weather predictions for the next week, you could predict bike rentals expected during the next week.

Computing evaluation metrics is important for understanding the quality of predictions, as well as for comparing models and tuning parameters.

The transform() method of the pipeline model applies the full pipeline to the input dataset. The pipeline applies the feature processing steps to the dataset and then uses the fitted GBT model to make predictions. The pipeline returns a DataFrame with a new column, prediction.

predictions = pipelineModel.transform(test)
display(predictions.select("cnt", "prediction", *featuresCols))
 
#   cnt  prediction           season  yr  mnth  hr  holiday  weekday  workingday  weathersit  temp  atemp   hum   windspeed
1   22   30.459169014126616   1       0   1     0   0        0        0           1           0.04  0.0758  0.57  0.1045
2   17   72.91218829310101    1       0   1     0   0        0        0           2           0.46  0.4545  0.88  0.2985
3   7    12.596116880163093   1       0   1     0   0        1        1           2           0.24  0.2273  0.65  0.2239
4   17   13.814131066825366   1       0   1     0   0        5        1           2           0.2   0.197   0.64  0.194
5   9    9.634951156982405    1       0   1     0   0        5        1           2           0.2   0.2121  0.75  0.1343
6   17   21.42830109839664    1       0   1     0   1        1        0           2           0.2   0.197   0.47  0.2239
7   13   29.296007703247973   1       0   1     1   0        0        0           1           0.04  0.0758  0.57  0.1045
8   12   15.718949632139577   1       0   1     1   0        0        0           1           0.1   0.0606  0.42  0.4627
9   17   -57.52076673506861   1       0   1     1   0        0        0           2           0.44  0.4394  0.94  0.2537
10  2    -5.202123096148016   1       0   1     1   0        1        1           1           0.2   0.1667  0.44  0.4179
11  7    -0.6895834440548108  1       0   1     1   0        1        1           1           0.22  0.2121  0.64  0.2537
12  2    0.9973616578200352   1       0   1     1   0        2        1           1           0.16  0.1818  0.59  0.1045
13  6    -4.215741115457107   1       0   1     1   0        3        1           2           0.16  0.1818  0.86  0.1045
14  7    2.7480553927857434   1       0   1     1   0        5        1           2           0.2   0.197   0.69  0.2239
15  3    -3.662862099515235   1       0   1     1   0        5        1           2           0.2   0.2121  0.75  0.1343
16  40   31.691478689880885   1       0   1     1   0        6        0           1           0.22  0.2727  0.8   0
17  11   31.79117735281862    1       0   1     2   0        0        0           1           0.16  0.2273  0.8   0
18  16   36.980066811665395   1       0   1     2   0        0        0           1           0.26  0.2879  0.56  0.0896

Showing the first 1000 rows.

A common way to evaluate the performance of a regression model is to calculate the root mean squared error (RMSE). The value is not very informative on its own, but you can use it to compare different models. CrossValidator determines the best model by selecting the one that minimizes RMSE.

rmse = evaluator.evaluate(predictions)
print("RMSE on our test set: %g" % rmse)
RMSE on our test set: 45.8363
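For reference, RMSE is the square root of the mean of the squared differences between labels and predictions. The following plain-Python sketch computes it for three (cnt, prediction) pairs taken from the first rows of the predictions table, with the predictions rounded to two decimals. The function name rmse is chosen for the example; the result for three rows is not comparable to the full test-set value reported above.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# (cnt, prediction) from the first three displayed rows, predictions rounded.
labels = [22, 17, 7]
preds = [30.46, 72.91, 12.60]
print(rmse(labels, preds))
```

Because the errors are squared before averaging, a single large miss (such as the second row here) dominates the result.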

You can also plot the results, as you did with the original dataset. In this case, the hourly average of the predicted rentals shows a shape similar to that of the actual counts.

display(predictions.select("hr", "prediction"))

[Line plot: average prediction (y-axis, 0 to 450) by hr (x-axis, 0 to 22). Aggregated (by avg) in the backend.]

It's also a good idea to examine the residuals, or the difference between the expected result and the predicted value. The residuals should be randomly distributed; if there are any patterns in the residuals, the model may not be capturing something important. In this case, the average residual is about 1, less than 1% of the average value of the cnt column.

import pyspark.sql.functions as F
predictions_with_residuals = predictions.withColumn("residual", (F.col("cnt") - F.col("prediction")))
display(predictions_with_residuals.agg({'residual': 'mean'}))
 
avg(residual)
1
1.0911222027267997

Showing all 1 rows.
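A mean residual near zero indicates that the model does not systematically over-predict or under-predict, but it says little about the size of individual errors, because positive and negative residuals cancel. The following plain-Python sketch illustrates this with hypothetical numbers; the function names are chosen for the example.

```python
def mean_residual(labels, predictions):
    """Average of the signed residuals (label minus prediction)."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    return sum(residuals) / len(residuals)


def mean_absolute_residual(labels, predictions):
    """Average of the residual magnitudes, ignoring sign."""
    residuals = [abs(y - p) for y, p in zip(labels, predictions)]
    return sum(residuals) / len(residuals)


# Hypothetical values: one prediction is 10 too high, the other 10 too low.
labels = [10, 20]
preds = [20, 10]
print(mean_residual(labels, preds))           # 0.0
print(mean_absolute_residual(labels, preds))  # 10.0
```

This is why the mean residual is best read together with RMSE and with plots of the residuals.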

Plot the residuals across the hours of the day to look for any patterns. In this example, there are no obvious correlations.

display(predictions_with_residuals.select("hr", "residual"))

[Line plot: average residual (y-axis, −10 to 10) by hr (x-axis, 0 to 22). Aggregated (by avg) in the backend.]

Improving the model

Here are some suggestions for improving this model:

  • The count of rentals is the sum of registered and casual rentals. These two counts may have different behavior, as frequent cyclists and casual cyclists may rent bikes for different reasons. Try training one GBT model for registered and one for casual, and then add their predictions together to get the full prediction.
  • For efficiency, this notebook used only a few hyperparameter settings. You might be able to improve the model by testing more settings. A good start is to increase the number of trees by setting maxIter=200; this takes longer to train but might be more accurate.
  • This notebook used the dataset features as-is, but you might be able to improve performance with some feature engineering. For example, the weather might have more of an impact on the number of rentals on weekends and holidays than on workdays. You could try creating a new feature by combining those two columns. MLlib provides a suite of feature transformers; find out more in the ML guide.
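As one possible reading of the feature engineering suggestion, the following plain-Python sketch combines workingday (0 or 1) and weathersit (1 to 4) into a single code with eight values, one per combination. The function name weather_daytype_code and the encoding are hypothetical choices for illustration, and whether such a feature helps would need to be tested.

```python
def weather_daytype_code(workingday, weathersit):
    """Map (workingday in {0, 1}, weathersit in {1, 2, 3, 4}) to a code 0-7."""
    return workingday * 4 + (weathersit - 1)


# Enumerate every combination to confirm that each gets its own code.
codes = {
    (workingday, weathersit): weather_daytype_code(workingday, weathersit)
    for workingday in (0, 1)
    for weathersit in (1, 2, 3, 4)
}
for combination, code in sorted(codes.items()):
    print(combination, "->", code)
```

Note that a column with eight distinct values exceeds a VectorIndexer threshold of maxCategories=4, so it would be treated as continuous unless the threshold is raised.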