gbt-regression(Python)

Regression with gradient-boosted trees and MLlib pipelines

This notebook uses a bike-sharing dataset to illustrate MLlib pipelines and the gradient-boosted trees (GBT) machine learning algorithm. The goal is to predict the number of bicycle rentals per hour from features in the dataset such as day of the week, weather, and season. Demand prediction is a common business problem; accurate predictions let a business or service optimize inventory and match supply with demand, keeping customers happy and maximizing profitability.

Load the dataset

The dataset is from the UCI Machine Learning Repository and is provided with Databricks Runtime. The dataset includes information about bicycle rentals from the Capital Bikeshare system in 2011 and 2012.

Load the data using the CSV datasource for Spark, which creates a Spark DataFrame.

df = spark.read.csv("/databricks-datasets/bikeSharing/data-001/hour.csv", header="true", inferSchema="true")
# The following command caches the DataFrame in memory. This improves performance since subsequent calls to the DataFrame can read from memory instead of re-reading the data from disk.
df.cache()
Out[1]: DataFrame[instant: int, dteday: string, season: int, yr: int, mnth: int, hr: int, holiday: int, weekday: int, workingday: int, weathersit: int, temp: double, atemp: double, hum: double, windspeed: double, casual: int, registered: int, cnt: int]

Data description

The following columns are included in the dataset:

Index column:

  • instant: record index

Feature columns:

  • dteday: date
  • season: season (1:spring, 2:summer, 3:fall, 4:winter)
  • yr: year (0:2011, 1:2012)
  • mnth: month (1 to 12)
  • hr: hour (0 to 23)
  • holiday: 1 if holiday, 0 otherwise
  • weekday: day of the week (0 to 6)
  • workingday: 0 if weekend or holiday, 1 otherwise
  • weathersit: (1:clear, 2:mist or clouds, 3:light rain or snow, 4:heavy rain or snow)
  • temp: normalized temperature in Celsius
  • atemp: normalized feeling temperature in Celsius
  • hum: normalized humidity
  • windspeed: normalized wind speed

Label columns:

  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes including both casual and registered

Call display() on a DataFrame to see a sample of the data. The first row shows that 16 people rented bikes between midnight and 1am on January 1, 2011.

display(df)
 
instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt
------- | ---------- | ------ | -- | ---- | -- | ------- | ------- | ---------- | ---------- | ---- | ------ | ---- | --------- | ------ | ---------- | ---
1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0 | 3 | 13 | 16
2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0 | 8 | 32 | 40
3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0 | 5 | 27 | 32
4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0 | 3 | 10 | 13
5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0 | 0 | 1 | 1
6 | 2011-01-01 | 1 | 0 | 1 | 5 | 0 | 6 | 0 | 2 | 0.24 | 0.2576 | 0.75 | 0.0896 | 0 | 1 | 1
7 | 2011-01-01 | 1 | 0 | 1 | 6 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0 | 2 | 0 | 2
8 | 2011-01-01 | 1 | 0 | 1 | 7 | 0 | 6 | 0 | 1 | 0.2 | 0.2576 | 0.86 | 0 | 1 | 2 | 3
9 | 2011-01-01 | 1 | 0 | 1 | 8 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0 | 1 | 7 | 8
10 | 2011-01-01 | 1 | 0 | 1 | 9 | 0 | 6 | 0 | 1 | 0.32 | 0.3485 | 0.76 | 0 | 8 | 6 | 14
11 | 2011-01-01 | 1 | 0 | 1 | 10 | 0 | 6 | 0 | 1 | 0.38 | 0.3939 | 0.76 | 0.2537 | 12 | 24 | 36
12 | 2011-01-01 | 1 | 0 | 1 | 11 | 0 | 6 | 0 | 1 | 0.36 | 0.3333 | 0.81 | 0.2836 | 26 | 30 | 56
13 | 2011-01-01 | 1 | 0 | 1 | 12 | 0 | 6 | 0 | 1 | 0.42 | 0.4242 | 0.77 | 0.2836 | 29 | 55 | 84
14 | 2011-01-01 | 1 | 0 | 1 | 13 | 0 | 6 | 0 | 2 | 0.46 | 0.4545 | 0.72 | 0.2985 | 47 | 47 | 94
15 | 2011-01-01 | 1 | 0 | 1 | 14 | 0 | 6 | 0 | 2 | 0.46 | 0.4545 | 0.72 | 0.2836 | 35 | 71 | 106
16 | 2011-01-01 | 1 | 0 | 1 | 15 | 0 | 6 | 0 | 2 | 0.44 | 0.4394 | 0.77 | 0.2985 | 40 | 70 | 110
17 | 2011-01-01 | 1 | 0 | 1 | 16 | 0 | 6 | 0 | 2 | 0.42 | 0.4242 | 0.82 | 0.2985 | 41 | 52 | 93
18 | 2011-01-01 | 1 | 0 | 1 | 17 | 0 | 6 | 0 | 2 | 0.44 | 0.4394 | 0.82 | 0.2836 | 15 | 52 | 67

Showing the first 1000 rows.

print("The dataset has %d rows." % df.count())
The dataset has 17379 rows.

Preprocess data

This dataset is well prepared for machine learning algorithms. The numeric input columns (temp, atemp, hum, and windspeed) are normalized, the categorical columns (season, yr, mnth, hr, holiday, weekday, workingday, weathersit) are already encoded as indices, and all of the columns except the date (dteday) are numeric.

The goal is to predict the count of bike rentals (the cnt column). Reviewing the dataset, you can see that some columns contain duplicate information. For example, the cnt column equals the sum of the casual and registered columns, so keeping those two columns would leak the label; remove them from the dataset. The index column instant is also not useful as a predictor.

You can also delete the column dteday, as this information is already included in the other date-related columns yr, mnth, and weekday.
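As an optional sanity check before dropping these columns, you can confirm that cnt really equals casual plus registered. The following is a minimal sketch using standard PySpark column expressions:

from pyspark.sql import functions as F

# Count rows where cnt does not equal casual + registered.
# If the label columns are consistent, this should print 0.
mismatches = df.filter(F.col("cnt") != F.col("casual") + F.col("registered")).count()
print("Rows where cnt != casual + registered: %d" % mismatches)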

df = df.drop("instant").drop("dteday").drop("casual").drop("registered")
display(df)
 
season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt
------ | -- | ---- | -- | ------- | ------- | ---------- | ---------- | ---- | ------ | ---- | --------- | ---
1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0 | 16
1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0 | 40
1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0 | 32
1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0 | 13
1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0 | 1
1 | 0 | 1 | 5 | 0 | 6 | 0 | 2 | 0.24 | 0.2576 | 0.75 | 0.0896 | 1
1 | 0 | 1 | 6 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0 | 2
1 | 0 | 1 | 7 | 0 | 6 | 0 | 1 | 0.2 | 0.2576 | 0.86 | 0 | 3
1 | 0 | 1 | 8 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0 | 8
1 | 0 | 1 | 9 | 0 | 6 | 0 | 1 | 0.32 | 0.3485 | 0.76 | 0 | 14
1 | 0 | 1 | 10 | 0 | 6 | 0 | 1 | 0.38 | 0.3939 | 0.76 | 0.2537 | 36
1 | 0 | 1 | 11 | 0 | 6 | 0 | 1 | 0.36 | 0.3333 | 0.81 | 0.2836 | 56
1 | 0 | 1 | 12 | 0 | 6 | 0 | 1 | 0.42 | 0.4242 | 0.77 | 0.2836 | 84
1 | 0 | 1 | 13 | 0 | 6 | 0 | 2 | 0.46 | 0.4545 | 0.72 | 0.2985 | 94
1 | 0 | 1 | 14 | 0 | 6 | 0 | 2 | 0.46 | 0.4545 | 0.72 | 0.2836 | 106
1 | 0 | 1 | 15 | 0 | 6 | 0 | 2 | 0.44 | 0.4394 | 0.77 | 0.2985 | 110
1 | 0 | 1 | 16 | 0 | 6 | 0 | 2 | 0.42 | 0.4242 | 0.82 | 0.2985 | 93
1 | 0 | 1 | 17 | 0 | 6 | 0 | 2 | 0.44 | 0.4394 | 0.82 | 0.2836 | 67

Showing the first 1000 rows.

Print the dataset schema to see the type of each column.

df.printSchema()
root
 |-- season: integer (nullable = true)
 |-- yr: integer (nullable = true)
 |-- mnth: integer (nullable = true)
 |-- hr: integer (nullable = true)
 |-- holiday: integer (nullable = true)
 |-- weekday: integer (nullable = true)
 |-- workingday: integer (nullable = true)
 |-- weathersit: integer (nullable = true)
 |-- temp: double (nullable = true)
 |-- atemp: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- windspeed: double (nullable = true)
 |-- cnt: integer (nullable = true)

Split data into training and test sets

Randomly split data into training and test sets. By doing this, you can train and tune the model using only the training subset, and then evaluate the model's performance on the test set to get a sense of how the model will perform on new data.

# Split the dataset randomly into 70% for training and 30% for testing. Pass a seed for reproducible results.
train, test = df.randomSplit([0.7, 0.3], seed = 0)
print("There are %d training examples and %d test examples." % (train.count(), test.count()))
There are 12081 training examples and 5298 test examples.

Visualize the data

You can plot the data to explore it visually. The following plot shows the number of bicycle rentals during each hour of the day. As you might expect, rentals are low during the night, and peak at commute hours.

To create plots, call display() on a DataFrame in Databricks and click the plot icon below the table.

To create the plot shown, run the command in the following cell. The results appear in a table. From the drop-down menu below the table, select "Line". Click Plot Options.... In the dialog, drag hr to the Keys field and drag cnt to the Values field. Also in the Keys field, click the "x" next to <id> to remove it. In the Aggregation drop-down, select "AVG".

display(train.select("hr", "cnt"))

Aggregated (by avg) in the backend.

[Line chart: average cnt for each hr of the day (0-23)]
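
You can also compute the same aggregation directly with DataFrame operations instead of the plot UI. The following is a minimal sketch using standard groupBy and avg (the alias avg_cnt is just an illustrative name):

from pyspark.sql import functions as F

# Average rental count for each hour of the day, sorted by hour.
hourly_avg = train.groupBy("hr").agg(F.avg("cnt").alias("avg_cnt")).orderBy("hr")
display(hourly_avg)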

Train the machine learning pipeline

Now that you have reviewed the data and prepared it as a DataFrame with numeric values, you're ready to train a model to predict future bike sharing rentals.

Most MLlib algorithms require a single input column containing a vector of features and a single target column. The DataFrame currently has one column for each feature. MLlib provides functions to help you prepare the dataset in the required format.

MLlib pipelines combine multiple steps into a single workflow, making it easier to iterate as you develop the model.

In this example, you create a pipeline using the following functions:

  • VectorAssembler: Assembles the feature columns into a feature vector.
  • VectorIndexer: Identifies columns that should be treated as categorical. This is done heuristically, identifying any column with a small number of distinct values as categorical. In this example, the following columns are considered categorical: yr (2 values), season (4 values), holiday (2 values), workingday (2 values), and weathersit (4 values).
  • GBTRegressor: Uses the Gradient-Boosted Trees (GBT) algorithm to learn how to predict rental counts from the feature vectors.
  • CrossValidator: The GBT algorithm has several hyperparameters. This notebook illustrates how to use hyperparameter tuning in Spark. This capability automatically tests a grid of hyperparameters and chooses the best resulting model.

For more information, see the Apache Spark MLlib documentation for VectorAssembler and VectorIndexer.

The first step is to create the VectorAssembler and VectorIndexer steps.

from pyspark.ml.feature import VectorAssembler, VectorIndexer
 
# Remove the target column from the input feature set.
featuresCols = df.columns
featuresCols.remove('cnt')
 
# vectorAssembler combines all feature columns into a single feature vector column, "rawFeatures".
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")
 
# vectorIndexer identifies categorical features and indexes them, and creates a new column "features". 
vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

Next, define the model.

from pyspark.ml.regression import GBTRegressor
 
# The next step is to define the model training stage of the pipeline. 
# The following command defines a GBTRegressor model that takes an input column "features" by default and learns to predict the labels in the "cnt" column. 
gbt = GBTRegressor(labelCol="cnt")
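
The remaining stages described above (hyperparameter tuning with CrossValidator and chaining the stages into a single Pipeline) might be sketched as follows; the grid values and the choice of RMSE as the metric are illustrative assumptions rather than tuned settings.

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Define a small, illustrative grid of hyperparameters to search over.
paramGrid = ParamGridBuilder() \
  .addGrid(gbt.maxDepth, [2, 5]) \
  .addGrid(gbt.maxIter, [10, 100]) \
  .build()

# Evaluate candidate models by root mean squared error (RMSE) on the label column.
evaluator = RegressionEvaluator(metricName="rmse", labelCol=gbt.getLabelCol(), predictionCol=gbt.getPredictionCol())

# CrossValidator fits the GBT model for each parameter combination and keeps the best one.
cv = CrossValidator(estimator=gbt, evaluator=evaluator, estimatorParamMaps=paramGrid)

# Chain the feature-preparation stages and the cross-validated model into one pipeline,
# then fit it on the training set.
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])
pipelineModel = pipeline.fit(train)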