%md # Regression with gradient-boosted trees and MLlib pipelines
This notebook uses a bike-sharing dataset to illustrate MLlib pipelines and the gradient-boosted trees machine learning algorithm. The challenge is to predict the number of bicycle rentals per hour based on the features available in the dataset such as day of the week, weather, season, and so on. Demand prediction is a common problem across businesses; good predictions allow a business or service to optimize inventory and to match supply and demand to make customers happy and maximize profitability.
%md ## Load the dataset
The dataset is from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset) and is provided with Databricks Runtime. The dataset includes information about bicycle rentals from the Capital Bikeshare system in 2011 and 2012.
Load the data using the CSV datasource for Spark, which creates a [Spark DataFrame](http://spark.apache.org/docs/latest/sql-programming-guide.html).
df = spark.read.csv("/databricks-datasets/bikeSharing/data-001/hour.csv", header="true", inferSchema="true")
# The following command caches the DataFrame in memory. This improves performance since subsequent calls to the DataFrame can read from memory instead of re-reading the data from disk.
df.cache()
%md #### Data description
The following columns are included in the dataset:
**Index column**:
* instant: record index
**Feature columns**:
* dteday: date
* season: season (1:spring, 2:summer, 3:fall, 4:winter)
* yr: year (0:2011, 1:2012)
* mnth: month (1 to 12)
* hr: hour (0 to 23)
* holiday: 1 if holiday, 0 otherwise
* weekday: day of the week (0 to 6)
* workingday: 0 if weekend or holiday, 1 otherwise
* weathersit: (1:clear, 2:mist or clouds, 3:light rain or snow, 4:heavy rain or snow)
* temp: normalized temperature in Celsius
* atemp: normalized feeling temperature in Celsius
* hum: normalized humidity
* windspeed: normalized wind speed
**Label columns**:
* casual: count of casual users
* registered: count of registered users
* cnt: count of total rental bikes including both casual and registered
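Because the categorical columns are stored as integer codes, it can help to decode them when reading individual rows. The following is a minimal sketch in plain Python rather than Spark; the dictionary names and the `describe` helper are made up for illustration, while the code-to-label mappings are copied from the list above.

```python
# Code-to-label mappings copied from the column descriptions above.
SEASON = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
YEAR = {0: 2011, 1: 2012}
WEATHERSIT = {
    1: "clear",
    2: "mist or clouds",
    3: "light rain or snow",
    4: "heavy rain or snow",
}


def describe(row):
    """Return a readable summary of one record given as a dict (hypothetical helper)."""
    return (
        f"{SEASON[row['season']]} {YEAR[row['yr']]}, "
        f"hour {row['hr']}, {WEATHERSIT[row['weathersit']]}: "
        f"{row['cnt']} rentals"
    )


# Illustrative record; the values are made up, not read from the dataset.
example = {"season": 1, "yr": 0, "hr": 0, "weathersit": 1, "cnt": 16}
print(describe(example))  # prints: spring 2011, hour 0, clear: 16 rentals
```

The model itself consumes the integer codes directly, so a lookup like this is only meant for inspecting rows by eye.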
%md Call `display()` on a DataFrame to see a sample of the data. The first row shows that 16 people rented bikes between midnight and 1am on January 1, 2011.
display(df)
%md ## Preprocess data
This dataset is well prepared for machine learning algorithms. The numeric input columns (temp, atemp, hum, and windspeed) are normalized, categorical values (season, yr, mnth, hr, holiday, weekday, workingday, weathersit) are converted to indices, and all of the columns except for the date (`dteday`) are numeric.
The goal is to predict the count of bike rentals (the `cnt` column). Reviewing the dataset, you can see that some columns contain duplicate information. For example, the `cnt` column equals the sum of the `casual` and `registered` columns. You should remove the `casual` and `registered` columns from the dataset. The index column `instant` is also not useful as a predictor.
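The redundancy between `cnt`, `casual`, and `registered` is easy to verify before dropping anything. The following is a minimal sketch in plain Python so that it runs without a cluster; the helper name `inconsistent_rows` and the sample rows are hypothetical and are not taken from the dataset.

```python
def inconsistent_rows(rows):
    """Return the rows where cnt does not equal casual + registered."""
    return [r for r in rows if r["cnt"] != r["casual"] + r["registered"]]


# Hypothetical sample rows (illustrative values, not read from hour.csv).
sample = [
    {"casual": 3, "registered": 13, "cnt": 16},
    {"casual": 8, "registered": 32, "cnt": 40},
    {"casual": 5, "registered": 27, "cnt": 30},  # deliberately inconsistent
]

bad = inconsistent_rows(sample)
print(len(bad))  # prints 1: only the deliberately inconsistent row is reported
```

On the Spark DataFrame, the same idea could be expressed as a filter on rows where `cnt` differs from `casual + registered`, followed by a count, run while those columns are still present.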
You can also delete the column `dteday`, because its year, month, and day-of-week information is already included in the other date-related columns `yr`, `mnth`, and `weekday`.
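To see why `dteday` adds little, note that the year, month, and weekday codes can be recomputed from the date. The sketch below uses only the Python standard library; the helper name `date_parts` is hypothetical, the dates are assumed to be ISO-formatted strings, and the weekday convention (0 for Sunday through 6 for Saturday) is an assumption, since the column description only gives the range 0 to 6.

```python
from datetime import date


def date_parts(dteday):
    """Derive (yr, mnth, weekday) codes from an ISO date string such as '2011-01-01'."""
    d = date.fromisoformat(dteday)
    yr = d.year - 2011            # 0 for 2011, 1 for 2012
    mnth = d.month                # 1 to 12
    weekday = d.isoweekday() % 7  # assumed convention: 0 = Sunday ... 6 = Saturday
    return yr, mnth, weekday


print(date_parts("2011-01-01"))  # prints (0, 1, 6): a Saturday in January 2011
```

The day of the month is the one component of the date that cannot be recovered from the remaining columns.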
# Drop the index, the date, and the two partial counts in a single call.
df = df.drop("instant", "dteday", "casual", "registered")
display(df)