Pop. vs. Price LR (Python)


Build a linear regression model to predict a city's median home price from its population

Load and preprocess the data

This notebook uses an example dataset of housing prices in different cities. The dataset is included with Databricks. The goal is to build a simple model that predicts the median house price in a city from its population.

# Use the Spark CSV datasource with options specifying:
#  - First line of file is a header
#  - Automatically infer the schema of the data
data = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")
data.cache()  # Cache data for faster reuse
data.count()
Out[1]: 294
display(data)
 
2014 rank | City          | State      | State Code | 2014 Population estimate | 2015 median sales price
--------- | ------------- | ---------- | ---------- | ------------------------ | -----------------------
101       | Birmingham    | Alabama    | AL         | 212247                   | 162.9
125       | Huntsville    | Alabama    | AL         | 188226                   | 157.7
122       | Mobile        | Alabama    | AL         | 194675                   | 122.5
114       | Montgomery    | Alabama    | AL         | 200481                   | 129
64        | Anchorage[19] | Alaska     | AK         | 301010                   | null
78        | Chandler      | Arizona    | AZ         | 254276                   | null
86        | Gilbert[20]   | Arizona    | AZ         | 239277                   | null
88        | Glendale      | Arizona    | AZ         | 237517                   | null
38        | Mesa          | Arizona    | AZ         | 464704                   | null
148       | Peoria        | Arizona    | AZ         | 166934                   | null
6         | Phoenix       | Arizona    | AZ         | 1537058                  | 206.1
95        | Scottsdale    | Arizona    | AZ         | 230512                   | null
215       | Surprise      | Arizona    | AZ         | 126275                   | null
142       | Tempe         | Arizona    | AZ         | 172816                   | null
33        | Tucson        | Arizona    | AZ         | 527972                   | 178.1
119       | Little Rock   | Arkansas   | AR         | 197706                   | 131.8
56        | Anaheim       | California | CA         | 346997                   | 685.7
261       | Antioch       | California | CA         | 108930                   | null

Showing 18 of 294 rows.
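
To confirm that inferSchema parsed the numeric columns as numbers rather than strings, you can print the schema (a quick sanity check, not part of the original notebook):

data.printSchema()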

# Drop rows with missing values
data = data.dropna() 
data.count()
Out[3]: 109
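
dropna() removed 185 of the 294 rows, so most cities are missing at least one value. To see which columns carry the nulls before dropping anything, one option (an added inspection step, not in the original notebook) is to count the nulls per column:

from pyspark.sql import functions as F

# Count the null entries in each column to see which fields are incomplete.
data.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in data.columns]).show()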

Some of the column names contain spaces, which cause problems when the columns are referenced in PySpark SQL expressions. Rename the columns to replace spaces with underscores and shorten the names.

from pyspark.sql.functions import col
 
exprs = [col(column).alias(column.replace(' ', '_')) for column in data.columns]
 
vdata = data.select(*exprs).selectExpr("2014_Population_estimate as population", "2015_median_sales_price as label")
display(vdata)
 
population | label
---------- | -----
212247     | 162.9
188226     | 157.7
194675     | 122.5
200481     | 129
1537058    | 206.1
527972     | 178.1
197706     | 131.8
346997     | 685.7
3928864    | 434.7
319504     | 281
485199     | 275.8
1381069    | 510.3
852469     | 748.3
1015785    | 900
105112     | 442.2
445830     | 220.1
663862     | 338.1
147612     | 342.7
124705     | 202.6
130282     | 205.1
658893     | 367.8
169854     | 208
128460     | 169.9
853382     | 182.5
102346     | 136
430332     | 269.1
105838     | 145

Showing 27 of 109 rows.

Visualize the data

To get a feel for the data, you can plot it using matplotlib.

import pandas as pd
import matplotlib.pyplot as plt
 
# Convert the Spark DataFrame to pandas for plotting.
pandas_vdata = vdata.toPandas()
 
x = pandas_vdata.population
y = pandas_vdata.label
 
plt.style.use('classic')
fig, ax = plt.subplots()
ax.set(xscale="log", yscale="log")  # log-log axes spread out the wide range of populations and prices
plt.xlim(1.0e5, 1.0e7)
plt.ylim(5.0e1, 1.0e3)
ax.scatter(x, y, c="blue")
ax.set(xlabel="population", ylabel="price")
 

Prepare the data for use with PySpark

PySpark machine learning algorithms require a single features column and a label column. Each row in the features column contains a vector of data points corresponding to the features used for prediction, and the label column contains the output label for each data point.

Use the VectorAssembler method to create the features column from the input data.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
 
stages = []
assembler = VectorAssembler(inputCols=["population"], outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(vdata)
dataset = pipelineModel.transform(vdata)
# Keep relevant columns
selectedcols = ["features", "label"]
dataset.select(selectedcols).show()
 
+-----------+-----+
|   features|label|
+-----------+-----+
| [212247.0]|162.9|
| [188226.0]|157.7|
| [194675.0]|122.5|
| [200481.0]|129.0|
|[1537058.0]|206.1|
| [527972.0]|178.1|
| [197706.0]|131.8|
| [346997.0]|685.7|
|[3928864.0]|434.7|
| [319504.0]|281.0|
| [485199.0]|275.8|
|[1381069.0]|510.3|
| [852469.0]|748.3|
|[1015785.0]|900.0|
| [105112.0]|442.2|
| [445830.0]|220.1|
| [663862.0]|338.1|
| [147612.0]|342.7|
| [124705.0]|202.6|
| [130282.0]|205.1|
+-----------+-----+
only showing top 20 rows
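
A Pipeline with a single stage is used here so that more stages can be added later. For this one-stage case, applying the assembler directly is equivalent (an alternative sketch, not in the original notebook):

dataset = assembler.transform(vdata)  # same result as the one-stage pipeline above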

Build the linear regression model

Goal

  • Predict y = 2015 median housing price
  • Using feature x = 2014 population estimate

First, import the LinearRegression class and instantiate a linear regression model.

# Import LinearRegression class
from pyspark.ml.regression import LinearRegression
# Create a linear regression object
lr = LinearRegression()

This notebook creates and compares two models that use different values for the regularization parameter.

# Fit 2 models, using different regularization parameters
modelA = lr.fit(dataset, {lr.regParam:0.0})
modelB = lr.fit(dataset, {lr.regParam:100.0})
print(">>>> ModelA intercept: %r, coefficient: %r" % (modelA.intercept, modelA.coefficients[0]))
print(">>>> ModelB intercept: %r, coefficient: %r" % (modelB.intercept, modelB.coefficients[0]))
>>>> ModelA intercept: 191.29427575139394, coefficient: 3.779789682338248e-05
>>>> ModelB intercept: 199.85112564667153, coefficient: 2.1603499483717156e-05
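
With the default elasticNetParam of 0, regParam acts as a pure L2 (ridge) penalty weight. As a reference (a sketch of the elastic-net objective Spark ML documents, up to the exact loss scaling used internally; here \(\alpha\) = elasticNetParam and \(\lambda\) = regParam):

\[
\min_{w,\,b}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(w \cdot x_i + b - y_i\bigr)^2 \;+\; \lambda\Bigl(\alpha\,\lVert w\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert w\rVert_2^2\Bigr)
\]

So modelA (regParam = 0.0) is an ordinary least-squares fit, while modelB (regParam = 100.0) shrinks the coefficient toward zero, which matches the smaller coefficient printed above.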

Make predictions

Use the transform() method on the model to generate predictions. The following code takes the first model (modelA) and creates a new DataFrame (predictionsA) containing both the label (original sales price) and the prediction (predicted sales price) based on the features (population).

predictionsA = modelA.transform(dataset)
predictionsA.show()
+----------+-----+-----------+------------------+
|population|label|   features|        prediction|
+----------+-----+-----------+------------------+
|    212247|162.9| [212247.0]| 199.3167659584664|
|    188226|157.7| [188226.0]|198.40882267887193|
|    194675|122.5| [194675.0]|198.65258131548592|
|    200481|129.0| [200481.0]|198.87203590444247|
|   1537058|206.1|[1537058.0]|249.39183544694856|
|    527972|178.1| [527972.0]|211.25050693302884|
|    197706|131.8| [197706.0]| 198.7671467407576|
|    346997|685.7| [346997.0]| 204.4100325554172|
|   3928864|434.7|[3928864.0]|339.79707185649573|
|    319504|281.0| [319504.0]|203.37085497805194|
|    485199|275.8| [485199.0]|209.63377749220228|
|   1381069|510.3|[1381069.0]|243.49577931936597|
|    852469|748.3| [852469.0]|223.51581105852597|
|   1015785|900.0|[1015785.0]| 229.6888123761335|
|    105112|442.2| [105112.0]|195.26728828229332|
|    445830|220.1| [445830.0]|208.14571209216254|
|    663862|338.1| [663862.0]| 216.3868631323583|
|    147612|342.7| [147612.0]|196.87369889728708|
|    124705|202.6| [124705.0]|196.00786247475386|
|    130282|205.1| [130282.0]|196.21866134533786|
+----------+-----+-----------+------------------+
only showing top 20 rows
predictionsB = modelB.transform(dataset)
predictionsB.show()
+----------+-----+-----------+------------------+
|population|label|   features|        prediction|
+----------+-----+-----------+------------------+
|    212247|162.9| [212247.0]|204.43640360159205|
|    188226|157.7| [188226.0]|203.91746594049368|
|    194675|122.5| [194675.0]|204.05678690866418|
|    200481|129.0| [200481.0]|204.18221682666663|
|   1537058|206.1|[1537058.0]|233.05695735611485|
|    527972|178.1| [527972.0]|211.25716847608865|
|    197706|131.8| [197706.0]|204.12226711559933|
|    346997|685.7| [346997.0]|207.34747515702293|
|   3928864|434.7|[3928864.0]|284.72833704226645|
|    319504|281.0| [319504.0]| 206.7535301457171|
|    485199|275.8| [485199.0]| 210.3331219926716|
|   1381069|510.3|[1381069.0]| 229.6870490751493|
|    852469|748.3| [852469.0]| 218.2674392480564|
|   1015785|900.0|[1015785.0]|221.79563636973916|
|    105112|442.2| [105112.0]|  202.121912684404|
|    445830|220.1| [445830.0]|209.48261382149715|
|    663862|338.1| [663862.0]|214.19286802093097|
|    147612|342.7| [147612.0]|  203.040061412462|
|    124705|202.6| [124705.0]| 202.5451900497885|
|    130282|205.1| [130282.0]|202.66567276640916|
+----------+-----+-----------+------------------+
only showing top 20 rows

Evaluate the model

You can evaluate the model's performance by calculating the root mean squared error (RMSE) between the value predicted by the model (in the prediction column) and the actual value (in the label column). Use the PySpark RegressionEvaluator. In this example, modelA performed slightly better.

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse")
RMSE_modelA = evaluator.evaluate(predictionsA)
RMSE_modelB = evaluator.evaluate(predictionsB)
print("ModelA: Root Mean Squared Error = " + str(RMSE_modelA))
print("ModelB: Root Mean Squared Error = " + str(RMSE_modelB))
ModelA: Root Mean Squared Error = 128.60202684284758
ModelB: Root Mean Squared Error = 129.49630019270606
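
As a sanity check, the same RMSE can be computed directly from the prediction and label columns (a minimal sketch using pyspark.sql.functions, not part of the original notebook):

from pyspark.sql import functions as F

# Root mean squared error of modelA, computed by hand.
predictionsA.select(F.sqrt(F.avg((F.col("prediction") - F.col("label")) ** 2)).alias("rmse")).show()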

Plot residuals versus fitted values

Residual analysis is an important step in evaluating a model's performance. Ideally, the model's residuals (the actual value minus the predicted value) should be small and symmetric around 0. In this case there are some very large residuals and poor symmetry around 0. This is not surprising; housing prices depend on many more variables than a city's population.

display(modelA, dataset)
(Plot: residuals vs. fitted values for modelA.)
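
display(model, dataset) is a Databricks-specific shortcut. On any Spark cluster, an equivalent plot can be built by hand (a sketch under that assumption, not part of the original notebook):

from pyspark.sql import functions as F

# Compute residuals (label - prediction) for modelA and plot them against the fitted values.
res_pdf = (predictionsA
           .withColumn("residual", F.col("label") - F.col("prediction"))
           .select("prediction", "residual")
           .toPandas())

fig, ax = plt.subplots()
ax.scatter(res_pdf.prediction, res_pdf.residual, c="blue")
ax.axhline(0, color="gray")  # residuals should scatter symmetrically around this line
ax.set(xlabel="fitted values", ylabel="residuals")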

Linear regression plots

To plot the predicted values and actual values against population, build a single DataFrame with columns for the population, the price, and the prediction from each model, then convert it to pandas. The Koalas package allows you to use the pandas DataFrame API to access data in Apache Spark.

import databricks.koalas as ks
 
# Combine the predictions made by each model into a Koalas DataFrame.
predA = predictionsA.select('prediction')
predB = predictionsB.select('prediction')
 
kdfA = predA.to_koalas().rename(columns={"prediction": "pred_ModelA"})
kdfB = predB.to_koalas().rename(columns={"prediction": "pred_ModelB"})  
 
kdfpreds = ks.concat([kdfA, kdfB], axis = 1)
 
# Create a new Koalas DataFrame containing the original population and price data.
selectedcols = ["population", "label"]
 
kdf = dataset.select(selectedcols).to_koalas()
 
# Combine the two DataFrames into one that contains the original data and the predictions from each model.
kdf2 = ks.concat([kdf, kdfpreds], axis = 1)
display(kdf2)
 
 
population | label | pred_ModelA        | pred_ModelB
---------- | ----- | ------------------ | ------------------
105838     | 145   | 195.2947295553871  | 202.1375968250292
358699     | 156   | 204.8523435440444  | 207.6002793079814
209983     | 118.9 | 199.23119152005827 | 204.3874932787609
130282     | 205.1 | 196.21866134533786 | 202.66567276640916
470800     | 153   | 209.0895255758424  | 210.02205320360557
212247     | 162.9 | 199.3167659584664  | 204.43640360159205
128460     | 169.9 | 196.14979357732565 | 202.62631119034984
346997     | 685.7 | 204.4100325554172  | 207.34747515702293
389521     | 105.9 | 206.0173503199347  | 208.26614236906852
115828     | 105.2 | 195.67233054465268 | 202.35341578487152
183016     | 226.9 | 198.2118956364221  | 203.8049117081835
197254     | 144.4 | 198.75006209139343 | 204.11250233383268
446599     | 148.1 | 208.17477867481972 | 209.49922691260014
216282     | 178.7 | 199.46928047214874 | 204.52357372200885
310797     | 141.2 | 203.04174869041074 | 206.56542847571237
161637     | 193.9 | 197.403814400235   | 203.34305049272112
350399     | 699.3 | 204.53862100041033 | 207.42097026226654
129195     | 156.9 | 196.17757503149085 | 202.64218976247037

Showing 18 of 109 rows.
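
In Spark 3.2 and later, Koalas ships with Spark itself as pyspark.pandas, so the same steps can be written without the separate databricks.koalas package (an equivalent sketch, assuming Spark 3.2+):

import pyspark.pandas as ps

# Same combination of original data and predictions, using the built-in pandas API on Spark.
psdfA = predictionsA.select("prediction").pandas_api().rename(columns={"prediction": "pred_ModelA"})
psdfB = predictionsB.select("prediction").pandas_api().rename(columns={"prediction": "pred_ModelB"})
psdf = ps.concat([dataset.select("population", "label").pandas_api(),
                  ps.concat([psdfA, psdfB], axis=1)], axis=1)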

Convert to pandas DataFrame and display the result

pydf = kdf2.to_pandas()
pydf
Out[14]: (the same 109 rows, now as a pandas DataFrame)

Create a scatterplot of the data and the predicted values from the two models

This code uses the matplotlib library imported earlier and also imports the seaborn package.

import seaborn as sns
f, ax = plt.subplots()
ax.set(xscale="log", yscale="log")
plt.xlim(1.0e5, 1.0e7)
plt.ylim(5.0e1, 1.0e3)
 
ax.scatter(x, y, c="blue")
 
sns.regplot(x="population",
            y="pred_ModelB", 
            data=pydf,
            color='g')
 
sns.regplot(x="population",
            y="pred_ModelA", 
            data=pydf,
            color='r')
 
ax.set(ylabel = "price")
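
Note that sns.regplot fits its own regression line through each model's predictions. Since the predictions already lie on each model's line, an alternative (a sketch, not part of the original notebook) is to draw the prediction lines directly:

# Sort by population so the line segments connect in order.
pydf_sorted = pydf.sort_values("population")
ax.plot(pydf_sorted.population, pydf_sorted.pred_ModelA, c="r", label="ModelA")
ax.plot(pydf_sorted.population, pydf_sorted.pred_ModelB, c="g", label="ModelB")
ax.legend()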