Last week we worked on our data transformation skills to figure out a way of adding location to our model, clustering our zipcode data by the residuals from our previous model.
It was important to find a way of including location in our model, as without it our model coefficients could encourage invalid conclusions. Compare the two models below, with and without zipcode.
dat <- readRDS("data/train.rds")
mod1 <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data = dat)
dat <- cbind(dat, residuals = resid(mod1))
zip_group_res <- dat %>%
  group_by(ZipCode) %>%
  summarise(med_price = median(residuals),
            count = n()) %>%
  arrange(med_price) %>%
  mutate(cumul_count = cumsum(count),
         ZipGroup_r = ntile(cumul_count, 5))
dat <- dat %>%
  left_join(select(zip_group_res, ZipCode, ZipGroup_r), by = "ZipCode")
mod2 <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType + as.factor(ZipGroup_r), data = dat)
stargazer(mod1, mod2, type = "html")
Dependent variable: AdjSalePrice (standard errors in parentheses)

|                           | (1)                          | (2)                           |
|---------------------------|------------------------------|-------------------------------|
| SqFtTotLiving             | 224.598*** (4.376)           | 212.723*** (3.905)            |
| SqFtLot                   | -0.103 (0.065)               | 0.449*** (0.059)              |
| Bathrooms                 | -17,597.970*** (4,038.929)   | 4,132.692 (3,616.224)         |
| Bedrooms                  | -49,451.780*** (2,676.447)   | -39,190.020*** (2,389.909)    |
| BldgGrade                 | 109,927.800*** (2,602.351)   | 99,180.840*** (2,332.225)     |
| PropertyTypeSingle Family | -86,434.480*** (17,715.780)  | 19,296.490 (15,847.160)       |
| PropertyTypeTownhouse     | -115,486.600*** (19,305.070) | -77,653.880*** (17,208.400)   |
| as.factor(ZipGroup_r)2    |                              | 54,161.680*** (5,308.167)     |
| as.factor(ZipGroup_r)3    |                              | 118,471.100*** (5,277.623)    |
| as.factor(ZipGroup_r)4    |                              | 175,189.200*** (5,331.375)    |
| as.factor(ZipGroup_r)5    |                              | 335,018.400*** (5,000.209)    |
| Constant                  | -452,803.400*** (23,750.630) | -679,859.100*** (21,661.970)  |
| Observations              | 20,340                       | 20,340                        |
| R²                        | 0.541                        | 0.636                         |
| Adjusted R²               | 0.541                        | 0.636                         |
| Residual Std. Error       | 262,265.800 (df = 20332)     | 233,606.900 (df = 20328)      |
| F Statistic               | 3,430.001*** (df = 7; 20332) | 3,232.831*** (df = 11; 20328) |

Note: *p<0.1; **p<0.05; ***p<0.01
We can see that in the second model, the signs of SqFtLot and Bathrooms have switched from negative to positive. Location is an example of a confounding variable: it has a substantial effect on price, and when we leave it out it can cause our other variables to take on misleading coefficients.
Confounding variables present a similar kind of problem to the one we encountered two weeks ago when dealing just with house size and number of bedrooms/bathrooms. Then, we were faced with the problem of correlated predictors: we saw that when we included both size and number of bedrooms in a property, the coefficient for bedrooms became negative (as is still the case in the models above). This is because larger houses often have more bedrooms, and it is the size of the house which drives the price, not the number of bedrooms. If you consider two houses of the same size, the one with more bedrooms is likely to be less valuable, as more rooms are being squashed into the same space. This is what the negative coefficient for bedrooms is telling us.
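If you want to check the correlated-predictors story for yourselves, here’s a quick sketch (nothing you need for the project): look at how strongly size and bedrooms move together, and compare the bedrooms coefficient with and without size in the model.

cor(dat$SqFtTotLiving, dat$Bedrooms) # how strongly size and bedrooms move together
coef(lm(AdjSalePrice ~ Bedrooms, data = dat))[["Bedrooms"]] # bedrooms on its own
coef(lm(AdjSalePrice ~ Bedrooms + SqFtTotLiving, data = dat))[["Bedrooms"]] # bedrooms once size is held constant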
As we add more variables to our model, the interpretation of the model can become more complex, and our ability to intuitively grasp what is going on inside the model becomes harder. This week, we will look at a few ways of visualising more complex models, using this information to diagnose issues in these models, and possible ways of resolving them.
Today I’ll show you a few different things: don’t worry about using all of them in the housing project; the goal is to round out your skillset and provide you with some tools for the future. We’ll look at visualising models with more than two variables, partial regression (added variable) plots, partial residual plots and polynomial terms, and R’s built-in diagnostic plots for spotting outliers.
A few of you ended last week’s class with visualisations looking a little like this:
ggplot(dat, aes(SqFtTotLiving, AdjSalePrice, group = ZipGroup_r)) +
geom_point(aes(colour = ZipGroup_r)) +
geom_smooth(method = "lm", aes(colour = ZipGroup_r))
## `geom_smooth()` using formula = 'y ~ x'
A common problem when we have more than two continuous variables is how to visualise our model. In the plot above, ZipGroup_r - our zipgroup clustering based on residuals - is added as a continuous variable, and what we’re seeing when ggplot tries to add zipgroup to the model is a sort of two-dimensional rendering of a 3D plane.
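One quick fix, if all you want is a clearer two-dimensional version of this plot, is to map ZipGroup_r as a factor, so that ggplot treats it as five discrete groups rather than a continuous scale. A minimal sketch:

ggplot(dat, aes(SqFtTotLiving, AdjSalePrice, colour = as.factor(ZipGroup_r))) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm") # one fitted line per zipgroup, with a discrete legend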
plot_ly(data = dat, z = ~AdjSalePrice, x = ~SqFtTotLiving, y = ~ZipGroup_r,
color = ~as.factor(ZipGroup_r))
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Here’s a slightly better visualisation using the plotly package. We can add a regression plane to this plot, but the process is a little complex; if you’re interested, this stack overflow page contains the code. Packages like plotly can be useful for data communication, but they require an HTML file to render and can be a bit gimmicky, so use them with care.
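For the curious, here’s a rough sketch of the general idea (my own illustration, not the stack overflow code, and it treats ZipGroup_r as numeric purely so a plane can be drawn): fit a simple two-predictor model, predict over a grid, and overlay the fitted surface.

mod_3d <- lm(AdjSalePrice ~ SqFtTotLiving + ZipGroup_r, data = dat) # zipgroup as numeric, for illustration only
x_seq <- seq(min(dat$SqFtTotLiving), max(dat$SqFtTotLiving), length.out = 30)
y_seq <- seq(min(dat$ZipGroup_r), max(dat$ZipGroup_r), length.out = 30)
grid <- expand.grid(SqFtTotLiving = x_seq, ZipGroup_r = y_seq)
z_mat <- matrix(predict(mod_3d, newdata = grid),
                nrow = length(y_seq), byrow = TRUE) # rows correspond to y values, columns to x values
plot_ly(dat, x = ~SqFtTotLiving, y = ~ZipGroup_r, z = ~AdjSalePrice,
        type = "scatter3d", mode = "markers", marker = list(size = 2)) %>%
  add_surface(x = x_seq, y = y_seq, z = z_mat, showscale = FALSE, inherit = FALSE)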
What if we have more than three dimensions though? We’ll need to take a different approach.
When dealing with higher dimensions, we need to start thinking in terms of our model as a whole. There are a couple of ways of plotting which allow us to see the partial effect of a specific predictor variable while taking into account all the other predictor variables. One of these is a partial regression plot.
We used residuals, i.e. the bits left over from our model’s prediction, to create our zipcode cluster last week. Residuals can also help us to see in which ranges of our data our model predicts well, and in which it doesn’t.
Let’s see what I mean with a quick residual plot of a bivariate regression of property size and sale price.
mod3 <- lm(AdjSalePrice ~ SqFtTotLiving, data = dat) # bivariate regression model
scatter.smooth(dat$SqFtTotLiving, resid(mod3), # plot a smooth line on the scatter plot
               lpars = list(col = "blue", lwd = 3, lty = 3),
               main = "Residual Plot (Sale Price ~ Size)",
               xlab = "Total Living Area (sq.ft.)",
               ylab = "Residuals")
abline(h = 0, col = "red") # plot a horizontal line through zero
Here, I’ve added a smooth line through the residuals to see where my residuals are, on average, above or below my regression line. Take a look at the plot, and see if you can work out where the model is over and under-predicting.
When we extend our model to multiple variables, we can use residuals to visualise the effect of one predictor variable against our outcome variable, taking into account our other predictor variables - i.e., holding their effect constant. A partial regression plot (sometimes also called an added variable plot) uses residuals to partial out the effects of other variables, allowing us to see something a bit like a bivariate scatter plot. The car package allows us to create added variable plots.
par(mfrow = c(1,1)) # one plot at a time (reset the plotting layout)
avPlot(mod2, variable = "SqFtTotLiving") # added variable plot for living area
avPlot(mod2, variable = "BldgGrade") # added variable plot for building grade
With a partial regression plot, the slope of the line is always the same as the coefficient of the predictor variable within the overall regression model; the intercept is always zero (these are residuals we’re plotting!).
Partial regression plots help us to see extreme points, and to determine whether our residuals are evenly distributed across the range of our predictor variable - basically, we can interpret them similarly to the bivariate residual plot we already encountered.
A disadvantage of partial regression plots is that, unlike in our bivariate residual plots, the x-axis no longer shows us the original value of our predictor variable, but rather the residuals from regressing that predictor on all the other predictors.
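If you want to convince yourself that the slope really is the model coefficient, here’s a quick sketch in base R: partial the other predictors out of both the outcome and SqFtTotLiving, regress one set of residuals on the other, and compare the slope with the coefficient from mod2. It also shows why the x-axis of the added variable plot is in residual units.

res_y <- resid(lm(AdjSalePrice ~ SqFtLot + Bathrooms + Bedrooms + BldgGrade +
                    PropertyType + as.factor(ZipGroup_r), data = dat)) # outcome with the other predictors partialled out
res_x <- resid(lm(SqFtTotLiving ~ SqFtLot + Bathrooms + Bedrooms + BldgGrade +
                    PropertyType + as.factor(ZipGroup_r), data = dat)) # SqFtTotLiving with the other predictors partialled out
coef(lm(res_y ~ res_x))[["res_x"]] # slope of the added variable regression...
coef(mod2)[["SqFtTotLiving"]] # ...matches the coefficient in the full model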
By contrast, partial residual plots keep the x axis in terms of the original predictor variable, while on the y axis they plot the residual from the full regression combined with the predicted value from the single predictor. The result is an estimate of the single predictor’s contribution to the outcome.
Let’s see an example using property size as the predictor variable.
terms <- predict(mod2, type = "terms") # extract the individual regression terms from our model for each observation
partial_resid <- resid(mod2) + terms # add the individual regression terms to the residual for each observation
df <- data.frame(SqFtTotLiving = dat[, "SqFtTotLiving"], # create a new data.frame of these vals
                 Terms = terms[, "SqFtTotLiving"],
                 PartialResid = partial_resid[, "SqFtTotLiving"])
ggplot(df, aes(SqFtTotLiving, PartialResid)) +
geom_point(alpha = 0.2) +
geom_smooth() +
geom_line(aes(SqFtTotLiving, Terms), colour = "red")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Our partial residual plot is an estimate of the contribution that SqFtTotLiving adds to the sales price, taking into account our other predictors. What’s the advantage of a partial residual plot compared to a partial regression plot? Here, we can see that our fitted line (in red) isn’t doing a great job compared with the smooth line that ggplot adds. It underestimates the value of small properties, overestimates mid-sized properties, and underestimates very large properties. This suggests we may need to use a non-linear term to model the relationship between AdjSalePrice and SqFtTotLiving. We can’t really see this as clearly in the partial regression plot.
Using a partial residual plot has shown us that the way we are adding SqFtTotLiving to our model may not be the optimal solution. In fact, it looks like a simple non-linear relationship - a quadratic polynomial in the form \(x^2\) - might give us a better fitting line. Here’s one way of doing that in R (note that we must add the original term separately).
mod4 <- lm(AdjSalePrice ~ SqFtTotLiving + I(SqFtTotLiving^2) + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType + as.factor(ZipGroup_r), data = dat)
stargazer(mod1, mod2, mod4, type = "html")
Dependent variable: AdjSalePrice (standard errors in parentheses)

|                           | (1)                          | (2)                           | (3)                          |
|---------------------------|------------------------------|-------------------------------|------------------------------|
| SqFtTotLiving             | 224.598*** (4.376)           | 212.723*** (3.905)            | -152.108*** (6.889)          |
| I(SqFtTotLiving^2)        |                              |                               | 0.058*** (0.001)             |
| SqFtLot                   | -0.103 (0.065)               | 0.449*** (0.059)              | 0.212*** (0.054)             |
| Bathrooms                 | -17,597.970*** (4,038.929)   | 4,132.692 (3,616.224)         | 26,210.840*** (3,335.447)    |
| Bedrooms                  | -49,451.780*** (2,676.447)   | -39,190.020*** (2,389.909)    | -13,410.670*** (2,230.847)   |
| BldgGrade                 | 109,927.800*** (2,602.351)   | 99,180.840*** (2,332.225)     | 113,365.500*** (2,151.049)   |
| PropertyTypeSingle Family | -86,434.480*** (17,715.780)  | 19,296.490 (15,847.160)       | 11,690.610 (14,533.680)      |
| PropertyTypeTownhouse     | -115,486.600*** (19,305.070) | -77,653.880*** (17,208.400)   | -127,005.500*** (15,801.600) |
| as.factor(ZipGroup_r)2    |                              | 54,161.680*** (5,308.167)     | 59,136.150*** (4,868.693)    |
| as.factor(ZipGroup_r)3    |                              | 118,471.100*** (5,277.623)    | 107,470.900*** (4,843.271)   |
| as.factor(ZipGroup_r)4    |                              | 175,189.200*** (5,331.375)    | 154,257.800*** (4,900.960)   |
| as.factor(ZipGroup_r)5    |                              | 335,018.400*** (5,000.209)    | 317,089.700*** (4,594.719)   |
| Constant                  | -452,803.400*** (23,750.630) | -679,859.100*** (21,661.970)  | -443,813.000*** (20,227.450) |
| Observations              | 20,340                       | 20,340                        | 20,340                       |
| R²                        | 0.541                        | 0.636                         | 0.694                        |
| Adjusted R²               | 0.541                        | 0.636                         | 0.694                        |
| Residual Std. Error       | 262,265.800 (df = 20332)     | 233,606.900 (df = 20328)      | 214,237.000 (df = 20327)     |
| F Statistic               | 3,430.001*** (df = 7; 20332) | 3,232.831*** (df = 11; 20328) | 3,843.773*** (df = 12; 20327)|

Note: *p<0.1; **p<0.05; ***p<0.01
And plotting the new regression line:
terms_poly <- predict(mod4, type = "terms") # extract the individual regression terms from our model for each observation
# the contribution of SqFtTotLiving is now split across the linear and squared terms, so we combine them
sqft_terms <- terms_poly[, "SqFtTotLiving"] + terms_poly[, "I(SqFtTotLiving^2)"]
partial_resid_poly <- resid(mod4) + sqft_terms # add the combined terms to the residual for each observation
df_poly <- data.frame(SqFtTotLiving = dat[, "SqFtTotLiving"], # create a new data.frame of these vals
                      Terms = sqft_terms,
                      PartialResid = partial_resid_poly)
ggplot(df_poly, aes(SqFtTotLiving, PartialResid)) +
geom_point(alpha = 0.2) +
geom_smooth() +
geom_line(aes(SqFtTotLiving, Terms), colour = "red")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
As we can see from our new partial residual plot, adding the polynomial term brings our fitted line much closer to the ggplot smooth line. We’ve also improved our \(R^2\) to 0.69, and reduced our residual standard error by almost another twenty thousand dollars.
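A quick side note, in case you prefer a more compact spelling: R’s poly() function can fit the same quadratic without writing the squared term out by hand. This sketch (with raw = TRUE, so the coefficients stay on the original scale) reproduces mod4 exactly; the default orthogonal polynomials would give the same fitted values but differently scaled coefficients.

mod4_poly <- lm(AdjSalePrice ~ poly(SqFtTotLiving, 2, raw = TRUE) + SqFtLot + Bathrooms +
                  Bedrooms + BldgGrade + PropertyType + as.factor(ZipGroup_r), data = dat)
all.equal(fitted(mod4), fitted(mod4_poly)) # identical fitted values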
Partial regression and partial residual plots can be hard to wrap your head around. Try to think of them as aids to interpretation and diagnosis. Another handy tool is R’s built-in plot() function: when you run this on an lm object, it will give you some useful diagnostic plots to help you check for outliers and non-normality. Check the following code -
par(mfrow = c(2, 2)) # we change the graphic device to show 4 plots at once
plot(mod4) # we supply our lm object to plot()
It’s important to set the mfrow = c(2,2) argument to the par() function when plotting lm objects; otherwise R will draw the four plots one after another rather than together on a single panel. These plots give a lot of information, some of which can be hard to interpret, but they can be useful in particular for identifying outliers. Outliers are extreme values in our data which can cause problems with our regression models. Sometimes they’re a result of incorrect data entry; sometimes there’s something else different about them which our available data doesn’t capture. Either way, we may consider dropping some outliers from our data if they are having an extreme effect.
Another way to detect outliers is by examining standardised residuals (each residual divided by an estimate of its standard deviation). We can do this in R as follows:
sresid <- rstandard(mod4) # the rstandard() function extracts standardised residuals
index <- order(sresid) # order the residuals from most negative to most positive
dat[index[1:5], c("AdjSalePrice", "SqFtTotLiving", "SqFtLot", "Bathrooms", "Bedrooms", "BldgGrade")] # the five sales the model most over-predicts
## AdjSalePrice SqFtTotLiving SqFtLot Bathrooms Bedrooms BldgGrade
## 10251 443157 7100 12840 8.0 11 9
## 7464 602936 6380 15021 6.5 6 12
## 10272 494299 6680 8779 6.5 11 9
## 17019 656486 6240 11043 4.5 4 11
## 2124 1299659 6880 871200 5.0 4 12
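If your team does decide that some points are having an outsized influence, one option is to flag high-influence observations and check how much they actually move the fit. This is only a sketch (whether to drop anything is a judgement call you’ll have to defend in your presentation):

cooks <- cooks.distance(mod4) # Cook's distance measures each observation's influence on the fit
influential <- which(cooks > 4 / nrow(dat)) # a common rule-of-thumb cutoff
mod4_trim <- update(mod4, data = dat[-influential, ]) # refit without the flagged observations
summary(mod4)$adj.r.squared
summary(mod4_trim)$adj.r.squared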
We only have today and half of next week’s class to round off the house price project, so we’ll need to focus on creating a final model and preparing a presentation to the King County Assessor explaining what that model does and why it is the best. You’ll need to work effectively as a team to get all this done in time, so take a minute or two before you start work to delegate tasks.
The remaining work plan is as follows. By the end of today’s class, you’ll need to fill out the model_template.R script, located in the github code directory, with your team’s final model: it should include any code for data transformation, as well as code for running the regression itself. You should upload your completed file to the discussion board on Blackboard by the end of today’s class at 14:50. I’ll then run your scripts myself on the test set, and reveal the results at the end of next week’s class.
As well as the final model, you also need to have your presentations ready by half-way through next week’s class (14:00), so that we have time for everyone to present. Recall that you need to explain and justify your final model selection in a 5 minute presentation, which should include a short handout (to be shown on screen) of no more than two sides of A4. Remember: your handout and your presentation are both exercises in data communication, so your visualisations should be informative and understandable. Good luck!