---
title: "House Prices: Advanced Regression Techniques"
author: "Aksel Kaastrup Rasmussen"
date: "20/12/2019"
output: bookdown::pdf_document2
toc: False
subtitle: MAP 553-Regression
references:
- id: erik2017
title: "House prices: Lasso, XGBoost, and a detailed EDA"
author:
- family: "Bruin"
given: "Erik"
URL: 'https://www.kaggle.com/erikbruin/house-prices-lasso-xgboost-and-a-detailed-eda'
publisher: "Kaggle"
issued:
year: "2017"
- id: breiman2001
title: "Random Forests"
author:
- family: "Breiman"
given: "Leo"
URL: 'https://doi.org/10.1023/A:1010933404324'
journal: "Machine Learning"
issued:
year: "2001"
month: 10
- id: karim2019
title: "Visual Diagnostics and Model Validation"
author:
- family: "Lounici"
given: "Karim"
journal: "Note on Moodle"
issued:
year: "2019"
month: 10
- id: chen2016xgboost
  title: "XGBoost: A Scalable Tree Boosting System"
  author:
  - family: "Chen"
    given: "Tianqi"
  - family: "Guestrin"
    given: "Carlos"
  container-title: "Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining"
  page: "785-794"
  publisher: "ACM"
  issued:
    year: 2016
---
# Introduction.
We are given a dataset containing information on a large number of residential homes in Ames, Iowa. Our goal is not only to estimate the sale price of the homes as accurately as possible, but also to gain insight into the given variables: what really gets people's wallet out of their pocket when considering buying a house? We face the challenge of dimensionality, since we have a large dataset with many variables, and the problem of choosing a model whose assumptions the data actually satisfies. The question is rather how to make the data satisfy the assumptions without compromising the information. Our research hypothesis is that a linear model combined with careful data modelling can describe the sale price of the homes reasonably well. Furthermore we expect more advanced models to perform better, although simple models are easier to interpret.
# Exploratory Data Analysis/Initial Modeling.
```{r, libraries, message=FALSE, warning=FALSE, echo=FALSE}
## Packages load, data load, cleaning and factorization of categorical values
knitr::opts_chunk$set(echo = TRUE)
options(tinytex.verbose = TRUE)
# Load packages
library(knitr)
library(ggplot2)
library(plyr)
library(dplyr)
library(corrplot)
library(caret)
library(gridExtra)
library(scales)
library(Rmisc)
library(ggrepel)
library(randomForest)
library(psych)
library(multcomp)
library(stringr)
library(broom)
library(MASS)
library(leaps)
library(glmnet)
library(xgboost)
library(plotmo)
library(kableExtra)
library(pscl)
library(glmulti)
# Load data
load(file = "DataProject.RData")
testPrice = test$SalePrice
# Getting rid of id's and saving test labels
test_labels <- test$Id
test$Id <- NULL
train$Id <- NULL
# Combining train and test in full dataset
test$SalePrice <- NA
full <- rbind(train, test)
```
Our training and test datasets consist of 1095 and 365 observations respectively, each with 67 features. It is clear that both cleaning and basic handling of missing values have already taken place, and we will not explore such techniques further here. A great deal of the ordinal variables have already been handled with ordinal encoding and normalization, and we will continue this work.
The dataset consists of numerical, nominal and ordinal variables, and right from the start we label encode the ordinal variables into numbers. This arguably applies to \textit{Street} and \textit{PavedDrive}, since their levels have an order. With this ordinal encoding we ensure the variables keep their ordered structure. In addition there are many categorical \textit{quality} variables, which are indeed ordinal.
We have chosen to remove the variable \textit{Utilities}, since it is constant across all training observations, and finally we have split the data into categorical and numerical variables, 36 and 30 in number respectively. As a start of the exploratory analysis, let us look into the response variable.
```{r early encoding, message=FALSE, warning=FALSE, echo=FALSE}
# Ordinal encoding for Street and PavedDrive
full$Street<-as.integer(recode(full$Street, 'Grvl'=0, 'Pave'=1))
full$PavedDrive<-as.integer(recode(full$PavedDrive, 'N'=0, 'P'=1, 'Y'=2))
## Ordinal encoding for quality and condition variables
# LotShape is ordinal
full$LotShape<-as.integer(recode(full$LotShape,'IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3))
# GarageFinish is ordinal
full$GarageFinish<-as.integer(recode(full$GarageFinish, 'None'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3))
# GarageQual is ordinal
full$GarageQual<-as.integer(recode(full$GarageQual, None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5))
# GarageCond is ordinal
full$GarageCond<-as.integer(recode(full$GarageCond, None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5))
# BasementQuality is ordinal
full$BsmtQual<-as.integer(recode(full$BsmtQual, None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5))
# BasementCondition is ordinal
full$BsmtCond<-as.integer(recode(full$BsmtCond, None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5))
# BasementExposure is ordinal
full$BsmtExposure<-as.integer(recode(full$BsmtExposure, 'None'=0, 'No'=1, 'Mn'=2, 'Av'=3, 'Gd'=4))
# Basement Finished Type 1 and 2 are ordinal
full$BsmtFinType1<-as.integer(recode(full$BsmtFinType1, 'None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6))
full$BsmtFinType2<-as.integer(recode(full$BsmtFinType2, 'None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6))
# Masonry veneer type seems to have ordinality
full$MasVnrType<-as.integer(recode(full$MasVnrType, 'None'=0, 'BrkCmn'=0, 'BrkFace'=1, 'Stone'=2))
# KitchenQual is ordinal
full$KitchenQual<-as.integer(recode(full$KitchenQual, None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5))
# Functionality is ordinal
# From worst to best
full$Functional <- as.integer(recode(full$Functional,'Sal'=0, 'Sev'=1, 'Maj2'=2, 'Maj1'=3, 'Mod'=4, 'Min2'=5, 'Min1'=6, 'Typ'=7))
#ExteriorQuality is ordinal
full$ExterQual<-as.integer(recode(full$ExterQual, None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5))
#ExteriorCondition is ordinal
full$ExterCond<-as.integer(recode(full$ExterCond, None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5))
# Factorize and rename MSSubClass
full$MSSubClass <- as.factor(full$MSSubClass)
full$MSSubClass<-recode(full$MSSubClass, '20'='1 story 1946+', '30'='1 story 1945-', '40'='1 story unf attic', '45'='1,5 story unf', '50'='1,5 story fin', '60'='2 story 1946+', '70'='2 story 1945-', '75'='2,5 story all ages', '80'='split/multi level', '85'='split foyer', '90'='duplex all style/age', '120'='1 story PUD 1946+', '150'='1,5 story PUD all', '160'='2 story PUD 1946+', '180'='PUD multilevel', '190'='2 family conversion')
# Remove Utilities
full$Utilities <- NULL
# Splitting data into categorical and numerical variables
varCat <- which(sapply(full, is.factor))
varNum <- which(sapply(full, is.numeric))
varNumNames <- names(varNum)
full_varNum <- full[, varNum]
d1 <- length(varCat)
d2 <- length(varNum)
```
## The response variable
We notice that the response variable \textit{SalePrice} is right skewed. This is expected, since far more people can afford a relatively cheap house than an expensive one.
```{r fig1, fig.height = 1.8, fig.width = 5, fig.align = "center", message=FALSE, warning=FALSE, echo=FALSE}
ggplot(data=full[!is.na(full$SalePrice),], aes(x=SalePrice)) +
geom_histogram(fill="steelblue2", binwidth = 10000) +
scale_x_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
```
This skewness justifies a Box-Cox transform in the early modelling phase, bringing the response closer to the normality assumption of the model. Let us now look at the response variable's relationship with the numerical variables. Consider figure \@ref(fig:fig2), where we show all numerical variables whose correlation with \textit{SalePrice} exceeds $\frac{1}{2}$ in absolute value.
```{r fig2, fig.height = 4.5, fig.width = 7, fig.align = "center", fig.cap = "Correlation plot of numerical variables of high correlation with SalePrice", message=FALSE, warning=FALSE, echo=FALSE}
# Plot correlations of variables with high correlation to SalePrice
cor_varNum <- cor(full_varNum, use="pairwise.complete.obs")
# Sort
corSort <- as.matrix(sort(cor_varNum[,'SalePrice'], decreasing = TRUE))
# We choose higher than 1/2
cor_absHigh <- names(which(apply(corSort, 1, function(x) abs(x)>0.5)))
cor_varNum <- cor_varNum[cor_absHigh, cor_absHigh]
corrplot.mixed(cor_varNum, tl.col="black", tl.pos = "lt", tl.cex = 0.7,cl.cex = .7, number.cex=.7)
```
It turns out that the overall quality of the house and \textit{GrLivArea}, the above ground living area, are the variables most correlated with \textit{SalePrice}. We also see high intercorrelation between some of these variables of interest. Not surprisingly, features like \textit{GarageCars} and \textit{GarageArea}, or \textit{YearBuilt} and \textit{GarageYrBlt}, are highly correlated. In general we observe that variables concerning the same part of the house (i.e. garage, basement and so on) are typically highly correlated. This motivates a model reduction in the modeling phase, where variables that are highly correlated with a more important variable can be removed.
### Relationship with correlated features
Considering figure \@ref(fig:fig3), there is a growing trend in \textit{SalePrice} as the overall quality increases. There does not seem to be any obvious outlier, perhaps apart from the somewhat expensive house with overall quality 4. Considering figure \@ref(fig:fig4), there is no strong trend, but one could argue that \textit{SalePrice} is a bit higher for newer houses than for old houses.
```{r fig3, fig.height = 1.7, fig.width = 6, fig.align = "center", fig.cap = "Boxplot of SalePrice divided by their overall quality ranging from 1 to 10.", message=FALSE, warning=FALSE, echo=FALSE}
# Create boxplots of SalePrice divided into overall quality
ggplot(data=full[!is.na(full$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
geom_boxplot(col='steelblue2') + labs(x='Overall Quality from 1 to 10') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) + theme(axis.text.x = element_blank())
```
```{r fig4, fig.height = 1.7, fig.width = 7, fig.align = "center", fig.cap = "Boxplot of SalePrice divided by the years the houses are built from 1872 to 2010", message=FALSE, warning=FALSE, echo=FALSE}
#Create boxplots of SalePrice diveded into YearBuilt
ggplot(data=full[!is.na(full$SalePrice),], aes(x=factor(YearBuilt), y=SalePrice))+
geom_boxplot(col='steelblue2') + labs(x='Year built from 1872 to 2010') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) + theme(axis.text.x = element_blank())
```
Considering the relationship between \textit{SalePrice} and \textit{GrLivArea}, the second most correlated variable, we notice in figure \@ref(fig:fig5) a fairly strong positive trend in \textit{SalePrice} as a function of \textit{GrLivArea}. One could suggest that houses 596 and 199 are outliers, since they are very large properties with a relatively small sale price. Taking a closer look, these houses have the maximum normalized overall quality score of 2.65, which gives further reason to classify them as outliers. We will keep these houses in mind when removing outliers.
```{r fig5, fig.height = 1.7, fig.width = 5, fig.align = "center", fig.cap = "Scatterplot with trend of SalePrice as function of GrLivArea", message=FALSE, warning=FALSE, echo=FALSE}
ggplot(data=full[!is.na(full$SalePrice),], aes(x=GrLivArea, y=SalePrice))+
geom_point(col='steelblue2') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_text_repel(aes(label = ifelse(full$GrLivArea[!is.na(full$SalePrice)]>3.5, rownames(full), '')))
```
## Important predictors
Let us take a look at variables of high importance for the prediction. One could do this by building a multiple linear regression model on all the variables and inspecting the significance of each one, but for now we would like to get a feel for the data without building a huge model. To this end we use random forests, as inspired by [@erik2017]; a classic reference for random forests is [@breiman2001]. The technique fits a random forest and measures the increase in mean squared error when each feature is permuted across the dataset. It is implemented in the R library \texttt{randomForest}.
```{r fig6, fig.height = 1.7, fig.width = 5, fig.align = "center", fig.cap = "Most important features found by random forest technique", message=FALSE, warning=FALSE, echo=FALSE}
set.seed(2019)
d1 = dim(train)[1]
d2 = dim(train)[2]
rf <- randomForest(x=train[1:d1,-d2], y=train$SalePrice[1:d1], ntree=100,importance=TRUE)
rfImp <- importance(rf)
rfImpDf <- data.frame(Variables = row.names(rfImp), MSE = rfImp[,1])
rfImpDf <- rfImpDf[order(rfImpDf$MSE, decreasing = TRUE),]
ggplot(rfImpDf[1:10,], aes(x=reorder(Variables, MSE), y=MSE, fill=MSE)) + geom_bar(stat = 'identity') + labs(x = 'Variables', y= '% increase MSE when variable is permuted') + coord_flip() + theme(legend.position="none")
```
In figure \@ref(fig:fig6) we see the 10 most important variables as ranked by the random forest model. We have already seen the relationship of some of these variables with the response variable. Indeed it comes as no surprise that the above ground living area and the overall quality are important, since houses scoring high on these variables are expected to be expensive. Let us now take a look at the distributions of the top 10 predictors, four of which are shown in figure \@ref(fig:fig7). Some variables, like \textit{GrLivArea} or \textit{1stFlrSF}, appear to have a roughly Gaussian distribution, while the distributions of others, like \textit{OverallQual}, are skewed, and yet others are harder to characterize since they have already undergone some transformation. We will keep this in mind when standardizing and log-transforming in the modelling phase.
```{r fig7, fig.height = 2.2, fig.width = 4, fig.align = "center", fig.cap = "Distributions of important variables", message=FALSE, warning=FALSE, echo=FALSE}
# Plotting distributions of most important variables found with random forest
plot1 <- ggplot(data=full, aes(x=GrLivArea)) +
geom_density() + labs(x='Square feet living area')
plot2 <- ggplot(data=full, aes(x=as.factor(Neighborhood))) +
geom_histogram(stat='count') + labs(x='Neighborhood') + theme(axis.text.x = element_blank())
plot3 <- ggplot(data= full, aes(x=as.factor(OverallQual))) +
geom_histogram(stat='count') + labs(x='Overall quality') + theme(axis.text.x = element_blank())
plot4 <- ggplot(data= full, aes(x=as.factor(MSSubClass))) +
geom_histogram(stat='count') + labs(x='MSSubClass') + theme(axis.text.x = element_blank())
# plot5 <- ggplot(data= full, aes(x=GarageCars)) +
# geom_histogram(stat='count') + labs(x='Size of garage in car capacity') + theme(axis.text.x = element_blank())
#
# plot6 <- ggplot(data= full, aes(x=TotalBsmtSF)) +
# geom_density() + labs(x='Square feet basement')
#
# plot7 <- ggplot(data= full, aes(x=`1stFlrSF`)) +
# geom_density() + labs(x='Square feet 1st floor')
#
# plot8 <- ggplot(data= full, aes(x=BsmtFinSF1)) +
# geom_density() + labs(x='Type 1 finished square feet')
#
# plot9 <- ggplot(data= full, aes(x=BsmtFinType1)) +
# geom_histogram(stat='count') + labs(x='Rating of basement finished area')
#
# plot10 <- ggplot(data= full, aes(x=`2ndFlrSF`)) +
# geom_density() + labs(x='Square feet second floor')
# Livearea, Neighborhood, Overall, MSS, GarageCars, TotalBsmtSF, 1stflrSF, BSMTfinSF1, BSMTfinTyp1, 2ndflrSF
layout <- matrix(1:4,2,2,byrow=TRUE)
multiplot(plot1,plot2,plot3,plot4, layout=layout)
```
One idea we could explore further is to merge levels of the neighborhood variable whose sale prices are similar enough. In this way we could reduce the dimensionality of this categorical variable.
# Modeling and Diagnostics
Let us start the modelling by preparing our dataset: we one-hot encode all categorical variables and standardize the numerical ones. The combined data then amounts to 1460 observations of 199 variables.
```{r feature engineering, message=FALSE, warning=FALSE, echo=FALSE}
# Standardize numerical variables except for SalePrice
varNumNames <- varNumNames[!(varNumNames %in% 'SalePrice')]
varNum <- full[, names(full) %in% varNumNames]
preNum <- preProcess(varNum, method=c("center", "scale"))
varNumStd <- predict(preNum, varNum)
# One-hot encode nominal variables into 'dummy' variables
varFactors <- full[, !(names(full) %in% varNumNames) & !(names(full) %in% 'SalePrice')]
varDummy <- as.data.frame(model.matrix(~.-1, varFactors))
varFull <- cbind(varNumStd, varDummy)
# Training data
dimTrain = dim(train)[1]
varTrain = cbind(train$SalePrice,varFull[1:dimTrain,])
names(varTrain)[1] = "SalePrice"
```
## A multiple linear regression model
We have fitted a multiple linear regression model to the training data with $p=199$ variables and $n = 1095$ observations,
$$Y = X^T\beta^\ast + \varepsilon,$$
where $\varepsilon \sim N(0,\sigma^2)$, $(X,Y) \in \mathbb{R}^{p} \times \mathbb{R}$ and $\beta^\ast\in \mathbb{R}^p$ is the unknown regression vector. We assume we observe an i.i.d. sample $(X_i, Y_i)_{i=1}^{n}$ of copies of $(X,Y)$. We will explore this assumption further, as well as the various assumptions regarding the residuals, see \textbf{P1} to \textbf{P4} in [@karim2019]. Using the design matrix of all our observations $\mathbb{X}=\left(X_{1}, \ldots, X_{n}\right)^{\top}=\left(X_{i, j}\right)_{1\le i\le n,\, 1\le j\le p}$ and the observed responses $\mathbb{Y} = \left(Y_1,Y_2,\ldots, Y_n \right )^T$, we estimate $\beta^\ast$ by the following minimization problem
$$\widehat{\boldsymbol{\beta}}=\arg \min _{\boldsymbol{\beta}\in \mathbb{R}^p}\|\mathbb{Y}-\mathbb{X} \boldsymbol{\beta}\|^{2}.$$
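For reference, when $\mathbb{X}$ has full column rank this minimization has the closed-form solution given by the normal equations,
$$\widehat{\boldsymbol{\beta}}=\left(\mathbb{X}^{\top} \mathbb{X}\right)^{-1} \mathbb{X}^{\top} \mathbb{Y}.$$
With $p=199$ partly collinear columns this rank condition fails for some of the dummy variables, which is why \texttt{lm} reports \texttt{NA} coefficients for them below.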
Note that under these assumptions the error vector $\boldsymbol{\varepsilon} = \mathbb{Y}-\mathbb{X} \boldsymbol{\beta}^\ast \sim N(0,\sigma^2 I_n)$. The fit is carried out in \texttt{R} using the \texttt{lm} function. The diagnostics for this regression can be seen in figure \@ref(fig:simplemodel).
```{r simplemodel, fig.height = 3, fig.width = 5.5, fig.align = "center", fig.cap = "Diagnostics of full linear model", message=FALSE, warning=FALSE, echo=FALSE}
simpleLM <- lm( formula="SalePrice ~ .", data=varTrain)
#summary(simpleLM)
par(mfrow=c(2,2), mar=c(2.5,2.5,2.5,2.5), mgp=c(1.5,0.5,0)) # Change the panel layout to 2 x 2
plot(simpleLM)
par(mfrow=c(1,1)) # Change back to 1 x 1
```
This model has several problems. It uses all 199 variables, many of which are insignificant and several of which return \texttt{NA} parameter estimates due to collinearity. We tackle this by reducing the dimensionality of the variables. First we merge neighborhoods with similar distributions of \textit{SalePrice}: there is a clear top 3 of neighborhoods in terms of mean and median sale price, namely \textit{NoRidge}, \textit{Stonebr} and \textit{NridgHt}, which seem to be the more expensive neighborhoods to buy a house in. We collect the bottom 4 neighborhoods in a cheaper category and the rest in a middle category.
```{r Reduce levels Neighborhood, fig.height = 5, fig.width = 7, fig.align = "center", fig.cap = "Distributions of important variables", message=FALSE, warning=FALSE, echo=FALSE, results = 'hide'}
# Compute mean and median of SalePrice divided into neighborhoods
meansbyNeighborhood = aggregate(train$SalePrice, by=list(train$Neighborhood), FUN=mean)
statsbyNeighborhood = cbind(meansbyNeighborhood, aggregate(train$SalePrice, by=list(train$Neighborhood), FUN=median)[2])
names(statsbyNeighborhood) = c("Neighborhoods", "Mean", "Median")
statsbyNeighborhood[order(statsbyNeighborhood$Mean,decreasing = TRUE),][1:3,]
# Mean and median agree on the top 3, on where there is a significant jump down to number 4, and on the bottom 4 as well.
# Based on this we collect the Neighborhoods in [low, medium, high]. This is indeed ordinal
full$Neighbor[full$Neighborhood %in% c('NoRidge', 'Stonebr', 'NridgHt')] <- 2
full$Neighbor[!full$Neighborhood %in% c('NoRidge', 'Stonebr', 'NridgHt', 'IDOTRR', 'MeadowV', 'BrDale', 'OldTown')] <- 1
full$Neighbor[full$Neighborhood %in% c('IDOTRR', 'MeadowV', 'BrDale', 'OldTown')] <- 0
full$Neighborhood <- NULL
```
As a way of reducing the dimensionality of sparse categorical variables we consider Tukey's HSD test to determine whether the mean difference between specific pairs of groups within a categorical variable is statistically significant. As an example consider the variable \textit{BldgType}.
```{r tukey, message=FALSE, warning=FALSE, echo=FALSE, results = 'hide'}
# BldgType variables
dummy <- cbind(train$SalePrice,varFactors[1:dimTrain,9])
names(dummy)[1] <- "SalePrice"
lmDummy <- lm( formula="SalePrice ~ .", data=dummy)
aovDummy <- aov(lmDummy)
tukey <- TukeyHSD(aovDummy)
# Select insignificant differences
tidy(tukey) %>%
filter(adj.p.value > 0.05)
full$BldgType<-as.factor(recode(full$BldgType,'TwnhsE'='Type1', '1Fam'='Type1', 'Twnhs'='Type2', 'Duplex'='Type2','2fmCon'='Type2'))
# RoofStyle variables
dummy <- cbind(train$SalePrice,varFactors[1:dimTrain,11])
names(dummy)[1] <- "SalePrice"
lmDummy <- lm( formula="SalePrice ~ .", data=dummy)
aovDummy <- aov(lmDummy)
tukey <- TukeyHSD(aovDummy)
# Select insignificant differences
#tidy(tukey) %>%
#filter(adj.p.value > 0.05) # Uncomment to see result - this is only for the report
# It comes as no surprise that (Flat, Gambrel, Mansard, Shed) share insignificant mean differences, but they are also the categories with very few observations. We merge these into one level: Other.
full$RoofStyle<-as.factor(recode(full$RoofStyle,'Flat'='Other', 'Gambrel'='Other', 'Mansard'='Other','Shed'='Other'))
```
From the insignificant differences we merge the levels into two groups, thereby reducing dimensionality and avoiding sparsity in the resulting dummy variables. We do something similar for \textit{RoofStyle}.
In order to explain more of the residual variance visible in the \textit{Residuals vs Fitted} diagnostic plot, we construct two new variables from the existing ones: \textit{Age}, the age of the house since its last remodeling, and \textit{TotalBath}, the total number of bathrooms in the house. We will consider creating a model of higher order at a later point.
```{r new variables, fig.height = 5, fig.width = 7, fig.align = "center", fig.cap = "Distributions of important variables", message=FALSE, warning=FALSE, echo=FALSE}
# We assume the standardization which has already taken place does not matter
# Age of the house
full$Age <- as.numeric(full$YrSold)-full$YearRemodAdd
# Adding all bathroom to one variable
full$TotalBath <- full$FullBath + full$BsmtFullBath + (full$HalfBath*0.5) + (full$BsmtHalfBath*0.5)
```
The diagnostics of the model in Figure \@ref(fig:simplemodel) reveal issues with the assumptions of the linear model. In the scale-location plot we see a clear trend, which indicates a problem with the \textbf{homoscedasticity} of the residuals; the price of expensive houses varies more than that of cheap houses. This also brings us to the skewness of the response variable as well as of the predictors, which contradicts the assumption of \textbf{multivariate normality}. To fix these issues we log-transform skewed predictors and apply a \textbf{Box-Cox} transform to the response variable, since this gave better results than a log-transform. This will also improve the spread-location behaviour of our residuals.
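The Box-Cox transform used below maps the response as $y \mapsto (y^\lambda - 1)/\lambda$ with $\lambda = 0.3$, a value fixed by hand. As a sketch of how such a value might be chosen, one can maximize the profile likelihood with the \texttt{boxcox} function from the already loaded \texttt{MASS} package; the snippet below is only illustrative and is not the procedure used for the results in this report.
```{r boxcox-sketch, eval=FALSE}
# Illustrative sketch (not run): profile-likelihood choice of the Box-Cox
# exponent for SalePrice on the (still untransformed) training data.
bc <- MASS::boxcox(lm(SalePrice ~ 1, data = train), lambda = seq(-1, 1, by = 0.05))
lambdaHat <- bc$x[which.max(bc$y)]
lambdaHat  # compare with the value 0.3 fixed below
```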
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Splitting data into categorical and numerical variables
varCat <- which(sapply(full, is.factor))
varNum <- which(sapply(full, is.numeric))
varNumNames <- names(varNum)
full_varNum <- full[, varNum]
# Numerical variables without 'SalePrice'
varNumNames <- varNumNames[!(varNumNames %in% 'SalePrice')]
varNum <- full[, names(full) %in% varNumNames]
K = min(varNum)
i = 1
while(i <= ncol(varNum)){
# If skewness is high enough we log-transform
if (abs(skew(varNum[,i]))>0.8){
# We add 1 and subtract K, such that varNum[,i] + 1 - K = 1 for the most negative value
varNum[,i] <- log(varNum[,i] + 1 - K)
}
i <- i + 1
}
# Fix skewness of SalePrice
# No need to add anything since SalePrice > 0
lambda = 0.3
train$SalePrice <- (train$SalePrice^lambda - 1)/lambda
# Standardize numerical variables except for SalePrice
preNum <- preProcess(varNum, method=c("center", "scale"))
varNumStd <- predict(preNum, varNum)
# One-hot encode nominal variables into 'dummy' variables
varFactors <- full[, !(names(full) %in% varNumNames) & !(names(full) %in% 'SalePrice')]
varDummy <- as.data.frame(model.matrix(~.-1, varFactors))
```
Upon one-hot encoding the categorical variables, we might worry about sparsity in certain levels of the variables. This could lead to awkward situations where a level is nonzero in the training set but not in the testing set, or vice versa. We fix this by simply removing levels with fewer than 10 observations, since \textit{treatments} with so few observations will hardly explain the variance in the response variable. This reduces the number of variables from 199 to 124.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Handle levels with few observations in categorical variables, i.e. is any dummy variable close to the zero vector?
lowLevels <- which(colSums(varDummy[1:dimTrain,])<10)
# We remove levels with fewer than 10 observations
varDummy <- varDummy[,-lowLevels]
# We have managed to reduce the number of dummy variables to 77!
#dim(varDummy)
varFull <- cbind(varNumStd, varDummy)
# Extract training and testing data
dimTrain = dim(train)[1]
varTrain = cbind(train$SalePrice,varFull[1:dimTrain,])
varTest = varFull[(dimTrain+1):nrow(varFull),1:ncol(varFull)]
names(varTrain)[1] = "SalePrice"
# Dimension went from 199 to 124!
#dim(varTrain)
```
As a final note on the diagnostics in Figure \@ref(fig:simplemodel), we would like to \textbf{remove outliers}. The "Residuals vs Leverage" plot reveals potential leverage points 830 and 516, which also seem to be regression outliers in the "Scale-Location" plot. We will not remove these just yet, since the model assumptions are not satisfied. In addition the leverage analysis locates certain observations with Cook's distance 1. We suspect this is due to $h_{ii} = 1$ for some observation $i$, in which case the fitted value equals the observed response, $(P_{\mathbb{X}} \mathbb{Y})_i = Y_i$, where $P_{\mathbb{X}}$ is the usual projection onto the column space of $\mathbb{X}$. This could simply be due to sparsity. Lastly we remove the outliers 596 and 199, partly for the reasons already stated, a very low sale price compared to area and overall quality, and partly since they have the highest Cook's distance for the new linear model.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# simple model with outliers
simpleLM2 <- lm( formula="SalePrice ~ .", data=varTrain)
# Uncomment to see result with outliers 596 and 199
#summary(simpleLM2)
#par(mfrow=c(2,2), mar=c(2.5,2.5,2.5,2.5), mgp=c(1.5,0.5,0)) # Change the panel layout to 2 x 2
#plot(simpleLM2)
#par(mfrow=c(1,1)) # Change back to 1 x 1
```
```{r simplemodel3, fig.height = 3, fig.width = 5.5, fig.align = "center", fig.cap = "Diagnostics of linear model with modifications", message=FALSE, warning=FALSE, echo=FALSE}
# Simple model without outliers
varTrain <- varTrain[-c(199,596),]
simpleLM3 <- lm( formula="SalePrice ~ .", data=varTrain)
#summary(simpleLM3)
par(mfrow=c(2,2), mar=c(2.5,2.5,2.5,2.5), mgp=c(1.5,0.5,0)) # Change the panel layout to 2 x 2
plot(simpleLM3)
par(mfrow=c(1,1)) # Change back to 1 x 1
```
Commenting on the diagnostics of our new, improved linear model in Figure \@ref(fig:simplemodel3), it seems there is still unexplained variance in the residuals. The residuals appear normally distributed, and while the "Scale-Location" plot still reveals a trend, it is better than before. Indeed the variance is now fairly similar for most of the residuals, although it is higher for the more extreme fitted values. Finally we seem to have no leverage points. Without showing a plot, we assure the reader that the autocorrelation assumption on the residuals is also satisfied. For this model we have $\texttt{RMSE}=24659.24$ and $R^2 = 0.91$ evaluated on the test data, which can be improved considerably. See the section "Comparison" for a comparison with our other models.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Predict from model
predictSP = predict(simpleLM3,varTest)
# Inverse box-cox transform
predictSP = (lambda*predictSP + 1)^(1/lambda)
rmse = RMSE(predictSP, testPrice) # RMSE
r2 = R2(predictSP, testPrice) # R^2
nrmse = rmse/mean(testPrice) # Normalized
```
# Final models
Our current linear model still calls for reduction and is still lacking in explaining the variance of the residuals. Because of the collinearity and sparsity of many variables, t-tests are not necessarily a good way of assessing variable significance, and Fisher tests seem infeasible for this number of variables, so we must turn to other methods. We therefore start by gaining some knowledge from a Lasso regression. It is a linear regression model with an $\ell_1$ penalty, i.e. it estimates the parameters of the linear model by minimizing the Lagrangian
$$\widehat{\boldsymbol{\beta}}=\arg \min _{\boldsymbol{\beta}\in \mathbb{R}^p} \left\{\|\mathbb{Y}-\mathbb{X} \boldsymbol{\beta}\|^{2}+\lambda\|\beta\|_{1}\right\},$$
where $\lambda$ is a regularization parameter chosen by 10-fold cross-validation. For sufficiently large values of $\lambda$ this penalty produces parsimonious models.
## Lasso
```{r, message=FALSE, warning=FALSE, echo=FALSE}
set.seed(2019)
# Order 1
x1 <- model.matrix(SalePrice~., data = varTrain)[,-1]
# Order 2
x2 <- model.matrix(SalePrice~.+.^2, data = varTrain)[,-1]
y <- varTrain$SalePrice
# Cross validation for Lasso to find best lambda
cv1 <- cv.glmnet(x1, y, alpha = 1)
cv2 <- cv.glmnet(x2, y, alpha = 1)
# Use either lambda.min or lambda.1se for optimal vs. most regularized with 1 sd of min
model1 <- glmnet(x1, y, alpha = 1, lambda = cv1$lambda.min)
model2 <- glmnet(x2, y, alpha = 1, lambda = cv2$lambda.min)
# Find variable importance with function from caret package
# Use either lambda.min or lambda.1se for optimal vs. most regularized with 1 sd of min
lassoImp1 = varImp(model1, lambda = cv1$lambda.min)
lassoImp2 = varImp(model2, lambda = cv1$lambda.min)
# How many were not selected?
varsSelected1 <- which(lassoImp1$Overall>0)
varsSelected2 <- which(lassoImp2$Overall>0)
varsNotSelected1 <- which(lassoImp1$Overall==0)
varsNotSelected2 <- which(lassoImp2$Overall==0)
n1 <- length(varsNotSelected1)
n2 <- length(varsNotSelected2)
# Ordered importance of variables of first order
lassoImp1 <- as.data.frame(lassoImp1)
lassoImp1 <- data.frame(overall = lassoImp1$Overall[varsSelected1], names = rownames(lassoImp1)[varsSelected1])
orderImp1 <- lassoImp1[order(lassoImp1$overall,decreasing = T),][2]
#lassoImp1[order(lassoImp1$overall,decreasing = T),] #uncomment for result
# Ordered importance of variables of second order
lassoImp2 <- as.data.frame(lassoImp2)
lassoImp2 <- data.frame(overall = lassoImp2$Overall[varsSelected2], names = rownames(lassoImp2)[varsSelected2])
#lassoImp2[order(lassoImp2$overall,decreasing = T),] #uncomment for result
orderImp <- lassoImp2[order(lassoImp2$overall,decreasing = T),][2]
```
We find Lasso estimators for two models: one with first-order terms only and one including second-order interactions of our variables. The first model throws away 51 of the 124 variables, whereas the second throws away 7409 out of 7626 variables, which is quite a remarkable reduction in both cases.
Using instead the $\lambda$ that gives the most regularized model with error within one standard error of the minimum, we find \textit{GrLivArea}, \textit{OverallQual}, \textit{SaleTypeNew}, \textit{KitchenQual} and \textit{LotArea} as the top 5 important variables in the \textbf{first model}. This is not far from our random forest results, Figure \@ref(fig:fig6). With this choice the two models throw away 89 and 7549 variables respectively. The importance of these variables makes a lot of sense: the above ground living area, the overall quality and the kitchen quality are simply very important factors for buyers.
As important variables in the \textbf{second model} we find \textit{GrLivArea}, \textit{OverallQual}, \textit{PavedDrive:Exterior1stBrkFace}, \textit{KitchenQual} and \textit{LotArea} as the top 5, which is very close to the first model, but also rather surprising: somehow the interaction between a paved drive and the exterior covering of the house is quite relevant for the sale price. We also find the interaction \textit{Neighbor:Condition2Norm} of surprisingly high importance: it is the combination of being in the right neighborhood and having normal proximity to conditions such as railroads.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Predict from model1 and model2
yTest1 <- model.matrix(~.,data=varTest)[,-1]
yTest2 <- model.matrix(~.+.^2,data=varTest)[,-1]
yTrain1 <- model.matrix(~.,data=varTrain[2:ncol(varTrain)])[,-1]
yTrain2 <- model.matrix(~.+.^2,data=varTrain[2:ncol(varTrain)])[,-1]
predict1 = predict(model1, yTest1)
predict2 = predict(model2, yTest2)
predictTrain1 = predict(model1, yTrain1)
predictTrain2 = predict(model2, yTrain2)
res1 = varTrain$SalePrice - predictTrain1
res2 = varTrain$SalePrice - predictTrain2
# Inverse box-cox transform
predict1 = (lambda*predict1 + 1)^(1/lambda)
predict2 = (lambda*predict2 + 1)^(1/lambda)
rmse1 = RMSE(predict1, testPrice) # RMSE
rmse2 = RMSE(predict2, testPrice) # RMSE
nrmse1 = rmse1/mean(testPrice) # Normalized
nrmse2 = rmse2/mean(testPrice) # Normalized
r21 = R2(predict1, testPrice) # R^2
r22 = R2(predict2, testPrice) # R^2
```
We note that the RMSE is 25231.37 and 21482 for the first and second model respectively, while $R^2$ is 0.911 and 0.937, evaluated on the test data.
```{r, message=FALSE, warning=FALSE, echo=FALSE, fig.height = 2.3, fig.width = 5.5, fig.align = "center", fig.cap = "Centrality of residuals with respect to the fitted values in the first and second Lasso model"}
par(mfrow=c(1,2), mar=c(3,3, 3, 3), mgp=c(1.5,0.5,0)) # Change the panel layout to 2 x 1
plot1 <- plot(predictTrain1, res1, main="Residuals vs Fitted",
xlab="Fitted values", ylab="Residuals ", pch=19)
lines(lowess(predictTrain1, res1), col = "red")
plot2 <- plot(predictTrain2, res2, main="Residuals vs Fitted",
xlab="Fitted values ", ylab="Residuals", pch=19)
lines(lowess(predictTrain2, res2), col = "red")
par(mfrow=c(1,1)) # Change back to 1 x 1
```
As a final note we have produced a simple linear regression model including the top 50 important variables as found by the Lasso estimator. We then use the Akaike information criterion (\textbf{AIC}) to select a model using forward, backward and both-direction selection, obtaining the same model with backward and both-direction selection and a reduction of only 6 variables. An \textbf{ANOVA} on the remaining model of 44 variables suggests that all variables are significant except for one, which has a p-value of 0.10. The estimates, standard errors, confidence intervals, and p-values of the parameters for this model can be found in the \textbf{Appendix}.
```{r message=FALSE, results = 'hide', warning=FALSE, echo=FALSE,}
order50Imp <- orderImp[1:50,]
formula = formula(paste("SalePrice ~ ", paste(as.vector(order50Imp), collapse=" + ")))
FinalLM <- lm(formula, data = varTrain)
modelAICboth <- stepAIC(FinalLM, direction = "both")
modelAICback <- stepAIC(FinalLM, direction = "backward")
modelAICfor <- stepAIC(FinalLM, direction = "forward")
#modelAICboth$anova
#modelAICback$anova
#modelAICfor$anova
#summary(modelAIC)
# Predict from model
predictAIC = predict(modelAICboth,varTest)
# Inverse box-cox transform
predictAIC = (lambda*predictAIC + 1)^(1/lambda)
rmseAIC = RMSE(predictAIC, testPrice) # RMSE
r2AIC = R2(predictAIC, testPrice) # R^2
nrmseAIC = rmseAIC/mean(testPrice) # Normalized
# Summary and confidence interval of variables
confin <- confint(modelAICboth, level = 0.95)
#summaryAIC <- cbind(summary(modelAICboth)$coefficients,confin) # We comment this here and put instead in appendix
# Anova
#anova(modelAICboth)
```
## Gradient Boosting
As a final model we use gradient boosting as implemented in the Xgboost library. We will not describe the method here but rather refer to the paper [@chen2016xgboost], and we comment on the final result in the next section.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
cv10 <-trainControl(method="cv", number=4, verboseIter = TRUE)
grid = expand.grid(nrounds = 500, eta = c(0.1, 0.05, 0.01),
max_depth = c(3, 4, 5), gamma = 0,
colsample_bytree=1, min_child_weight=c(2, 3, 4),
subsample=1)
```
```{r cross validation, message=FALSE, warning=FALSE, echo=FALSE}
# Don't run this unless you want to wait a loooooong time
#xgb<- train(x=x, y=varTrain$SalePrice, method='xgbTree', trControl= cv10, tuneGrid=grid)
#xgb$bestTune
# result: nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
# 500 3 0.05 0 1 2 1
```
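To make the pipeline concrete, the following sketch (not run, and not part of the reported results) fits the boosted model directly with the best tuning parameters noted above, reusing the design matrices \texttt{x1} and \texttt{yTest1} from the Lasso section and assuming their column sets match.
```{r xgb-sketch, eval=FALSE}
# Illustrative sketch (not run): fit xgboost with the tuned parameters
# (nrounds = 500, max_depth = 3, eta = 0.05, min_child_weight = 2) and
# predict on the test design matrix, undoing the Box-Cox transform.
xgbFit <- xgboost(data = x1, label = varTrain$SalePrice,
                  nrounds = 500, verbose = 0,
                  params = list(eta = 0.05, max_depth = 3, gamma = 0,
                                min_child_weight = 2, subsample = 1,
                                colsample_bytree = 1))
predictXgSketch <- predict(xgbFit, yTest1)
predictXgSketch <- (lambda * predictXgSketch + 1)^(1/lambda)
```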
```{r, message=FALSE, warning=FALSE, echo=FALSE}
## Predict with Xg boost!
# predictXg = predict(xgb, yTest)
# # Inverse box-cox transform
# predictXg = (lambda*predictXg + 1)^(1/lambda)
#
# rmseXg = RMSE(predictXg, testPrice) # RMSE
# r2Xg = R2(predictXg, testPrice) # R^2
# nrmseXg = rmseXg/mean(testPrice) # Normalized
```
```{r, message=FALSE, warning=FALSE, echo=FALSE}
## Predicting SalePrice with ensemble model by averaging
# predictions for Xgboost and the second order Lasso
# finalPrediction = (predictXg + predict2)/2
#
# rmseFinal = RMSE(finalPrediction, testPrice) # RMSE
# r2Final = R2(finalPrediction, testPrice) # R^2
# nrmseFinal = rmseFinal/mean(testPrice) # Normalized
```
## Comparison
In this section we compare our models' predictions on the test data. Note that we have only \textit{trained} and estimated parameters on the training data. In the case of the Lasso and Xgboost models we trained using cross-validation, which adds another layer of validation.
```{r, message=FALSE, warning=FALSE, echo=FALSE}
# Hard-coded values, since the Xgboost prediction above is commented out
Ensemblevec = c(22079.30, 0.1226, 0.9321)
Xgboostvec = c(24235.31, 0.1346, 0.9147)
Lasso1 = c(rmse1, nrmse1, r21[1])
Lasso2 = c(rmse2, nrmse2, r22[1])
LinM = c(rmse, nrmse, r2[1])
AIC = c(rmseAIC, nrmseAIC, r2AIC[1])
table <- data.frame("Ensemble" = format(Ensemblevec,digits=3,nsmall = 0), "Xgboost" = format(Xgboostvec,digits=3,nsmall = 0), "Lasso1" = format(Lasso1,digits=3,nsmall = 0), "Lasso2" = format(Lasso2,digits=3,nsmall = 0), "Linear model" = format(LinM,digits=3,nsmall = 0), "Linear model AIC" = format(AIC,digits=3,nsmall = 0))
row.names(table) = c("RMSE", "Normalized RMSE", "R^2")
kable(table, "latex", caption = "Testing metrics of our final models. Lasso1 and Lasso2 use 72 and 217 variables respectively, whereas the simple linear model uses 124. The AIC-selected linear model uses only 45.", booktabs = T) %>%
kable_styling(latex_options = c("striped", "hold_position"))
```
Considering Table 1, we conclude that Lasso2 achieves the best result. Even when combined with the Xgboost model, the ensemble does not perform better. Interestingly, there is not much difference in performance between the simple linear model using 124 variables and the linear model using only 45 variables.
# Discussion.
We conclude that our hypothesis largely corresponds to our results. We have seen a fairly accurate description of the sale price with a Lasso model using 217 variables, which performs best on the test data with a normalized RMSE of $0.1193$, and we have seen that this model also does well on the diagnostics. We have considered the most important variables as found by different methods, random forests, Lasso, AIC and ANOVA, thereby combating the collinearity of the many variables. Finally we have seen that a large reduction of variables is possible while still maintaining a certain level of accuracy.
With that in mind, there are some limitations to our analysis. For example the data cleaning could be better, as a lot of information is lost in the process. We put a lot of trust in our Lasso model; however, it is not very precise, since it is a biased estimator due to the regularization. This also means it does not make sense to estimate the variability of the parameters, which makes the model harder to interpret. The Xgboost model needs more training in order to perform better, but this is also an opportunity for further work on our predictions, including hyperparameter tuning.
Future directions of our research could include improving our reduced linear model by picking a number of important variables suggested by an ensemble of Xgboost, random forests and Lasso, and then reducing this set further with clever model selection tools such as \texttt{glmulti} or simulated annealing. Using a more rigorous MANCOVA we could then gain more insight into the variables. Finally, a lot more work could be put into interpreting the important variables and their parameter estimates: what weighs positively and what weighs negatively?
\newpage
# References
<div id="refs"></div>
# Appendix
```{r, message=FALSE, warning=FALSE, echo=FALSE}
summaryAIC <- cbind(summary(modelAICboth)$coefficients,confin)
kable(summaryAIC, format="latex") %>%
kable_styling(latex_options=c("scale_down","hold_position"))
```