---
title: "Data Visualization and Analysis"
subtitle: "Semester Project"
author:
- Amrit Kaur
- Pratik Kumar Agarwal
- Mayuri Mendke
- Nofel Mahmood
output: pdf_document
---
# Dataset

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

# Motivation

The dataset has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The task is to predict the sale price of each house from these variables. It interested us because we had studied linear regression and related models for exactly this kind of problem.

# Approach

Because there are so many variables, we first identified the ones that influence the sale price most strongly and used them to build a linear model that predicts house sale prices. We then applied time series analysis to forecast house prices for the next 10 years. Finally, we applied clustering to group houses by sale price (low, mid, high) and used a conditional inference tree to show how sale prices vary with the year in which a house was built.
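# Exploration / Visualizations

Install pacman if it is missing, then load it.

```{r}
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman", repos = "https://cloud.r-project.org")
library(pacman)
```

Load all required packages and read the data.

```{r}
p_load(tidyverse, stringr, lubridate, ggplot2, tseries, forecast, scales, party)
# stringsAsFactors = TRUE keeps text columns as factors (not the default since R 4.0)
house_training_data <- read.csv("./DataSet/train.csv", stringsAsFactors = TRUE)
house_test_data <- read.csv("./DataSet/test.csv", stringsAsFactors = TRUE)
```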
Summarize the sale price to see its minimum, maximum, mean and quartiles.
```{r}
summary(house_training_data$SalePrice)
```
Add a `SalePrice` column filled with `NA` to the test data and assign the result to a new variable. Then combine the training and test data, which makes the analysis easier.

```{r}
# Test rows have no sale price yet, so fill the new column with NA
house_test_data.SalePrice <-
  data.frame(SalePrice = rep(NA, nrow(house_test_data)), house_test_data[,])
# rbind() matches data frame columns by name, so the column order may differ
house_combined <- rbind(house_training_data, house_test_data.SalePrice)
dim(house_combined) # Dimensions of the combined dataset
```
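
The combining step relies on `rbind()` matching data frame columns by name rather than by position. A minimal sketch on made-up data (the `*_toy` names and all values are hypothetical and not part of the project):

```r
# Toy stand-ins for the training and test sets (hypothetical values).
train_toy <- data.frame(Id = 1:2, LotArea = c(8000, 9000),
                        SalePrice = c(100000, 200000))
test_toy <- data.frame(Id = 3:4, LotArea = c(10000, 11000))

# Put the placeholder column first, as done for the real test set.
test_toy_na <- data.frame(SalePrice = rep(NA, nrow(test_toy)), test_toy)

# rbind() on data frames matches columns by name, so the differing
# column order is harmless; the result keeps the first frame's order.
combined_toy <- rbind(train_toy, test_toy_na)
print(names(combined_toy))
print(combined_toy$SalePrice)
```

The rows that came from the test set can later be recognised by their missing `SalePrice`.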
The histogram below shows that few people can afford very expensive houses and that the majority of houses sold for between 100,000 and 250,000.

```{r}
# Keep only rows with a known sale price (the training rows of the combined data)
training_data <- house_combined[!is.na(house_combined$SalePrice), ]
training_data %>% ggplot(aes(x = SalePrice)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)
```
Next, we find which attributes are most strongly related to `SalePrice`.

```{r}
# We don't need the Id column, so drop it from the training data.
house_training_data$Id <- NULL
# Select only the numeric variables so that we can check
# their correlation with SalePrice.
numeric.type.variables <- which(sapply(house_training_data, is.numeric))
numeric.type.name.variables <- names(numeric.type.variables)
# Several columns contain NAs, so we use "pairwise.complete.obs".
cor.numeric.variables <- cor(house_training_data[, numeric.type.variables],
                             use = "pairwise.complete.obs")
# Sort the correlations with SalePrice in decreasing order,
# so that the most highly correlated variables come first.
cor_sorted <- as.matrix(sort(cor.numeric.variables[, "SalePrice"], decreasing = TRUE))
colnames(cor_sorted) <- c("values")
# Keep only variables whose absolute correlation exceeds 0.5.
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.5)))
# "OverallQual" is the variable most highly correlated with SalePrice,
# followed by "GrLivArea", and so on.
model_OverallQual <- lm(SalePrice ~ OverallQual, data = house_training_data)
summary(model_OverallQual)
ggplot(house_training_data[!is.na(house_training_data$SalePrice), ],
       aes(x = factor(OverallQual), y = SalePrice)) +
  geom_boxplot() + labs(x = "OverallQual", y = "SalePrice") +
  scale_y_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)
# The boxplots show clearly that houses of higher overall quality
# sell for higher prices.
```
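
The selection by correlation can be illustrated on a small example. This is only a sketch under assumptions: the data frame `toy` and its columns `x1`, `x2` and `y` are invented for illustration, and the 0.5 cutoff mirrors the one used above.

```r
# Hypothetical toy data: y is driven mainly by x1; x2 only alternates in sign.
toy <- data.frame(x1 = 1:20, x2 = rep(c(1, -1), 10))
toy$y <- 3 * toy$x1 + toy$x2
toy$x2[c(2, 5)] <- NA   # introduce a few missing values

# Correlation of every column with the target, using all pairs
# of observations that are complete for that pair.
cors <- cor(toy, use = "pairwise.complete.obs")[, "y"]
cors <- sort(cors[names(cors) != "y"], decreasing = TRUE)

# Keep the predictors whose absolute correlation exceeds 0.5.
strong <- names(cors)[abs(cors) > 0.5]
print(strong)
```

With `use = "pairwise.complete.obs"`, a missing value in one column only removes that row from the pairs involving that column, so the other correlations still use all rows.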
# Our models

```{r}
# The next most highly correlated variable is "GrLivArea", the above-ground
# living area in square feet. Houses with a larger living area tend to sell for more.
model_GrLiveArea <- lm(SalePrice ~ GrLivArea, data = house_training_data)
summary(model_GrLiveArea)
ggplot(house_training_data[!is.na(house_training_data$SalePrice), ],
       aes(x = GrLivArea, y = SalePrice)) + geom_point() +
  geom_smooth(method = "lm") + labs(x = "GrLivArea", y = "SalePrice") +
  scale_y_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)
# The two points at the bottom right appear to be outliers.
factor.type.variables.names <- which(sapply(house_training_data, is.factor)) %>% names()
model_Street_Neighborhood <- lm(SalePrice ~ Street + Neighborhood + GarageCond +
                                  KitchenQual + MiscFeature,
                                data = house_training_data)
summary(model_Street_Neighborhood)
# Street and Neighborhood are significant variables affecting the SalePrice of a house.
# The dummy variables StreetPave, NeighborhoodCollgCr and NeighborhoodCrawfor
# are the most significant.
# The model fits well: summary() reports a high R-squared.
# We checked the other variables in the same way.
# Our final model uses Neighborhood, BsmtQual, OverallQual, GrLivArea,
# GarageCars and TotalBsmtSF to predict on the test data.
model_trained <- lm(SalePrice ~ Neighborhood + BsmtQual + OverallQual + GrLivArea +
                      GarageCars + TotalBsmtSF, data = house_training_data)
# The model is trained. Now we predict SalePrice on the test dataset.
summary(model_trained)
pred_lm <- predict(model_trained, newdata = house_test_data.SalePrice)
house_test_data_with_predictions <- house_test_data.SalePrice %>%
  mutate(predictedSalePrice = pred_lm)
```
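
One detail worth keeping in mind when predicting on new data: by default `predict()` on an `lm` fit returns `NA` for any row in which a predictor is missing. A minimal sketch with invented numbers (`toy_train`, `toy_new` and the column names are hypothetical):

```r
# Hypothetical toy data in which price is exactly 2 * area.
toy_train <- data.frame(area = c(50, 80, 100, 120, 150),
                        price = c(100, 160, 200, 240, 300))
toy_fit <- lm(price ~ area, data = toy_train)

# New rows to predict; the second one has a missing predictor.
toy_new <- data.frame(area = c(90, NA))
toy_pred <- predict(toy_fit, newdata = toy_new)
print(toy_pred)
```

Counting the missing predictions with `sum(is.na(...))` before using them further is a cheap check.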
# Time Series

```{r}
# Actual (all NA for the test set) and predicted sale prices side by side
actual_preds <- data.frame(actuals = house_test_data.SalePrice$SalePrice,
                           predicteds = pred_lm)
# Quarterly time series objects for the selling year and the predicted sale price,
# covering 2001 Q1 to 2010 Q4. Values are taken in test-set row order.
yrSoldts <- ts(house_test_data_with_predictions$YrSold,
               start = c(2001, 1), end = c(2010, 4), frequency = 4)
salePricets <- ts(house_test_data_with_predictions$predictedSalePrice,
                  start = c(2001, 1), end = c(2010, 4), frequency = 4)
# Check the frequency at which the data is recorded
frequency(salePricets)
# Check for missing values
sum(is.na(salePricets))
# Summary of the data
summary(salePricets)
# Decompose the data into trend, seasonal and random components
tsdata <- ts(salePricets, frequency = 4)
ddata <- decompose(tsdata, "multiplicative")
plot(ddata)
# Plot the series together with a fitted linear trend line
plot(salePricets)
abline(reg = lm(salePricets ~ time(salePricets)))
cycle(salePricets)
# Boxplot by quarter to see in which quarter the sale price is highest
boxplot(salePricets ~ cycle(salePricets), xlab = "Quarter")
# Let auto.arima() choose the best model
priceModel <- auto.arima(salePricets)
priceModel
# Run with trace to compare the information criterion across candidate models
auto.arima(salePricets, ic = "aic", trace = TRUE)
# Forecast the next 10 years (40 quarters) with a 95% prediction interval
priceForecast <- forecast(priceModel, level = c(95), h = 10 * 4)
plot(priceForecast)
```
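
The relation between `start`, `frequency` and the forecast horizon `h` is easy to get wrong, so here is a minimal sketch using only base R. The series `toy_ts` is made up; it assumes quarterly data beginning in the first quarter of 2001.

```r
# 40 hypothetical quarterly values starting in 2001 Q1.
toy_ts <- ts(seq_len(40), start = c(2001, 1), frequency = 4)

print(end(toy_ts))        # last period as (year, quarter)
print(frequency(toy_ts))  # observations per year

# A forecast horizon is counted in periods, not years, so ten years
# of quarterly data corresponds to h = 10 * 4.
h_ten_years <- 10 * frequency(toy_ts)
print(h_ten_years)
```

The second element of `start` and `end` is a period within the year, so for a quarterly series it ranges from 1 to 4.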
# Clustering

```{r}
# Get the predicted sale price together with the year built
sale_price_with_built_year <- house_test_data_with_predictions %>%
  select(YearBuilt, predictedSalePrice) %>% na.omit()
set.seed(42) # k-means uses random starting centres; fix the seed for reproducibility
cluster <- kmeans(sale_price_with_built_year, 3)$cluster
cbind(sale_price_with_built_year, cluster) %>%
  ggplot(aes(x = YearBuilt, y = predictedSalePrice, color = factor(cluster))) +
  geom_point()
# Conditional inference tree of the predicted sale price on the year built
tree <- ctree(predictedSalePrice ~ ., data = sale_price_with_built_year,
              controls = ctree_control(minbucket = 100))
plot(tree)
```
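
The cluster numbers returned by `kmeans()` are arbitrary, so naming the groups low, mid and high needs one extra step. A minimal sketch on invented data follows; the data frame `toy`, its values and the helper `label_by_rank` are hypothetical, and for simplicity it clusters on price alone rather than on both columns.

```r
# Hypothetical toy data: three well-separated price levels.
toy <- data.frame(
  YearBuilt = c(1950, 1955, 1960, 1985, 1990, 1995, 2005, 2007, 2009),
  price     = c(90, 95, 100, 180, 185, 190, 300, 310, 320) * 1000
)

set.seed(42)  # k-means starts from random centres
fit <- kmeans(toy$price, centers = 3, nstart = 10)

# Cluster ids are arbitrary, so rank the centres to name the groups.
label_by_rank <- c("low", "mid", "high")[rank(fit$centers[, 1])]
toy$group <- factor(label_by_rank[fit$cluster],
                    levels = c("low", "mid", "high"))
print(table(toy$group))
```

Mapping `color = group` in the scatter plot would then give a legend with meaningful names instead of cluster numbers.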
# Conclusion

We saw that the variables used to build our linear model (Neighborhood, BsmtQual, OverallQual, GrLivArea, GarageCars and TotalBsmtSF) strongly affect the sale price. We then used time series analysis to forecast sale prices for the next 10 years. Finally, k-means clustering showed that houses can be grouped into low, mid and high sale prices with respect to the year in which they were built. The most expensive houses were built after approximately 1980.