---
title: "PML-project"
output: html_document
---
Objective: Predict the manner in which an individual performed the exercise. Create a report describing how the model was built, how cross validation was used, what the expected error is, and why particular choices were made. Finally, use the cross-validated model to predict the 20 test cases.
Datasets: the Weight Lifting Exercises (WLE) data set, consisting of 160 columns, one of which is the outcome variable (classe).
The whole process can be divided into three major stages: preprocessing and data exploration, model building, and prediction.
Preprocessing and Data exploration
Selection of independent variables
The selection is guided by reading through the description of the data set and looking through summary(traindata).
```{r}
## Load the modelling libraries and read in the training and test CSVs
library(caret)
library(randomForest)
library(tree)
library(gbm)
setwd("C:\\pml\\")
traindata = read.csv("pml-training.csv", header = TRUE, sep = ",")
testdata = read.csv("pml-testing.csv", header = TRUE, sep = ",")
```
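For a quick structural check before narrowing the columns, something like the following sketch (not part of the original analysis, output suppressed) confirms the shape of the data:
```{r, eval=FALSE}
## Quick structural checks (sketch): overall dimensions and a sample of
## column names; summary(traindata) gives the full per-column picture
dim(traindata)
head(names(traindata), 20)
summary(traindata)
```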
It appears that the predictor variables should be the following, as all other variables are either descriptions or summary/raw measurement fields (min, max, skewness, kurtosis, average, gyros, etc.):
roll_belt, pitch_belt, yaw_belt, total_accel_belt
roll_arm, pitch_arm, yaw_arm, total_accel_arm
roll_dumbbell, pitch_dumbbell, yaw_dumbbell
roll_forearm, pitch_forearm, yaw_forearm
Based on the above understanding, we have 14 independent variables and one dependent variable (classe). Looking at the dependent variable, it is clear that this is a multi-class classification problem. All the independent variables are continuous (numerical); only the dependent variable is categorical.
```{r}
## Column indices of the 14 predictors plus the outcome classe (column 160);
## in testdata, column 160 holds problem_id rather than classe
cols = c(8, 9, 10, 11, 46, 47, 48, 49, 84, 85, 86, 122, 123, 124, 160)
traindata = traindata[, cols]
testdata = testdata[, cols]
traindata$classe = as.factor(traindata$classe)
set.seed(1234)
traincor = traindata[, -15]  ## numeric predictors only, for the correlation matrix
```
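As a sanity check on the hard-coded indices, the names of the retained columns can be printed; a minimal sketch, not run in the original analysis:
```{r, eval=FALSE}
## After subsetting, traindata should contain exactly the 14 predictors
## listed above plus classe
names(traindata)
```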
Exploration of variables
```{r, echo=FALSE}
cor(traincor)
```
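Several of these sensor readings are strongly correlated. caret's findCorrelation() can flag candidates for removal; a sketch, where the 0.8 cutoff is an assumed threshold rather than a choice from the original analysis:
```{r, eval=FALSE}
## Flag predictors whose pairwise |correlation| exceeds the assumed cutoff
highCor = findCorrelation(cor(traincor), cutoff = 0.8)
names(traincor)[highCor]
```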
Note that testdata is the data on which we are supposed to make the final predictions. So we will use traindata to create our own training and test sets for model building: train1 and test1.
```{r}
set.seed(1234)
## Random 50/50 split of traindata into a training half and a testing half
trainIndex = sample(1:dim(traindata)[1], size = dim(traindata)[1]/2, replace = F)
train1 = traindata[trainIndex, ]
test1 = traindata[-trainIndex, ]
```
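An alternative, shown only as a sketch, is caret's createDataPartition(), which samples within each classe level so both halves keep the original class proportions:
```{r, eval=FALSE}
## Stratified 50/50 split (sketch); preserves class balance across halves
set.seed(1234)
inTrain = createDataPartition(traindata$classe, p = 0.5, list = FALSE)
train1 = traindata[inTrain, ]
test1 = traindata[-inTrain, ]
```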
Modeling and Prediction
Knowing that this is a multi-class classification problem, it makes sense to try classifiers such as decision trees, random forests, and gradient boosting machines.
1. Decision Tree
```{r}
## Fit a CART decision tree via caret's rpart method, then predict on test1
Modtree = train(classe ~ . , data = train1, method = "rpart")
predicttesttree = predict(Modtree, test1)
```
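The objective calls for cross validation. By default caret's train() resamples with the bootstrap; k-fold cross-validation can be requested explicitly through trainControl, as in this sketch (5 folds is an assumed choice, not taken from the original analysis):
```{r, eval=FALSE}
## 5-fold cross-validation for the rpart model; results holds the
## cross-validated accuracy for each tuning value of cp
ctrl = trainControl(method = "cv", number = 5)
ModtreeCV = train(classe ~ ., data = train1, method = "rpart", trControl = ctrl)
ModtreeCV$results
```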
Out-of-sample error based on the decision tree:
```{r, echo=FALSE}
table(predicttesttree,test1$classe)
```
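To reduce the confusion matrix to a single number, the overall accuracy on the held-out half can be computed directly; a minimal sketch:
```{r, eval=FALSE}
## Proportion of test1 cases the decision tree classifies correctly
mean(predicttesttree == test1$classe)
```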
The confusion matrix for the decision tree is not good enough; it already shows a lot of misclassification. So we try another algorithm.
2. Random Forest
```{r}
## Fit a random forest (50 trees) and keep variable-importance measures
ModRF = randomForest(classe ~ ., train1, ntree = 50, norm.votes = FALSE, importance = TRUE)
predicttestRF = predict(ModRF, test1)
```
Training-set error based on the random forest (randomForest computes this confusion matrix from out-of-bag votes):
```{r, echo=FALSE}
ModRF$confusion
```
The confusion matrix is very good, with very few misclassifications.
Out-of-sample error:
```{r, echo=FALSE}
table(predicttestRF, test1$classe)
```
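For a fuller summary than the raw table, caret's confusionMatrix() reports overall accuracy plus per-class sensitivity and specificity; a sketch:
```{r, eval=FALSE}
## Overall accuracy and per-class statistics for the random forest
confusionMatrix(predicttestRF, test1$classe)
```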
A look at variable importance tells us which predictors matter most:
```{r, echo=FALSE}
importance(ModRF)
```
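The same importance measures can also be visualised with randomForest's varImpPlot(); a sketch:
```{r, eval=FALSE}
## Dot charts of mean decrease in accuracy and mean decrease in Gini
varImpPlot(ModRF)
```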
Prediction on the testdata
```{r}
## Final predictions for the 20 graded test cases
predicttestdata = predict(ModRF, testdata)
```
```{r, echo=FALSE}
table(testdata$problem_id, predicttestdata)
```
The random forest performed perfectly, with all predictions correct according to the submission results.
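For completeness, one way to write the 20 predictions out for submission, sketched under the assumption of the course's one-file-per-case format (the helper name and file naming are illustrative, not from the original analysis):
```{r, eval=FALSE}
## Write each predicted class to its own text file, one per problem_id
write_answers = function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_answers(as.character(predicttestdata))
```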