-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathassignment1_Perreault-graded.Rmd
268 lines (204 loc) · 7.08 KB
/
assignment1_Perreault-graded.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
---
title: 'Bios 6301: Assignment 1'
author: 'Andrea Perreault'
output: html_document
---
Grade: 50/50
*(informally) Due Thursday, 08 September, 1:00 PM*
50 points total.
This assignment won't be submitted until we've covered Rmarkdown.
Place your R code in between the appropriate chunks for each question.
Check your output by using the `Knit HTML` button in RStudio.
### Create a Data Set
A data set in R is called a data.frame. This particular data set is
made of three categorical variables, or factors: `gender`, `smoker`,
and `exercise`. In addition `exercise` is an ordered factor. `age`
and `los` (length of stay) are continuous variables.
```{r}
gender <- c('M','M','F','M','F','F','M','F','M')
age <- c(34, 64, 38, 63, 40, 73, 27, 51, 47)
smoker <- c('no','yes','no','no','yes','no','no','no','yes')
exercise <- factor(c('moderate','frequent','some','some','moderate','none','none','moderate','moderate'),
levels=c('none','some','moderate','frequent'), ordered=TRUE
)
los <- c(4,8,1,10,6,3,9,4,8)
x <- data.frame(gender, age, smoker, exercise, los)
x
```
### Create a Model
We can create a model using our data set. In this case I’d like to
estimate the association between `los` and all remaining variables.
This means `los` is our dependent variable. The other columns will be
terms in our model.
The `lm` function will take two arguments, a formula and a data set.
The formula is split into two parts, where the vector to the left of
`~` is the dependent variable, and items on the right are terms.
```{r}
lm(los ~ gender + age + smoker + exercise, dat=x)
```
1. Looking at the output, which coefficient seems to have the highest
effect on `los`? (2 points)
```
The coefficient for gender seems to have the highest impact on 'los'.
```
This can be tough because it also depends on the scale of the
variable. If all the variables are standardized, then this is not
the case.
Given that we only have nine observations, it's not really a good idea
to include all of our variables in the model. In this case we'd be
"over-fitting" our data. Let's only include one term, `gender`.
#### Warning
When choosing terms for a model, use prior research, don't just
select the variable with the highest coefficient.
2. Create a model using `los` and `gender` and assign it to the
variable `mod`. Run the `summary` function with `mod` as its
argument. (5 points)
```{r}
mod <- lm(los ~ gender, dat=x)
summary(mod)
```
The summary of our model reports the parameter estimates along with
standard errors, test statistics, and p-values. This table of
estimates can be extracted with the `coef` function.
### Estimates
3. What is the estimate for the intercept? What is the estimate for
gender? Use the `coef` function. (3 points)
```{r}
mod.c <- coef(summary(mod))
mod.c[,1]
```
```
The output of the 'coef' function results in a 3.5 estimate for the intercept and a 4.3 estimate for gender.
```
4. The second column of `coef` are standard errors. These can be
calculated by taking the `sqrt` of the `diag` of the `vcov` of the
`summary` of `mod`. Calculate the standard errors. (3 points)
```{r}
se1 <- sqrt(diag(vcov(summary(mod))))
se1
mod.c[,2] # confirmation
```
The third column of `coef` are test statistics. These can be
calculated by dividing the first column by the second column.
```{r}
mod.c[,1]/mod.c[,2]
mod.c[,3] # confirmation
```
The fourth column of `coef` are p values. This captures the
probability of observing a more extreme test statistic. These can be
calculated with the `pt` function, but you will need the
degrees-of-freedom. For this model, there are 7 degrees-of-freedom.
5. Use the `pt` function to calculate the p value for gender. The first
argument should be the test statistic for gender. The second argument
is the degrees-of-freedom. Also, set the `lower.tail` argument to
FALSE. Finally multiple this result by two. (4 points)
```{r}
pval <- (pt(mod.c[,3], 7, lower.tail=FALSE))*2
pval
mod.c[,4] # confirmation
```
### Predicted Values
The estimates can be used to create predicted values.
```{r}
3.5+(x$gender=='M')*4.3
```
6. It is even easier to see the predicted values by passing the model
`mod` to the `predict` or `fitted` functions. Try it out. (2 points)
```{r}
predict(mod)
fitted(mod)
```
7. `predict` can also use a new data set. Pass `newdat` as the second
argument to `predict`. (3 points)
```{r}
newdat <- data.frame(gender=c('F','M','F'))
predict(mod, newdat)
```
### Residuals
The difference between predicted values and observed values are
residuals.
8. Use one of the methods to generate predicted values. Subtract the
predicted value from the `x$los` column. (5 points)
```{r}
pred <- predict(mod)
pred
res <- los[]-pred[]
res
```
9. Try passing `mod` to the `residuals` function. (2 points)
```{r}
residuals(mod)
res # confirmation
```
10. Square the residuals, and then sum these values. Compare this to the
result of passing `mod` to the `deviance` function. (6 points)
```{r}
dev <- sum(res[] ^ 2)
deviance(mod)
dev # confirmation
```
Remember that our model object has two items in the formula, `los`
and `gender`. The residual degrees-of-freedom is the number of
observations minus the number of items to account for in the model
formula.
This can be seen by passing `mod` to the function `df.residual`.
```{r}
df.residual(mod)
```
11. Calculate standard error by dividing the deviance by the
degrees-of-freedom, and then taking the square root. Verify that this
matches the output labeled "Residual standard error" from
`summary(mod)`. (5 points)
```{r}
se2 <- sqrt(dev/df.residual(mod))
se2
summary(mod) # confirmation
```
Note it will also match this output:
```{r}
predict(mod, se.fit=TRUE)$residual.scale
```
### T-test
Let's compare the results of our model to a two-sample t-test. We
will compare `los` by men and women.
12. Create a subset of `x` by taking all records where gender is 'M'
and assigning it to the variable `men`. Do the same for the variable
`women`. (4 points)
```{r}
men <- subset(x, gender=='M')
men
women <- subset(x, gender=='F')
women
```
13. By default a two-sampled t-test assumes that the two groups have
unequal variances. You can calculate variance with the `var`
function. Calculate variance for `los` for the `men` and `women` data
sets. (3 points)
```{r}
var(men['los'])
var(men[,5]) # confirmation
var(women['los'])
var(women[,5]) # confirmation
```
14. Call the `t.test` function, where the first argument is `los` for
women and the second argument is `los` for men. Call it a second time
by adding the argument `var.equal` and setting it to TRUE. Does
either produce output that matches the p value for gender from the
model summary? (3 points)
```{r}
t1 <- t.test(women['los'],men['los'])
t1
t2 <- t.test(women['los'],men['los'],var.equal=TRUE)
t2
summary(mod)
```
```
The t-test with var.equal set to TRUE results in the same p value as the model summary, while the other (t1) is 0.0004 less.
```
An alternative way to call `t.test` is to use a formula.
```{r}
t.test(los ~ gender, dat=x, var.equal=TRUE)
# compare p-values
t.test(los ~ gender, dat=x, var.equal=TRUE)$p.value
coef(summary(lm(los ~ gender, dat=x)))[2,4]
```