---
title: 'Bios 6301: Assignment 1'
author: 'Andrea Perreault'
output: html_document
---
*(informally) Due Thursday, 08 September, 1:00 PM*
50 points total.
This assignment won't be submitted until we've covered Rmarkdown.
Place your R code in between the appropriate chunks for each question.
Check your output by using the `Knit HTML` button in RStudio.
### Create a Data Set
A data set in R is called a data.frame. This particular data set is
made of three categorical variables, or factors: `gender`, `smoker`,
and `exercise`. In addition, `exercise` is an ordered factor. `age`
and `los` (length of stay) are continuous variables.
```{r}
gender <- c('M','M','F','M','F','F','M','F','M')
age <- c(34, 64, 38, 63, 40, 73, 27, 51, 47)
smoker <- c('no','yes','no','no','yes','no','no','no','yes')
exercise <- factor(c('moderate','frequent','some','some','moderate','none','none','moderate','moderate'),
levels=c('none','some','moderate','frequent'), ordered=TRUE
)
los <- c(4,8,1,10,6,3,9,4,8)
x <- data.frame(gender, age, smoker, exercise, los)
x
```
### Create a Model
We can create a model using our data set. In this case I’d like to
estimate the association between `los` and all remaining variables.
This means `los` is our dependent variable. The other columns will be
terms in our model.
The `lm` function will take two arguments, a formula and a data set.
The formula is split into two parts, where the vector to the left of
`~` is the dependent variable, and items on the right are terms.
```{r}
lm(los ~ gender + age + smoker + exercise, data=x)
```
1. Looking at the output, which coefficient seems to have the highest
effect on `los`? (2 points)
```
The coefficient for `gender` appears to have the largest effect on `los`.
```
This can be tough because it also depends on the scale of the
variable. If all the variables are standardized, then this is not
the case.
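One way to make the coefficients more comparable is to standardize the continuous term; a quick sketch (the factor terms are left as-is, since standardizing a factor is not meaningful):

```{r}
# Rescale age to mean 0, sd 1, so its coefficient reads as
# "change in los per standard deviation of age" rather than per year
lm(los ~ gender + scale(age) + smoker + exercise, data=x)
```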
Given that we only have nine observations, it's not really a good idea
to include all of our variables in the model. In this case we'd be
"over-fitting" our data. Let's only include one term, `gender`.
#### Warning
When choosing terms for a model, use prior research, don't just
select the variable with the highest coefficient.
2. Create a model using `los` and `gender` and assign it to the
variable `mod`. Run the `summary` function with `mod` as its
argument. (5 points)
```{r}
mod <- lm(los ~ gender, data=x)
summary(mod)
```
The summary of our model reports the parameter estimates along with
standard errors, test statistics, and p-values. This table of
estimates can be extracted with the `coef` function.
### Estimates
3. What is the estimate for the intercept? What is the estimate for
gender? Use the `coef` function. (3 points)
```{r}
mod.c <- coef(summary(mod))
mod.c[,1]
```
```
The `coef` function gives an estimate of 3.5 for the intercept and 4.3 for `genderM`.
```
4. The second column of `coef` are standard errors. These can be
calculated by taking the `sqrt` of the `diag` of the `vcov` of the
`summary` of `mod`. Calculate the standard errors. (3 points)
```{r}
se1 <- sqrt(diag(vcov(summary(mod))))
se1
mod.c[,2] # confirmation
```
The third column of `coef` are test statistics. These can be
calculated by dividing the first column by the second column.
```{r}
mod.c[,1]/mod.c[,2]
mod.c[,3] # confirmation
```
The fourth column of `coef` are p values. This captures the
probability of observing a more extreme test statistic. These can be
calculated with the `pt` function, but you will need the
degrees-of-freedom. For this model, there are 7 degrees-of-freedom.
5. Use the `pt` function to calculate the p value for gender. The first
argument should be the test statistic for gender. The second argument
is the degrees-of-freedom. Also, set the `lower.tail` argument to
FALSE. Finally, multiply this result by two. (4 points)
```{r}
pval <- pt(mod.c[2,3], 7, lower.tail=FALSE) * 2
pval
mod.c[2,4] # confirmation
```
### Predicted Values
The estimates can be used to create predicted values.
```{r}
3.5+(x$gender=='M')*4.3
```
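Rather than hard-coding 3.5 and 4.3, the same predictions can be built from the fitted coefficients of `mod`:

```{r}
# Intercept = mean los for women; adding the genderM
# coefficient gives the mean for men
coef(mod)[1] + (x$gender == 'M') * coef(mod)[2]
```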
6. It is even easier to see the predicted values by passing the model
`mod` to the `predict` or `fitted` functions. Try it out. (2 points)
```{r}
predict(mod)
fitted(mod)
```
7. `predict` can also use a new data set. Pass `newdat` as the second
argument to `predict`. (3 points)
```{r}
newdat <- data.frame(gender=c('F','M','F'))
predict(mod, newdat)
```
### Residuals
The difference between predicted values and observed values are
residuals.
8. Use one of the methods to generate predicted values. Subtract the
predicted value from the `x$los` column. (5 points)
```{r}
pred <- predict(mod)
pred
res <- x$los - pred
res
```
9. Try passing `mod` to the `residuals` function. (2 points)
```{r}
residuals(mod)
res # confirmation
```
10. Square the residuals, and then sum these values. Compare this to the
result of passing `mod` to the `deviance` function. (6 points)
```{r}
dev <- sum(res^2)
deviance(mod)
dev # confirmation
```
Remember that our model formula has two items, `los` and `gender`;
equivalently, the model estimates two coefficients, the intercept and
the `gender` term. The residual degrees-of-freedom is the number of
observations minus the number of estimated coefficients.
This can be seen by passing `mod` to the function `df.residual`.
```{r}
df.residual(mod)
```
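This matches a hand count of observations minus estimated coefficients:

```{r}
# 9 observations minus 2 coefficients (intercept and genderM)
nrow(x) - length(coef(mod))
```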
11. Calculate the residual standard error by dividing the deviance by the
degrees-of-freedom, and then taking the square root. Verify that this
matches the output labeled "Residual standard error" from
`summary(mod)`. (5 points)
```{r}
se2 <- sqrt(dev/df.residual(mod))
se2
summary(mod) # confirmation
```
Note it will also match this output:
```{r}
predict(mod, se.fit=TRUE)$residual.scale
```
### T-test
Let's compare the results of our model to a two-sample t-test. We
will compare `los` by men and women.
12. Create a subset of `x` by taking all records where gender is 'M'
and assigning it to the variable `men`. Do the same for the variable
`women`. (4 points)
```{r}
men <- subset(x, gender=='M')
men
women <- subset(x, gender=='F')
women
```
13. By default a two-sample t-test assumes that the two groups have
unequal variances. You can calculate variance with the `var`
function. Calculate variance for `los` for the `men` and `women` data
sets. (3 points)
```{r}
var(men$los)
var(men[,5]) # confirmation
var(women$los)
var(women[,5]) # confirmation
```
14. Call the `t.test` function, where the first argument is `los` for
women and the second argument is `los` for men. Call it a second time
by adding the argument `var.equal` and setting it to TRUE. Does
either produce output that matches the p value for gender from the
model summary? (3 points)
```{r}
t1 <- t.test(women$los, men$los)
t1
t2 <- t.test(women$los, men$los, var.equal=TRUE)
t2
summary(mod)
```
```
The t-test with `var.equal=TRUE` (t2) gives the same p value as the model summary, while the Welch test (t1) gives a p value about 0.0004 smaller.
```
An alternative way to call `t.test` is to use a formula.
```{r}
t.test(los ~ gender, data=x, var.equal=TRUE)
# compare p-values
t.test(los ~ gender, data=x, var.equal=TRUE)$p.value
coef(summary(lm(los ~ gender, data=x)))[2,4]
```