# Independent Samples *t*-Test {#independent-samples-t-test}
The independent samples *t*-test compares the means of two different (or independent) samples.
For example, let's say we were interested in determining whether the salary of professors differed depending on whether they were part of an applied or a theoretical discipline.
For this example, we will use the [`datasetSalaries`][Salaries Dataset] data set.
## Null and research hypotheses
### Traditional approach
<center>
$H_0: \mu_{Applied} - \mu_{Theoretical} = 0$ or equivalently $\mu_{Applied} = \mu_{Theoretical}$
<br />$H_1: \mu_{Applied} - \mu_{Theoretical} \ne 0$ or equivalently $\mu_{Applied} \ne \mu_{Theoretical}$
</center>
<br />
The null hypothesis states there is no difference in the salary of professors in an applied discipline compared to professors in a theoretical discipline. The research hypothesis states there is a difference in the salary of professors in an applied discipline compared to professors in a theoretical discipline.
### GLM approach
$$Model: Salary = \beta_0 + \beta_1*Discipline + \varepsilon$$
$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \ne 0$$
In addition to the intercept ($\beta_0$), we now have a predictor `discipline` along with its associated slope ($\beta_1$). In this model, the slope represents the change in `salary` over the change in `discipline`, and the intercept ($\beta_{0}$) represents the predicted `salary` when `discipline` is 0.
The null hypothesis states that the slope associated with `discipline` is equal to zero. In other words, there is no difference in the salary of professors who are in different disciplines. The alternative hypothesis states that the slope associated with `discipline` is not equal to zero. In other words, there is a difference in the salary of professors who are in different disciplines.
The interpretation of the slope and intercept depends on how `discipline` is coded. Thus, it is always a good idea to check how this categorical IV is coded, which can be done using the `contrasts()` function.
```{r}
contrasts(datasetSalaries$discipline)
```
We can see that `Applied` is coded as `0` and `Theoretical` is coded as `1`. Given that the difference in the coding of `discipline` is 1, the slope represents the mean difference in salary for professors in theoretical disciplines compared to applied disciplines.[^9] Additionally, the intercept ($\beta_{0}$) represents the mean salary of professors in the applied discipline, since `Applied` is coded as 0 in the `discipline` coding scheme.
[^9]: $$b_1 = \frac{\Delta Y}{\Delta X} = \frac{\Delta Salary}{\Delta Discipline} = \frac{\Delta Salary}{1-0} = \frac{\Delta Salary}{1} = \Delta Salary$$
If we wanted to change the coding scheme and code `Theoretical` as `0` and `Applied` as `1`, then our interpretation of the intercept would be the mean salary of professors in the theoretical discipline, since `Theoretical` is now 0. The slope would still be interpreted as the mean difference in salary for professors in different disciplines because the difference in the coding scheme is still 1; however, its sign would change.[^10]
[^10]: $$b_1 = \frac{\Delta Y}{\Delta X} = \frac{\Delta Salary}{\Delta Discipline} = \frac{\Delta Salary}{0-1} = \frac{\Delta Salary}{-1} = -\Delta Salary$$
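For instance, here is a minimal sketch of one way to flip the coding, shown on a copy of the variable so the data used in the rest of this chapter are left untouched (the `discipline_flipped` object is purely illustrative):
```{r}
# Illustrative only: relevel() makes Theoretical the reference level,
# so Theoretical is dummy coded as 0 and Applied as 1.
discipline_flipped <- relevel(datasetSalaries$discipline, ref = "Theoretical")
contrasts(discipline_flipped)
```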
These two coding schemes are both known as dummy coding, which is R's default coding scheme for categorical variables. Specifically, dummy coding is when one level of an IV is coded as 1 and all other levels are coded as 0. However, there are other coding schemes, such as effects (also known as deviation) coding, Helmert coding, polynomial coding, and orthogonal coding. For a good description of different contrasts, along with how to apply and interpret them, check out the UCLA Statistical Consulting Group's <a href="https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/" target="_blank">description</a>. We will also go over coding categorical variables in more detail in the [next chapter][Coding Categorical Variables].
Our preferred contrast for the independent samples *t*-test is to use -0.5 for one group and 0.5 for the other group. We prefer this coding scheme because the slope will still provide the mean difference.[^11] However, since 0 lies halfway between the two groups, the intercept will now represent the mean of the group means. In this example, it will be the mean of two values: the mean salary of professors in the applied discipline and the mean salary of professors in the theoretical discipline.
[^11]: $$b_1 = \frac{\Delta Y}{\Delta X} = \frac{\Delta Salary}{\Delta Discipline} = \frac{\Delta Salary}{0.5-(-0.5)} = \frac{\Delta Salary}{1} = \Delta Salary$$
We recommend assigning 0.5 to the group expected to have the higher value on the dependent variable and -0.5 to the other group. In our example, we might expect professors in the applied discipline to have higher salaries, so we assign that group 0.5, and professors in the theoretical discipline to have lower salaries, so we assign that group -0.5. We can do this by using the concatenate (`c()`) function to group the numbers together and assigning them back to the contrasts. The order inside the `c()` function must follow the alphanumeric order of the levels of the independent variable.
```{r}
contrasts(datasetSalaries$discipline) <- c(0.5, -0.5)
contrasts(datasetSalaries$discipline)
```
Note: If the difference in the coding of `discipline` was not equal to 1, the estimate would equal a corresponding fraction of the mean difference. For example, if `Applied` was coded as `-1` and `Theoretical` was coded as `1`, the difference in `discipline` is now 2 and the estimate would represent half of the mean salary difference.[^12] Thus, if we multiplied the estimate by 2, we would obtain the mean salary difference. Even though the estimate changes, the *t*-statistic and *p*-value will not change because the intercept and error adjust proportionally to the coding scheme of the categorical IV (as long as the two codes are distinct values); the sketch below verifies this.
[^12]: $$b_1 = \frac{\Delta Y}{\Delta X} = \frac{\Delta Salary}{\Delta Discipline} = \frac{\Delta Salary}{1-(-1)} = \frac{\Delta Salary}{2}$$
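This minimal sketch runs on a copy of the data (`salaries_copy`, an illustrative name) so the -0.5/0.5 coding used in this chapter is left untouched:
```{r}
# Illustrative only: with Applied = -1 and Theoretical = 1, the slope is
# half the mean salary difference (with the opposite sign), while the
# absolute t-statistic and the p-value are unchanged.
salaries_copy <- datasetSalaries
contrasts(salaries_copy$discipline) <- c(-1, 1)
coef(summary(lm(salary ~ 1 + discipline, salaries_copy)))
```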
## Statistical analysis
### Traditional approach
To perform the traditional independent samples *t*-test, we can again use the `t.test()` function. However, we will now enter the formula of the GLM as the first argument. For this test, we will assume the variances of the two groups are equal (i.e., not significantly different from each other); however, this assumption should be tested.
```{r}
t.test(datasetSalaries$salary ~ datasetSalaries$discipline, var.equal = TRUE)
```
Note: There are other ways to enter the data into the `t.test()` function depending on how the dataset is formatted.
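For instance, here is a minimal sketch of an equivalent call that passes each group's scores as separate vectors, which is handy when the two groups are stored in separate columns (wide format):
```{r}
# Equivalent call: pass each group's salaries as its own vector.
t.test(
  datasetSalaries$salary[datasetSalaries$discipline == "Applied"],
  datasetSalaries$salary[datasetSalaries$discipline == "Theoretical"],
  var.equal = TRUE
)
```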
From this output, we can see that the *t*-statistic (`t`) is `3.1406`, the degrees of freedom (`df`) are `395`, and the `p-value` is `0.001813`, with the mean salary of professors in the `Applied` discipline being `$118,028.70` and the mean salary of professors in the `Theoretical` discipline being `$108,548.40`. Therefore, professors in applied disciplines earn significantly higher salaries than professors in theoretical disciplines.
### GLM approach
```{r}
# fit the GLM: salary regressed on discipline (Applied = 0.5, Theoretical = -0.5)
model <- lm(salary ~ 1 + discipline, data = datasetSalaries)
```
```{r}
summary(model)
```
Notice that the *t*-statistic (`t value`) of `3.14`, the `395` degrees of freedom (`df`), and the *p*-value of `.002` are identical to the output from the `t.test()` function.
We can also see that if we subtract the mean salary of professors in the theoretical discipline from that of professors in the applied discipline (taken from the `t.test()` results), we obtain the same mean difference as the `estimate` in the GLM results (i.e., `$118,028.70` - `$108,548.40` = `$9,480.30`).
Furthermore, we can also see that the intercept is the mean of the group mean salaries for the applied and theoretical disciplines ($\frac{118028.70+108548.40}{2} = 113288.55$).
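As a quick check, here is a minimal sketch that recomputes both quantities directly from the group means:
```{r}
# Recompute the slope and intercept from the group means.
group_means <- tapply(datasetSalaries$salary, datasetSalaries$discipline, mean)
group_means["Applied"] - group_means["Theoretical"] # matches the slope estimate
mean(group_means) # matches the intercept
```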
## Statistical decision
Given that the *p*-value of `.002` is less than the alpha level ($\alpha$) of 0.05, we will reject the null hypothesis.
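If you prefer to make this comparison in code, here is a minimal sketch that pulls the slope's *p*-value out of the fitted model:
```{r}
# Compare the slope's p-value to alpha = .05; TRUE means reject the null.
alpha <- 0.05
p_value <- coef(summary(model))[2, "Pr(>|t|)"] # row 2 is the discipline slope
p_value < alpha
```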
## APA statement
An independent samples *t*-test was performed to test whether the salary of professors differed depending on their discipline. The salary of professors was significantly higher for professors in applied disciplines (*M* = \$`r format(scales::comma(round(describeBy(datasetSalaries$salary,datasetSalaries$discipline)[["Applied"]][,"mean"],2)),scientific=F)`, *SD* = \$`r format(scales::comma(round(describeBy(datasetSalaries$salary,datasetSalaries$discipline)[["Applied"]][,"sd"],2)),scientific=F)`) than for professors in theoretical disciplines (*M* = \$`r format(scales::comma(round(describeBy(datasetSalaries$salary,datasetSalaries$discipline)[["Theoretical"]][,"mean"],2)),scientific=F)`, *SD* = \$`r format(scales::comma(round(describeBy(datasetSalaries$salary,datasetSalaries$discipline)[["Theoretical"]][,"sd"],2)),scientific=F)`), *t*(395) = 3.14, *p* = .002.
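The means and standard deviations in that statement are pulled inline with `describeBy()` from the `psych` package (assumed to be loaded earlier in the book, since the inline code above already calls it). Run on its own, the same call prints the full descriptives table:
```{r}
# Descriptive statistics by discipline (source of the M and SD values above).
describeBy(datasetSalaries$salary, datasetSalaries$discipline)
```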
## Visualization
```{r fig-independent-samples-t-test, warning = F, fig.cap = "A dot plot of the 9-month academic salaries of professors in applied compared to theoretical disciplines. For each discipline, the dot represents the mean salary and the bars represent the 95% CI. \nNote: The data points of each group actually lie along a single vertical line on the x-axis; they are jittered (dispersed) only for easier visualization of all data points."}
# calculate descriptive statistics along with the 95% CI
dataset_summary <- datasetSalaries %>%
  mutate(discipline = ifelse(discipline == "Applied", 0.5, -0.5)) %>%
  group_by(discipline) %>%
  summarize(
    mean = mean(salary),
    sd = sd(salary),
    n = n(),
    sem = sd / sqrt(n), # standard error of the mean
    tcrit = abs(qt(0.05 / 2, df = n - 1)), # critical t for a 95% CI
    ME = tcrit * sem, # margin of error
    LL95CI = mean - ME, # lower limit of the 95% CI
    UL95CI = mean + ME # upper limit of the 95% CI
  )
mean_of_means <- mean(dataset_summary$mean)
# plot
datasetSalaries %>%
  mutate(discipline = ifelse(discipline == "Applied", 0.5, -0.5)) %>%
  ggplot(., aes(discipline, salary)) +
  geom_jitter(alpha = 0.1, width = 0.05) +
  geom_line(data = dataset_summary, aes(x = discipline, y = mean), color = "#3182bd") +
  geom_errorbar(data = dataset_summary, aes(x = discipline, y = mean, ymin = LL95CI, ymax = UL95CI), width = 0.02, color = "#3182bd") +
  geom_point(data = dataset_summary, aes(x = discipline, y = mean), size = 3, color = "#3182bd") +
  geom_point(aes(x = 0, y = mean_of_means), size = 3, color = "#3182bd") +
  labs(
    x = "Discipline",
    y = "9-Month Academic Salary (USD)",
    caption = ""
  ) +
  theme_classic() +
  scale_y_continuous(
    labels = scales::dollar
  ) +
  scale_x_continuous(breaks = c(-1, -.5, 0, .5, 1)) +
  # Applied is coded as 0.5 and Theoretical as -0.5 in the mutate() above,
  # so the group labels go on the matching sides of the x-axis.
  annotate(geom = "text", x = -.5, y = 0, label = "Theoretical", size = 4) +
  annotate(geom = "text", x = .5, y = 0, label = "Applied", size = 4)
```