-
Notifications
You must be signed in to change notification settings - Fork 272
/
Copy path07-ch7.Rmd
797 lines (553 loc) · 44.8 KB
/
07-ch7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
# Hypothesis Tests and Confidence Intervals in MR Models {#htaciimr}
This chapter discusses methods that allow to quantify the sampling uncertainty in the OLS estimator of the coefficients in multiple regression models. The basis for this are hypothesis tests and confidence intervals which, just as for the simple linear regression model, can be computed using basic `r ttcode("R")` functions. We will also tackle the issue of testing joint hypotheses on these coefficients.
Make sure the packages `r ttcode("AER")` [@R-AER] and `r ttcode("stargazer")` [@R-stargazer] are installed before you go ahead and replicate the examples. The safest way to do so is by checking whether the following code chunk executes without any issues.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
library(AER)
library(stargazer)
```
## Hypothesis Tests and Confidence Intervals for a Single Coefficient
We will first discuss how to compute standard errors, test hypotheses and construct confidence intervals for a single regression coefficient $\beta_j$ in a multiple regression model. The basic idea is summarized in Key Concept 7.1.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC7.1">
<h3 class = "right"> Key Concept 7.1 </h3>
<h3 class = "left"> Testing the Hypothesis $\\beta_j = \\beta_{j,0}$ <br>
Against the Alternative $\\beta_j \\neq \\beta_{j,0}$ </h3>
<p>
1. Compute the standard error of $\\hat{\\beta_j}$.
2. Compute the $t$-statistic,
$$t^{act} = \\frac{\\hat{\\beta}_j - \\beta_{j,0}} {SE(\\hat{\\beta_j})}$$.
3. Compute the $p$-value,
$$p\\text{-value} = 2 \\Phi(-|t^{act}|)$$
where $t^{act}$ is the value of the $t$-statistic actually computed. Reject the hypothesis at the $5\\%$ significance level if the $p$-value is less than $0.05$ or, equivalently, if $|t^{act}| > 1.96$. The standard error and (typically) the $t$-statistic and the corresponding $p$-value for testing $\\beta_j = 0$ are computed automatically by suitable <tt>R</tt> functions, e.g., by <tt>summary</tt>.
</p>
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Testing the Hypothesis $\\beta_j = \\beta_{j,0}$
Against the Alternative $\\beta_j \\neq \\beta_{j,0}$]{7.1}
\\begin{enumerate}
\\item Compute the standard error of $\\hat{\\beta_j}$.
\\item Compute the $t$-statistic,
$$t^{act} = \\frac{\\hat{\\beta}_j - \\beta_{j,0}} {SE(\\hat{\\beta_j})}$$.
\\item Compute the $p$-value,
$$p\\text{-value} = 2 \\Phi(-|t^{act}|)$$
where $t^{act}$ is the value of the $t$-statistic actually computed. Reject the hypothesis at the $5\\%$ significance level if the $p$-value is less than $0.05$ or, equivalently, if $|t^{act}| > 1.96$. \\end{enumerate}\\vspace{0.5cm}
The standard error and (typically) the $t$-statistic and the corresponding $p$-value for testing $\\beta_j = 0$ are computed automatically by suitable \\texttt{R} functions, e.g., by \\texttt{summary()}.
\\end{keyconcepts}
')
```
Testing a single hypothesis about the significance of a coefficient in the multiple regression model proceeds similarly to the process in the simple regression model.
You can easily see this by inspecting the coefficient summary of the regression model
$$ TestScore = \beta_0 + \beta_1 \times STR \beta_2 \times english + u $$
already discussed in Chapter \@ref(rmwmr). Let us review this:
```{r, echo=5:7, warning=F, message=F}
library(AER)
data(CASchools)
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
model <- lm(score ~ STR + english, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
```
You can verify that these quantities are computed as in the simple regression model by manually calculating the $t$-statistics or $p$-values using the provided output above and using `r ttcode("R")` as a calculator.
For example, using the definition of the $p$-value for a two-sided test as given in Key Concept 7.1, we can confirm the $p$-value for a test of the hypothesis that the coefficient $\beta_1$, the coefficient on `r ttcode("size")`, to be approximately zero.
```{r, warning=F, message=F}
# compute two-sided p-value
2 * (1 - pt(abs(coeftest(model, vcov. = vcovHC, type = "HC1")[2, 3]),
df = model$df.residual))
```
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC7.2">
<h3 class = "right"> Key Concept 7.2 </h3>
<h3 class = "left"> Confidence Intervals for a Single Coefficient in Multiple Regression </h3>
<p>
A $95\\%$ two-sided confidence interval for the coefficient $\\beta_j$ is an interval that contains the true value of $\\beta_j$ with a $95 \\%$ probability; that is, it contains the true value of $\\beta_j$ in $95 \\%$ of all repeated samples. Equivalently, it is the set of values of $\\beta_j$ that cannot be rejected by a $5 \\%$ two-sided hypothesis test. When the sample size is large, the $95 \\%$ confidence interval for $\\beta_j$ is
$$\\left[\\hat{\\beta_j}- 1.96 \\times SE(\\hat{\\beta}_j), \\hat{\\beta_j} + 1.96 \\times SE(\\hat{\\beta_j})\\right].$$
</p>
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Confidence Intervals for a Single Coefficient in Multiple Regression]{7.2}
A $95\\%$ two-sided confidence interval for the coefficient $\\beta_j$ is an interval that contains the true value of $\\beta_j$ with a $95 \\%$ probability; that is, it contains the true value of $\\beta_j$ in $95 \\%$ of repeated samples. Equivalently, it is the set of values of $\\beta_j$ that cannot be rejected by a $5 \\%$ two-sided hypothesis test. When the sample size is large, the $95 \\%$ confidence interval for $\\beta_j$ is
$$\\left[\\hat{\\beta_j}- 1.96 \\times SE(\\hat{\\beta}_j), \\hat{\\beta_j} + 1.96 \\times SE(\\hat{\\beta_j})\\right].$$
\\end{keyconcepts}
')
```
## An Application to Test Scores and the Student-Teacher Ratio
Let us take a look at the regression from Section \@ref(mofimr) again.
Computing confidence intervals for individual coefficients in the multiple regression model proceeds as in the simple regression model using the function `r ttcode("confint()")`.
```{r, warning=F, message=F}
model <- lm(score ~ STR + english, data = CASchools)
confint(model)
```
To obtain confidence intervals at another level, say $90\%$, just set the argument `r ttcode("level")` in our call of `r ttcode("confint()")` accordingly.
```{r, warning=F, message=F}
confint(model, level = 0.9)
```
The output now reports the desired $90\%$ confidence intervals for all coefficients.
One drawback of using `r ttcode("confint()")` is that it doesn't utilize robust standard errors for calculating the confidence interval. To create large-sample confidence intervals that account for robust standard errors, you can easily do manually using the following approach.
```{r, warning=F, message=F}
# compute robust standard errors
rob_se <- diag(vcovHC(model, type = "HC1"))^0.5
# compute robust 95% confidence intervals
rbind("lower" = coef(model) - qnorm(0.975) * rob_se,
"upper" = coef(model) + qnorm(0.975) * rob_se)
# compute robust 90% confidence intervals
rbind("lower" = coef(model) - qnorm(0.95) * rob_se,
"upper" = coef(model) + qnorm(0.95) * rob_se)
```
Knowing how to use `r ttcode("R")` to make inference about the coefficients in multiple regression models, you can now answer the following question:
Can the null hypothesis that a change in the student-teacher ratio, `r ttcode("STR")`, has no significant influence on test scores, `r ttcode("scores")`, --- if we control for the percentage of students learning English in the district, `r ttcode("english")`, --- be rejected at the $10\%$ and the $5\%$ level of significance?
The output above shows that zero is not an element of the confidence interval for the coefficient on `r ttcode("STR")` such that we can reject the null hypothesis at significance levels of $5\%$ and $10\%$. The same conclusion can be made via the $p$-value for `r ttcode("STR")`: $0.00398 < 0.05 = \alpha$.
Note that rejection at the $5\%$-level implies rejection at the $10\%$ level (why?).
Recall from Chapter \@ref(cifrc) the $95\%$ confidence interval computed above *does not* tell us that a one-unit decrease in the student-teacher ratio has an effect on test scores that lies in the interval with a lower bound of $-1.9497$ and an upper bound of $-0.2529$. Once a confidence interval has been computed, a probabilistic statement like this is wrong: either the interval contains the true parameter or it does not. We do not know which is true.
### Another Augmentation of the Model {-}
What is the average effect on test scores of reducing the student-teacher ratio when the expenditures per pupil and the percentage of english learning pupils are held constant?
Let us augment our model by an additional regressor that is a measure for expenditure per pupil. Using `r ttcode("?CASchools")` we find that `r ttcode("CASchools")` contains the variable `r ttcode("expenditure")`, which provides expenditure per student.
Our model now is $$ TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_3 \times expenditure + u $$
with $expenditure$ being the total amount of expenditure per pupil in the district (thousands of dollars).
Let us now estimate the model:
```{r, echo=7:12, warning=F, message=F}
library(AER)
data(CASchools)
CASchools$size <- CASchools$students/CASchools$teachers
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
# scale expenditure to thousands of dollars
CASchools$expenditure <- CASchools$expenditure/1000
# estimate the model
model <- lm(score ~ STR + english + expenditure, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
```
The estimated effect of a one unit change in the student-teacher ratio on test scores with expenditure and the share of english learning pupils held constant is $-0.29$, which is rather small. What is more, the coefficient on $STR$ is not significantly different from zero anymore even at $10\%$ since $p\text{-value}=0.55$. Can you come up with an interpretation for these findings (see Chapter 7.1 of the book)? The insignificance of $\hat\beta_1$ could be due to a larger standard error of $\hat{\beta}_1$ resulting from adding $expenditure$ to the model so that we estimate the coefficient on $size$ less precisely. This illustrates the issue of strongly correlated regressors (imperfect multicollinearity). The correlation between $STR$ and $expenditure$ can be computed using `r ttcode("cor()")`.
```{r}
# compute the sample correlation between 'STR' and 'expenditure'
cor(CASchools$STR, CASchools$expenditure)
```
Altogether, we conclude that the new model provides no evidence that changing the student-teacher ratio, e.g., by hiring new teachers, has any effect on the test scores while keeping expenditures per student and the share of English learners constant.
## Joint Hypothesis Testing using the F-Statistic
The estimated model is
$$ \widehat{TestScore} = \underset{(15.21)}{649.58} -\underset{(0.48)}{0.29} \times STR - \underset{(0.04)}{0.66} \times english + \underset{(1.41)}{3.87} \times expenditure. $$
Now, can we reject the hypothesis that the coefficient on $size$ and the coefficient on $expenditure$ are zero? To answer this, we have to resort to joint hypothesis tests. A joint hypothesis imposes restrictions on multiple regression coefficients. This is different from conducting individual $t$-tests where a restriction is imposed on a single coefficient. Chapter 7.2 of the book explains why testing hypotheses about the model coefficients one at a time is different from testing them jointly.
The homoskedasticity-only $F$-Statistic is given by
$$ F = \frac{(SSR_{\text{restricted}} - SSR_{\text{unrestricted}})/q}{SSR_{\text{unrestricted}} / (n-k-1)} $$
with $SSR_{restricted}$ being the sum of squared residuals from the restricted regression, i.e., the regression where we impose the restriction. $SSR_{unrestricted}$ is the sum of squared residuals from the full model, $q$ is the number of restrictions under the null and $k$ is the number of regressors in the unrestricted regression.
It is fairly easy to conduct $F$-tests in `r ttcode("R")`. We can use the function `r ttcode("linearHypothesis()")`contained in the package `r ttcode("car")`.
```{r, warning=F, message=F}
# estimate the multiple regression model
model <- lm(score ~ STR + english + expenditure, data = CASchools)
# execute the function on the model object and provide both linear restrictions
# to be tested as strings
linearHypothesis(model, c("STR=0", "expenditure=0"))
```
The output reveals that the $F$-statistic for this joint hypothesis test is about $8.01$ and the corresponding $p$-value is $0.0004$. Thus, we can reject the null hypothesis that both coefficients are zero at any level of significance commonly used in practice.
A heteroskedasticity-robust version of this $F$-test (which leads to the same conclusion) can be conducted as follows:
```{r, warning=F, message=F}
# heteroskedasticity-robust F-test
linearHypothesis(model, c("STR=0", "expenditure=0"), white.adjust = "hc1")
```
The standard output of a model summary also reports an $F$-statistic and the corresponding $p$-value. The null hypothesis belonging to this $F$-test is that *all* of the population coefficients in the model except for the intercept are zero, so the hypotheses are $$H_0: \beta_1=0, \ \beta_2 =0, \ \beta_3 =0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \ \text{for at least one} \ j=1,2,3.$$
This is also called the *overall regression $F$-statistic* and the null hypothesis is obviously different from testing if only $\beta_1$ and $\beta_3$ are zero.
We now check whether the $F$-statistic belonging to the $p$-value listed in the model's summary coincides with the result reported by `r ttcode("linearHypothesis()")`.
```{r, warning=F, message=F}
# execute the function on the model object and provide the restrictions
# to be tested as a character vector
linearHypothesis(model, c("STR=0", "english=0", "expenditure=0"))
# Access the overall F-statistic from the model's summary
summary(model)$fstatistic
```
The entry `r ttcode("value")` is the overall $F$-statistics and it equals the result of `r ttcode("linearHypothesis()")`. The $F$-test rejects the null hypothesis that the model has no power in explaining test scores. It is important to know that the $F$-statistic reported by `r ttcode("summary")` is *not* robust to heteroskedasticity.
## Confidence Sets for Multiple Coefficients
Based on the $F$-statistic that we have previously encountered, we can specify confidence sets. Confidence sets are analogous to confidence intervals for single coefficients. As such, confidence sets consist of *combinations* of coefficients that contain the true combination of coefficients in, say, $95\%$ of all cases if we could repeatedly draw random samples, just like in the univariate case. Put differently, a confidence set is the set of all coefficient combinations for which we cannot reject the corresponding joint null hypothesis tested using an $F$-test.
The confidence set for two coefficients is an ellipse which is centered around the point defined by both coefficient estimates. Again, there is a very convenient way to plot the confidence set for two coefficients of model objects, namely the function `r ttcode("confidenceEllipse()")` from the `r ttcode("car")` package.
We now plot the $95\%$ confidence ellipse for the coefficients on `r ttcode("STR")` and `r ttcode("expenditure")` from the regression conducted above. By specifying the additional argument `r ttcode("fill")`, the confidence set is colored.
```{r, echo=2:3, fig.align = 'center', warning=F, message=F}
model <- lm(score ~ STR + english + expenditure, data = CASchools)
# draw the 95% confidence set for coefficients on STR and expenditure
confidenceEllipse(model,
fill = T,
lwd = 0,
which.coef = c("STR", "expenditure"),
main = "95% Confidence Set",
ylab="Coefficients of Expenditure",
xlab="Coefficients of STR")
```
We see that the ellipse is centered around $(-0.29, 3.87)$, the pair of coefficients estimates on $STR$ and $expenditure$. What is more, $(0,0)$ is not element of the $95\%$ confidence set so that we can reject $H_0: \beta_1 = 0, \ \beta_3 = 0$.
By default, `r ttcode("confidenceEllipse()")` uses homoskedasticity-only standard errors. The following code chunk shows how to compute a robust confidence ellipse and how to overlay it with the previous plot.
```{r, fig.align = 'center', warning=F, message=F}
# draw the robust 95% confidence set for coefficients on STR and expenditure
confidenceEllipse(model,
fill = T,
lwd = 0,
which.coef = c("STR", "expenditure"),
main = "95% Confidence Sets",
vcov. = vcovHC(model, type = "HC1"),
col = "red",
ylab="Coefficients of Expenditure",
xlab="Coefficients of STR")
# draw the 95% confidence set for coefficients on STR and expenditure
confidenceEllipse(model,
fill = T,
lwd = 0,
which.coef = c("STR", "expenditure"),
add = T)
```
As the robust standard errors are slightly larger than those valid under homoskedasticity only in this case, the robust confidence set is slightly larger. This is analogous to the confidence intervals for the individual coefficients.
## Model Specification for Multiple Regression
Choosing a regression specification, i.e., selecting the variables to be included in a regression model, is a difficult task. However, there are some guidelines on how to proceed. The goal is clear: obtaining an unbiased and precise estimate of the causal effect of interest. As a starting point, think about omitted variables, that is, to avoid possible bias by using suitable control variables. Omitted variables bias in the context of multiple regression is explained in Key Concept 7.3. A second step could be to compare different specifications by measures of fit. However, as we shall see one should not rely solely on $\bar{R}^2$.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC7.3">
<h3 class = "right"> Key Concept 7.3 </h3>
<h3 class = "left"> Omitted Variable Bias in Multiple Regression</h3>
<p>
Omitted variable bias is the bias in the OLS estimator that arises when regressors correlate with an omitted variable. For omitted variable bias to arise, two things must be true:
1. At least one of the included regressors must be correlated with the omitted variable.
2. The omitted variable must be a determinant of the dependent variable, $Y$.
</p>
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Omitted Variable Bias in Multiple Regression]{7.3}
Omitted variable bias is the bias in the OLS estimator that arises when regressors correlate with an omitted variable. For omitted variable bias to arise, two things must be true:\\newline
\\begin{enumerate}
\\item At least one of the included regressors must be correlated with the omitted variable.
\\item The omitted variable must be a determinant of the dependent variable, $Y$.
\\end{enumerate}
\\end{keyconcepts}
')
```
We will now discuss an example where we may encounter potential omitted variable bias in a multiple regression model:
Consider again the estimated regression equation
$$ \widehat{TestScore} = \underset{(8.7)}{686.0} - \underset{(0.43)}{1.10} \times STR - \underset{(0.031)}{0.650} \times english. $$
We are interested in estimating the causal effect of class size on test score. There might be a bias due to omitting "outside learning opportunities" from our regression since such a measure could be a determinant of the students' test scores and could also be correlated with both regressors already included in the model (so that both conditions of Key Concept 7.3 are fulfilled). "Outside learning opportunities" are a complicated concept that is difficult to quantify. A surrogate we can consider instead is the students' economic background which are strongly related to outside learning opportunities: think of wealthy parents that are able to provide time and/or money for private tuition of their children. We thus augment the model with the variable `r ttcode("lunch")`, the percentage of students that qualify for a free or subsidized lunch in school due to family incomes below a certain threshold, and reestimate the model.
```{r, echo=6:8, warning=F, message=F}
library(AER)
data(CASchools)
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
# estimate the model and print the summary to console
model <- lm(score ~ STR + english + lunch, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
```
Thus, the estimated regression line is
$$ \widehat{TestScore} = \underset{(5.56)}{700.15} - \underset{(0.27)}{1.00} \times STR - \underset{(0.03)}{0.12} \times english - \underset{(0.02)}{0.55} \times lunch. $$
We observe no substantial changes in the conclusion about the effect of $STR$ on $TestScore$: the coefficient on $STR$ changes by only $0.1$ and retains its significance.
Although the difference in estimated coefficients is not big in this case, it is useful to keep `r ttcode("lunch")` to make the assumption of conditional mean independence more credible (see Chapter 7.5 of the book).
### Model Specification in Theory and in Practice {-}
Key Concept 7.4 lists some common pitfalls when using $R^2$ and $\bar{R}^2$ to evaluate the predictive ability of regression models.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC7.4">
<h3 class = "right"> Key Concept 7.4 </h3>
<h3 class = "left"> $R^2$ and $\\bar{R}^2$: what they tell you --- and what they do not </h3>
<p>
The $R^2$ and $\\bar{R}^2$ tell you whether the regressors are good at explaining the variation of the independent variable in the sample. If the $R^2$ (or $\\bar{R}^2$) is nearly $1$, then the regressors produce a good prediction of the dependent variable in that sample, in the sense that the variance of OLS residuals is small compared to the variance of the dependent variable. If the $R^2$ (or $\\bar{R}^2$) is nearly $0$, the opposite is true.
The $R^2$ and $\\bar{R}^2$ do *not* tell you whether:
1. An included variable is statistically significant.
2. The regressors are the true cause of the movements in the dependent variable.
3. There is omitted variable bias.
4. You have chosen the most appropriate set of regressors.
</p>
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[$R^2$ and $\\bar{R}^2$: What They Tell You --- and What They Do not]{7.4}
The $R^2$ and $\\bar{R}^2$ tell you whether the regressors are good at explaining the variation of the independent variable in the sample. If the $R^2$ (or $\\bar{R}^2$) is nearly $1$, then the regressors produce a good prediction of the dependent variable in that sample, in the sense that the variance of OLS residuals is small compared to the variance of the dependent variable. If the $R^2$ (or $\\bar{R}^2$) is nearly $0$, the opposite is true.\\newline
The $R^2$ and $\\bar{R}^2$ do \\textit{not} tell you whether:\\newline
\\begin{enumerate}
\\item An included variable is statistically significant.
\\item The regressors are the true cause of the movements in the dependent variable.
\\item There is omitted variable bias.
\\item You have chosen the most appropriate set of regressors.
\\end{enumerate}
\\end{keyconcepts}
')
```
For example, think of regressing $TestScore$ on $PLS$ which measures the available parking lot space in thousand square feet. You are likely to observe a significant coefficient of reasonable magnitude and moderate to high values for $R^2$ and $\bar{R}^2$. The reason for this is that parking lot space is correlated with many determinants of the test score like location, class size, financial endowment and so on. Although we do not have observations on $PLS$, we can use `r ttcode("R")` to generate some relatively realistic data.
```{r}
# set seed for reproducibility
set.seed(1)
# generate observations for parking lot space
CASchools$PLS <- c(22 * CASchools$income
- 15 * CASchools$STR
+ 0.2 * CASchools$expenditure
+ rnorm(nrow(CASchools), sd = 80) + 3000)
```
```{r, fig.align='center',}
# plot parking lot space against test score
plot(CASchools$PLS,
CASchools$score,
xlab = "Parking Lot Space",
ylab = "Test Score",
pch = 20,
col = "steelblue")
# regress test score on PLS
summary(lm(score ~ PLS, data = CASchools))
```
$PLS$ is generated as a linear function of $expenditure$, $income$, $STR$ and a random disturbance. Therefore the data suggest that there is some positive relationship between parking lot space and test score. In fact, when estimating the model
\begin{align}
TestScore = \beta_0 + \beta_1 \times PLS + u. (\#eq:plsmod)
\end{align}
using `r ttcode("lm()")` we find that the coefficient on $PLS$ is positive and significantly different from zero. Also $R^2$ and $\bar{R}^2$ are about $0.3$ which is a lot more than the roughly $0.05$ observed when regressing the test scores on the class sizes only. This suggests that increasing the parking lot space boosts a school's test scores and that model \@ref(eq:plsmod) does even better in explaining heterogeneity in the dependent variable than a model with $STR$ as the only regressor. Keeping in mind how $PLS$ is constructed this comes as no surprise. It is evident that the high $R^2$ <it>cannot</it> be used to conclude that the estimated relation between parking lot space and test scores is causal: the (relatively) high $R^2$ is due to correlation between $PLS$ and other determinants and/or control variables. Increasing parking lot space is *not* an appropriate measure to generate more learning success!
## Analysis of the Test Score Data Set
Chapter \@ref(rmwmr) and some of the previous sections have stressed that it is important to include control variables in regression models if it is plausible that there are omitted factors. In our example of test scores, we want to estimate the causal effect of a change in the student-teacher ratio on test scores. We now provide an example of how to use multiple regression in order to alleviate omitted variable bias and demonstrate how to report this results using `r ttcode("R")`.
So far we have considered two variables that control for unobservable student characteristics which correlate with the student-teacher ratio *and* are assumed to have an impact on test scores:
+ $English$, the percentage of English learning students.
+ $lunch$, the share of students that qualify for a subsidized or even a free lunch at school.
Another new variable provided with `r ttcode("CASchools")` is `r ttcode("calworks")`, the percentage of students that qualify for the *CalWorks* income assistance program. Students eligible for *CalWorks* live in families with a total income below the threshold for the subsidized lunch program so both variables are indicators for the share of economically disadvantaged children. Both indicators are highly correlated.
```{r}
# estimate the correlation between 'calworks' and 'lunch'
cor(CASchools$calworks, CASchools$lunch)
```
There is no unambiguous way to proceed when deciding which variable to use. In any case it may not be a good idea to use both variables as regressors in view of collinearity. Therefore, we also consider alternative model specifications.
For a start, we plot student characteristics against test scores.
```{r,fig.width=8, fig.height=6}
# set up arrangement of plots
m <- rbind(c(1, 2), c(3, 0))
graphics::layout(mat = m)
# scatterplots
plot(score ~ english,
data = CASchools,
col = "steelblue",
pch = 20,
xlim = c(0, 100),
cex.main = 0.7,
xlab="English",
ylab="Score",
main = "Percentage of English language learners")
plot(score ~ lunch,
data = CASchools,
col = "steelblue",
pch = 20,
cex.main = 0.7,
xlab="Lunch",
ylab="Score",
main = "Percentage qualifying for reduced price lunch")
plot(score ~ calworks,
data = CASchools,
col = "steelblue",
pch = 20,
xlim = c(0, 100),
cex.main = 0.7,
xlab="CalWorks",
ylab="Score",
main = "Percentage qualifying for income assistance")
```
We divide the plotting area up using `r ttcode("layout()")`. The matrix `r ttcode("m")` specifies the location of the plots, see `?layout`.
We see that all relationships are negative. Here are the correlation coefficients.
```{r}
# estimate correlation between student characteristics and test scores
cor(CASchools$score, CASchools$english)
cor(CASchools$score, CASchools$lunch)
cor(CASchools$score, CASchools$calworks)
```
We shall consider five different model equations:
\begin{align*}
(I) \quad TestScore=& \, \beta_0 + \beta_1 \times STR + u, \\
(II) \quad TestScore=& \, \beta_0 + \beta_1 \times STR + \beta_2 \times english + u, \\
(III) \quad TestScore=& \, \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_3 \times lunch + u, \\
(IV) \quad TestScore=& \, \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_4 \times calworks + u, \\
(V) \quad TestScore=& \, \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_3 \times lunch + \beta_4 \times calworks + u.
\end{align*}
The best way to communicate regression results is in a table. The `r ttcode("stargazer")` package is very convenient for this purpose. It provides a function that generates professionally looking HTML and LaTeX tables that satisfy scientific standards. One simply has to provide one or multiple object(s) of class `r ttcode("lm")`. The rest is done by the function `r ttcode("stargazer()")`.
```{r, eval=F}
# load the stargazer library
library(stargazer)
# estimate different model specifications
spec1 <- lm(score ~ STR, data = CASchools)
spec2 <- lm(score ~ STR + english, data = CASchools)
spec3 <- lm(score ~ STR + english + lunch, data = CASchools)
spec4 <- lm(score ~ STR + english + calworks, data = CASchools)
spec5 <- lm(score ~ STR + english + lunch + calworks, data = CASchools)
# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(spec1, type = "HC1"))),
sqrt(diag(vcovHC(spec2, type = "HC1"))),
sqrt(diag(vcovHC(spec3, type = "HC1"))),
sqrt(diag(vcovHC(spec4, type = "HC1"))),
sqrt(diag(vcovHC(spec5, type = "HC1"))))
# generate a LaTeX table using stargazer
stargazer(spec1, spec2, spec3, spec4, spec5,
se = rob_se,
digits = 3,
header = F,
column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)"))
```
<!--html_preserve-->
```{r, results='asis', echo=F, cache=T, message=FALSE, warning=FALSE, eval=my_output == "html"}
library(stargazer)
spec1 <- lm(score ~ STR, data = CASchools)
spec2 <- lm(score ~ STR + english, data = CASchools)
spec3 <- lm(score ~ STR + english + lunch, data = CASchools)
spec4 <- lm(score ~ STR + english + calworks, data = CASchools)
spec5 <- lm(score ~ STR + english + lunch + calworks, data = CASchools)
# gather robust standard errors in a list
rob_se <- list(
sqrt(diag(vcovHC(spec1, type = "HC1"))),
sqrt(diag(vcovHC(spec2, type = "HC1"))),
sqrt(diag(vcovHC(spec3, type = "HC1"))),
sqrt(diag(vcovHC(spec4, type = "HC1"))),
sqrt(diag(vcovHC(spec5, type = "HC1")))
)
stargazer(spec1, spec2, spec3, spec4, spec5,
type = "html",
se = rob_se,
header = F,
digits = 3,
float.env = "sidewaystable",
object.names = TRUE,
dep.var.caption = "Dependent Variable: Test Score",
model.numbers = FALSE,
column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)")
)
stargazer_html_title("Regressions of Test Scores on the Student-Teacher Ratio and Control Variables", "rotsostracv")
```
<!--/html_preserve-->
```{r, results='asis', echo=F, cache=T, message=FALSE, warning=FALSE, eval=my_output == "latex"}
library(stargazer)
spec1 <- lm(score ~ STR, data = CASchools)
spec2 <- lm(score ~ STR + english, data = CASchools)
spec3 <- lm(score ~ STR + english + lunch, data = CASchools)
spec4 <- lm(score ~ STR + english + calworks, data = CASchools)
spec5 <- lm(score ~ STR + english + lunch + calworks, data = CASchools)
# gather robust standard errors in a list
rob_se <- list(
sqrt(diag(vcovHC(spec1, type = "HC1"))),
sqrt(diag(vcovHC(spec2, type = "HC1"))),
sqrt(diag(vcovHC(spec3, type = "HC1"))),
sqrt(diag(vcovHC(spec4, type = "HC1"))),
sqrt(diag(vcovHC(spec5, type = "HC1")))
)
stargazer(spec1, spec2, spec3, spec4, spec5,
type = "latex",
title = "\\label{tab:rotsostracv} Regressions of Test Scores on the Student-Teacher Ratio and Control Variables",
float.env = "sidewaystable",
column.sep.width = "-10pt",
se = rob_se,
header = F,
digits = 3,
dep.var.caption = "Dependent Variable: Test Score",
object.names = TRUE,
model.numbers = FALSE,
column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)")
)
```
Table \@ref(tab:rotsostracv) states that $score$ is the dependent variable and that we consider five models. We see that the columns of Table \@ref(tab:rotsostracv) contain most of the information provided by `r ttcode("coeftest()")` and `r ttcode("summary()")` for the regression models under consideration: the coefficients estimates equipped with significance codes (the asterisks) and standard errors in parentheses below. Although there are no $t$-statistics, it is straightforward for the reader to compute them simply by dividing a coefficient estimate by the corresponding standard error. The bottom of the table reports summary statistics for each model and a legend. For an in-depth discussion of the tabular presentation of regression results, see Chapter 7.6 of the book.
What can we conclude from the model comparison?
1. We see that adding control variables roughly halves the coefficient on `r ttcode("STR")`. Also, the estimate is not sensitive to the set of control variables used. The conclusion is that decreasing the student-teacher ratio ceteris paribus by one unit leads to an estimated average increase in test scores of about $1$ point.
2. Adding student characteristics as controls increases $R^2$ and $\bar{R}^2$ from $0.049$ (`r ttcode("spec1")`) up to $0.773$ (`r ttcode("spec3")` and `r ttcode("spec5")`), so we can consider these variables as suitable predictors for test scores. Moreover, the estimated coefficients on all control variables are consistent with the impressions gained from Figure 7.2 of the book.
3. We see that the control variables are not statistically significant in all models. For example in `r ttcode("spec5")`, the coefficient on $calworks$ is not significantly different from zero at $5\%$ since $\lvert-0.048/0.059\rvert=0.81 < 1.64$. We also observe that the effect on the estimate (and its standard error) of the coefficient on $size$ of adding $calworks$ to the base specification `r ttcode("spec3")` is negligible. We can therefore consider `r ttcode("calworks")` as a superfluous control variable, given the inclusion of `r ttcode("lunch")` in this model.
## Exercises {#exercises-7}
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 1. Hypothesis Testing in a Multiple Regression Model --- $t$-statistics and $p$-values {-}
Reconsider the <tt>Boston</tt> data set and the following estimated model (homoscedasticity-only standard errors in parentheses) from the previous chapter:
$$\\widehat{medv}_i = \\underset{(0.75)}{32.828} -\\underset{(0.05)}{0.994} \\times lstat_i -\\underset{(0.04)}{0.083} \\times crim_i + \\underset{(0.01)}{0.038} \\times age_i.$$
Just as in the simple linear regression framework we can conduct hypothesis tests about the coefficients in multiple regression models. The most common hypothesis is $H_0:\\beta_j=0$ against the alternative $H_1:\\beta_j\\ne 0$ for some $j$ in $0,1,\\dots,k$.
The packages <tt>AER</tt> and <tt>MASS</tt> have been loaded. The coefficient estimates as well as the corresponding standard errors are available in <tt>coefs</tt> and <tt>SEs</tt>, respectively.
**Instructions:**
Use vector arithmetics to solve the following tasks:
+ Compute $t$-statistics for each coefficient by using the predefined objects <tt>coefs</tt> and <tt>SEs</tt>. Assign them to <tt>tstats</tt>.
+ Compute $p$-values for each coefficient and assign them to <tt>pval</tt>.
+ Check with the help of logical operators whether the hypotheses are rejected at the $1\\%$ significance level.
<iframe src="DCL/ex7_1.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hints:**
+ The $t$-statistic for each coefficient is defined as $t=\\frac{\\widehat{\\beta}_j-\\beta_{j,0}}{SE(\\widehat{\\beta}_j)}$.
+ The $p$-value for a two-sided test using is computed as $2\\cdot\\Phi(-|t^{act}|)$ where $t^{act}$ denotes the computed $t$-statistic.
</div>') } else {
cat("\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}")
}
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
<div class = "DCexercise">
#### 2. Hypothesis Testing in a Multiple Regression Model - Confidence Intervals {-}
Consider again the estimated model
$$\\widehat{medv}_i = \\underset{(0.75)}{32.828} -\\underset{(0.05)}{0.994} \\times lstat_i -\\underset{(0.04)}{0.083} \\times crim_i + \\underset{(0.01)}{0.038} \\times age_i.$$
which is available as the object <tt>mod</tt> in your working environment. The packages <tt>AER</tt> and <tt>MASS</tt> have been loaded.
**Instructions:**
+ Construct $99\\%$ confidence intervals for all model coefficients. Use the intervals to decide whether the individual null hypotheses $H_0:\\beta_j=0$, $j=0,1,2,3,4$ are rejected at the $1\\%$ level.
<iframe src="DCL/ex7_2.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hint:**
+ You may use <tt>confint()</tt> to construct confidence intervals. The confidence level can be set via the argument <tt>level</tt>.
</div>')
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
<div class = "DCexercise">
#### 3. Robust Hypothesis Testing in Multiple Regression Models {-}
The <tt>lm</tt> object <tt>mod</tt> from the previous exercises is available in your working environment. The packages <tt>AER</tt> and <tt>MASS</tt> have been loaded.
**Instructions:**
+ Print a coefficient summary that reports heteroscedasticity-robust standard errors.
+ Access entries of the matrix generated by <tt>coeftest()</tt> to check whether the hypotheses are rejected at a 1% significance level. Use logical operators <tt><,></tt>.
<iframe src="DCL/ex7_3.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hints:**
+ Using the argument <tt>vcov.</tt> in <tt>coeftest()</tt> forces the function to use robust standard errors.
+ The $p$-values are contained in the fourth column of the output generated by <tt>coeftest()</tt>. Use square brackets to subset the matrix accordingly.
</div>')
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
<div class = "DCexercise">
#### 4. Joint Hypothesis Testing --- $F$-Test I {-}
Sometimes we are interested in testing joint hypotheses which impose restrictions on *multiple* regression coefficients. For example, in the model
$$medv_i = \\beta_0 + \\beta_1\\times lstat_i + \\beta_2\\times crim_i + \\beta_3\\times age_i + u_i$$
we may test the null $H_0: \\beta_2=\\beta_3$ vs. the alternative $H_1: \\beta_2\\ne\\beta_3$ (which is a joint hypothesis as we impose a restriction on *two* regression coefficients).
The basic idea behind testing such a hypothesis is to conduct two regressions and to compare the outcomes: for one of the regressions we impose the restrictions formalized by the null (we call this the restricted regression model), whereas for the other regression the restriction is left out (we call this the unrestricted model). From this starting point we construct a test-statistic which, under the null, follows a well known distribution, an $F$ distribution (see the next exercise).
However, in this exercise we start with the initial computations necessary to construct the test statistic.
The packages <tt>AER</tt> and <tt>MASS</tt> have been loaded.
**Instructions:**
+ Estimate the restricted model, that is, the model where the restriction formalized by $H_0: \\beta_2=\\beta_3$ is assumed to be true. Save the model in <tt>model_res</tt>.
+ Compute the $SSR$ of the restricted model and assign it to <tt>RSSR</tt>.
+ Estimate the unrestricted model, that is, the model where the restriction is assumed to be false. Save it in <tt>model_unres</tt>.
+ Compute the $SSR$ of the unrestricted model and assign it to <tt>USSR</tt>.
<iframe src="DCL/ex7_4.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hints:**
+ The restricted model can be written as $$medv_i = \\beta_0 + \\beta_1\\times lstat_i + \\beta_2\\times crim_i + \\beta_2\\times age_i + u_i$$ which, after rearranging, can be expressed as $$medv_i = \\beta_0 + \\beta_1\\times lstat_i + \\beta_2\\times(crim_i+age_i) + u_i.$$
+ The $SSR$ is defined as the sum of the squared residuals.
+ Note that the residuals of a regression model are available as <tt>residuals</tt> in the corresponding <tt>lm</tt> object. So you can access them as usual via the <tt>$</tt>-operator.
</div>
<div class = "DCexercise">')
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
#### 5. Joint Hypothesis Testing --- F-Test II {-}
After estimating the models and computing the $SSR$s you now have to compute the test-statistic and conduct the $F$-test. As mentioned in the last exercise, the test-statistic follows an $F$ distribution. More precisely, we deal with the $F_{q,n-k-1}$ distribution where $q$ denotes the number of restrictions under the null and $k$ is the of regressors in the unrestricted model, excluding the intercept.
The packages <tt>AER</tt> and <tt>MASS</tt> have been loaded. Both models (<tt>model_res</tt> and <tt>model_unres</tt>) as well as their SSR (<tt>RSSR</tt> and <tt>USSR</tt>) are available in your working environment.
**Instructions:**
+ Compute the $F$-statistic and assign it to <tt>Fstat</tt>.
+ Compute the $p$-value and assign it to <tt>pval</tt>.
+ Check whether the null is rejected at the $1\\%$ level using logical operators.
+ Verify your result by using <tt>linearHypothesis()</tt> and printing the results.
<iframe src="DCL/ex7_5.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hints:**
+ The $F$-statistic is defined as $\\frac{(RSSR-USSR)/q}{USSR/(n-k-1)}$.
+ The $p$-value can be computed as $1-F_{q,n-k-1}(F^{act})$ where $F_{q,n-k-1}$ denotes the CDF of the $F$-distribution (<tt>pf()</tt>) with degrees of freedom $q$ and $n-k-1$ and $F^{act}$ the computed $F$-statistic.
+ <tt>linearHypothesis()</tt> expects the unrestricted model as well as the null hypothesis as arguments.
</div>')
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
<div class = "DCexercise">
#### 6. Joint Hypothesis Testing - Confidence Set {-}
As you know from previous chapters constructing a confidence set for a single regression coefficient results in a simple confidence interval on the real line. However if we consider $n$ regression coefficients jointly (as we do in a joint hypothesis testing setting) we move from $\\mathbb{R}$ to $\\mathbb{R}^n$ resulting in a n-dimensional confidence set. For the sake of illustration we then often choose $n=2$, so that we end up with a representable two-dimensional plane.
Recall the estimated model
$$\\widehat{medv}_i = \\underset{(0.75)}{32.828} -\\underset{(0.05)}{0.994} \\times lstat_i -\\underset{(0.04)}{0.083} \\times crim_i + \\underset{(0.01)}{0.038} \\times age_i.$$
which is available as <tt>mod</tt> in your working environment. Assume you want to test the null $H_0: \\beta_2=\\beta_3=0$ vs. $H_1: \\beta_2\\ne 0$ or $\\beta_3\\ne 0$.
The packages <tt>AER</tt> and <tt>MASS</tt> have been loaded.
**Instructions:**
+ Construct a $99\\%$ confidence set for the coefficients of <tt>crim</tt> and <tt>lstat</tt>, that is a two-dimensional confidence set. Can you reject the null stated above?
+ Verify your visual inspection by conducting a corresponding $F$-test.
<iframe src="DCL/ex7_6.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hints:**
+ Use <tt>confidenceEllipse()</tt> to construct a two-dimensional confidence set. Besides the coefficients for which the confidence set shall be constructed (<tt>which.coef</tt>), you have to specify the confidence level (<tt>levels</tt>).
+ As usual you can use <tt>linearHypothesis()</tt> to conduct the $F$-test. Note that there are two restrictions now, hence you have to pass a vector containing both restrictions.
</div>')
```