-
Notifications
You must be signed in to change notification settings - Fork 272
/
Copy path06-ch6.Rmd
819 lines (550 loc) · 52.3 KB
/
06-ch6.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
# Regression Models with Multiple Regressors {#rmwmr}
In what follows we introduce linear regression models that use more than just one explanatory variable and discuss important key concepts in multiple regression. As we broaden our scope beyond the relationship of only two variables (the dependent variable and a single regressor) some potential new issues arise such as *multicollinearity* and *omitted variable bias* (OVB). In particular, this chapter deals with omitted variables and its implication for causal interpretation of OLS-estimated coefficients.
Naturally, we will discuss estimation of multiple regression models using `r ttcode("R")`. We will also illustrate the importance of thoughtful usage of multiple regression models via simulation studies that demonstrate the consequences of using highly correlated regressors or misspecified models.
The packages `r ttcode("AER")` [@R-AER] and `r ttcode("MASS")` [@R-MASS] are needed for reproducing the code presented in this chapter. Make sure that the following code chunk executes without any errors.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
library(AER)
library(MASS)
```
## Omitted Variable Bias
The previous analysis of the relationship between test score and class size discussed in Chapters \@ref(lrwor) and \@ref(htaciitslrm) has a major flaw: we ignored other determinants of the dependent variable (test score) that correlate with the regressor (class size). Remember that influences on the dependent variable which are not captured by the model are collected in the error term, which we so far assumed to be uncorrelated with the regressor. However, this assumption is violated if we exclude determinants of the dependent variable which vary with the regressor. This might induce an estimation bias, i.e., the mean of the OLS estimator's sampling distribution is no longer equals to the true mean. In our example we therefore wrongly estimate the causal effect on test scores of a unit change in the student-teacher ratio, on average. This issue is called *omitted variable bias* (OVB) and is summarized by Key Concept 6.1.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC6.1">
<h3 class = "right"> Key Concept 6.1 </h3>
<h3 class = "left"> Omitted Variable Bias in Regression with a Single Regressor </h3>
Omitted variable bias is the bias in the OLS estimator that arises when the regressor, $X$, is *correlated* with an omitted variable. For omitted variable bias to occur, two conditions must be fulfilled:
1. $X$ is correlated with the omitted variable.
2. The omitted variable is a determinant of the dependent variable $Y$.
Together, 1. and 2. result in a violation of the first OLS assumption $E(u_i\\vert X_i) = 0$. Formally, the resulting bias can be expressed as
$$ \\hat\\beta_1 \\xrightarrow[]{p} \\beta_1 + \\rho_{Xu} \\frac{\\sigma_u}{\\sigma_X}. \\tag{6.1} $$
See Appendix 6.1 of the book for a detailed derivation. (<a href="#mjx-eqn-6.1">6.1</a>) states that OVB is a problem that cannot be solved by increasing the number of observations used to estimate $\\beta_1$, as $\\hat\\beta_1$ is inconsistent: OVB prevents the estimator from converging in probability to the true parameter value. Strength and direction of the bias are determined by $\\rho_{Xu}$, the correlation between the error term and the regressor.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Omitted Variable Bias in Regression with a Single Regressor]{6.1}
Omitted variable bias is the bias in the OLS estimator that arises when the regressor, $X$, is \\textit{correlated} with an omitted variable. For omitted variable bias to occur, two conditions must be fulfilled:\\newline
\\begin{enumerate}
\\item $X$ is correlated with the omitted variable.
\\item The omitted variable is a determinant of the dependent variable $Y$.
\\end{enumerate}\\vspace{0.5cm}
Together, 1. and 2. result in a violation of the first OLS assumption $E(u_i\\vert X_i) = 0$. Formally, the resulting bias can be expressed as
\\begin{align}
\\hat\\beta_1 \\xrightarrow[]{p} \\beta_1 + \\rho_{Xu} \\frac{\\sigma_u}{\\sigma_X}.
\\end{align}
See Appendix 6.1 of the book for a detailed derivation. (6.1) states that OVB is a problem that cannot be solved by increasing the number of observations used to estimate $\\beta_1$, as $\\hat\\beta_1$ is inconsistent: OVB prevents the estimator from converging in probability to the true parameter value. Strength and direction of the bias are determined by $\\rho_{Xu}$, the correlation between the error term and the regressor.
\\end{keyconcepts}
')
```
In the example of test score and class size, it is easy to come up with variables that may cause such a bias, if omitted from the model. As mentioned in the book, a highly relevant variable could be the percentage of English learners in the school district: it is plausible that the ability to speak, read and write English is an important factor for successful learning. Therefore, students that are still learning English are likely to perform worse in tests than native speakers. Also, it is conceivable that the share of English learning students is bigger in school districts where class sizes are relatively large: think of poor urban districts where a lot of immigrants live. <br>
Let us think about a possible bias induced by omitting the share of English learning students ($PctEL$) in view of (6.1). When the estimated regression model does not include $PctEL$ as a regressor although the true data generating process (DGP) is
$$ TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times PctEL \tag{6.2}$$
where $STR$ and $PctEL$ are correlated, we have
$$\rho_{STR,PctEL}\neq0.$$
Let us investigate this using `r ttcode("R")`. After defining our variables we may compute the correlation between $STR$ and $PctEL$ as well as the correlation between $STR$ and $TestScore$.
```{r, message=F, warnings=F}
# load the AER package
library(AER)
# load the data set
data(CASchools)
# define variables
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
# compute correlations
cor(CASchools$STR, CASchools$score)
cor(CASchools$STR, CASchools$english)
```
The fact that $\widehat{\rho}_{STR, Testscore} = -0.2264$ is cause for concern that Omitting $PctEL$ leads to a negatively biased estimate $\hat\beta_1$ which is $\widehat{\rho}_{STR, Testscore} = -0.2264$ since this indicates that $\rho_{Xu} < 0$. As a consequence we expect $\hat\beta_1$, the coefficient on $STR$, to be too large in absolute value. Put differently, the OLS estimate of $\hat\beta_1$ suggests that small classes improve test scores, but that the effect of small classes is overestimated as it captures the effect of having fewer English learners, too.
What happens to the magnitude of $\hat\beta_1$ if we add the variable $PctEL$ to the regression, that is, if we estimate the model
$$ TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times PctEL + u $$
instead? And what do we expect about the sign of $\hat\beta_2$, the estimated coefficient on $PctEL$? Following the reasoning above we should still end up with a negative but larger coefficient estimate $\hat\beta_1$ than before and a negative estimate $\hat\beta_2$.
Let us estimate both regression models and compare. Performing a multiple regression in `r ttcode("R")` is straightforward. One can simply add additional variables to the right hand side of the `formula` argument of the function `r ttcode("lm()")` by using their names and the `r ttcode("+")` operator.
```{r}
# estimate both regression models
mod <- lm(score ~ STR, data = CASchools)
mult.mod <- lm(score ~ STR + english, data = CASchools)
# print the results to the console
mod
mult.mod
```
We find the outcomes to be consistent with our expectations.
The following section discusses some theory on multiple regression models.
## The Multiple Regression Model {#tmrm}
The multiple regression model extends the basic concept of the simple regression model discussed in chapters \@ref(lrwor) and \@ref(htaciitslrm). A multiple regression model enables us to estimate the effect on $Y_i$ of changing a regressor $X_{1i}$ if the remaining regressors $X_{2i},X_{3i}\dots,X_{ki}$ *do not vary*. In fact we already have performed estimation of the multiple regression model (<a href="#mjx-eqn-6.2">6.2</a>) using `r ttcode("R")` in the previous section. The interpretation of the coefficient on student-teacher ratio is the effect on test scores of a one unit change student-teacher ratio if the percentage of English learners is kept constant.
Just like in the simple regression model, we assume the true relationship between $Y$ and $X_{1i},X_{2i}\dots\dots,X_{ki}$ to be linear. On average, this relation is given by the population regression function
$$ E(Y_i\vert X_{1i}=x_1, X_{2i}=x_2, X_{3i}=x_3,\dots, X_{ki}=x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k. \tag{6.3} $$
As in the simple regression model, the relation $$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki}$$ does not hold exactly since there are disturbing influences to the dependent variable $Y$ we cannot observe as explanatory variables. Therefore we add an error term $u$ which represents deviations of the observations from the population regression line to (<a href="#mjx-eqn-6.3">6.3</a>). This yields the population multiple regression model $$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + u_i, \ i=1,\dots,n. \tag{6.4} $$
Key Concept 6.2 summarizes the core concepts of the multiple regression model.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC6.2">
<h3 class = "right"> Key Concept 6.2 </h3>
<h3 class = "left"> The Multiple Regression Model </h3>
The multiple regression model is
$$ Y_i = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\beta_3 X_{3i} + \\dots + \\beta_k X_{ki} + u_i \\ \\ , \\ \\ i=1,\\dots,n. $$
The designations are similar to those in the simple regression model:
- $Y_i$ is the $i^{th}$ observation in the dependent variable. Observations on the $k$ regressors are denoted by $X_{1i},X_{2i},\\dots,X_{ki}$ and $u_i$ is the error term.
- The average relationship between $Y$ and the regressors is given by the population regression line
$$ E(Y_i\\vert X_{1i}=x_1, X_{2i}=x_2, X_{3i}=x_3,\\dots, X_{ki}=x_k) = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\beta_3 x_3 + \\dots + \\beta_k x_k. $$
- $\\beta_0$ is the intercept; it is the expected value of $Y$ when all $X$s equal $0$. $\\beta_j \\ , \\ j=1,\\dots,k$ are the coefficients on $X_j \\ , \\ j=1,\\dots,k$. $\\beta_1$ measures the expected change in $Y_i$ that results from a one unit change in $X_{1i}$ while holding all other regressors constant.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Multiple Regression Model]{6.2}
\\begin{itemize}
\\item $Y_i$ is the $i^{th}$ observation in the dependent variable. Observations on the $k$ regressors are denoted by $X_{1i},X_{2i},\\dots,X_{ki}$ and $u_i$ is the error term.
\\item The average relationship between $Y$ and the regressors is given by the population regression line
$$ E(Y_i\\vert X_{1i}=x_1, X_{2i}=x_2, X_{3i}=x_3,\\dots, X_{ki}=x_k) = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\beta_3 x_3 + \\dots + \\beta_k x_k. $$
\\item $\\beta_0$ is the intercept; it is the expected value of $Y$ when all $X$s equal $0$. $\\beta_j \\ , \\ j=1,\\dots,k$ are the coefficients on $X_j \\ , \\ j=1,\\dots,k$. $\\beta_1$ measures the expected change in $Y_i$ that results from a one unit change in $X_{1i}$ while holding all other regressors constant.
\\end{itemize}
\\end{keyconcepts}
')
```
How can we estimate the coefficients of the multiple regression model (<a href="#mjx-eqn-6.4">6.4</a>)? We will not go too much into detail on this issue as our focus is on using `r ttcode("R")`. However, it should be pointed out that, similarly to the simple regression model, the coefficients of the multiple regression model can be estimated using OLS. As in the simple model, we seek to minimize the sum of squared mistakes by choosing estimates $b_0,b_1,\dots,b_k$ for the coefficients $\beta_0,\beta_1,\dots,\beta_k$ such that
$$\sum_{i=1}^n (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i} - \dots - b_k X_{ki})^2 \tag{6.5}$$
is minimized. Note that (<a href="#mjx-eqn-6.5">6.5</a>) is simply an extension of $SSR$ in the case with just one regressor and a constant. The estimators that minimize (<a href="#mjx-eqn-6.5">6.5</a>) are hence denoted by $\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k$ and, as in the simple regression model, we call them the ordinary least squares estimators of $\beta_0,\beta_1,\dots,\beta_k$. For the predicted value of $Y_i$ given the regressors and the estimates $\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k$ we have
$$ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \dots +\hat\beta_k X_{ki}. $$
The difference between $Y_i$ and its predicted value $\hat{Y}_i$ is called the OLS residual of observation $i$: $\hat{u} = Y_i - \hat{Y}_i$.
For further information regarding the theory behind multiple regression, see Chapter 18.1 of the book which inter alia presents a derivation of the OLS estimator in the multiple regression model using matrix notation.
Now let us jump back to the example of test scores and class sizes. The estimated model object is `mult.mod`. As for simple regression models we can use `r ttcode("summary()")` to obtain information on estimated coefficients and model statistics.
```{r}
summary(mult.mod)$coef
```
So the estimated multiple regression model is
$$ \widehat{TestScore} = 686.03 - 1.10 \times STR - 0.65 \times PctEL \tag{6.6}. $$
Unlike in the simple regression model where the data can be represented by points in the two-dimensional coordinate system, we now have three dimensions. Hence observations can be represented by points in three-dimensional space. Therefore (<a href="#mjx-eqn-6.6">6.6</a>) is now no longer a regression line but a *regression plane*. This idea extends to higher dimensions when we further expand the number of regressors $k$. We then say that the regression model can be represented by a hyperplane in the $k+1$ dimensional space. It is already hard to imagine such a space if $k=3$ and we best stick with the general idea that, in the multiple regression model, the dependent variable is explained by a *linear combination of the regressors*. However, in the present case we are able to visualize the situation. The following figure is an interactive 3D visualization of the data and the estimated regression plane (<a href="#mjx-eqn-6.6">6.6</a>).
<center>
```{r plotlyfig, echo=F, fig.height=5, fig.width=9, warning=F, message=F}
if(knitr::opts_knit$get("rmarkdown.pandoc.to") == "html"){
library(plotly)
y <- CASchools$score
x1 <- CASchools$STR
x2 <- CASchools$english
df <- data.frame(y, x1, x2)
reg <- lm(y ~ x1 + x2)
cf.mod <- coef(reg)
x1.seq <- seq(min(x1),max(x1),length.out=25)
x2.seq <- seq(min(x2),max(x2),length.out=25)
z <- t(outer(x1.seq, x2.seq, function(x,y) cf.mod[1] + cf.mod[2]*x + cf.mod[3]*y))
rbPal <- colorRampPalette(c('red','blue'))
cols <- rbPal(10)[as.numeric(cut(abs(y-reg$fitted.values), breaks = 10))]
m <- list(
t = 5
)
p <- plot_ly(x=x1.seq,
y=x2.seq,
z=z,
colors = "red",
opacity = 0.9,
name = "Reg.Plane",
type="surface"
) %>%
add_trace(data=df, name='CASchools', x=x1, y=x2, z=y, mode="markers", type="scatter3d",
marker = list(color=cols, opacity=0.85, symbol=105, size=4)
) %>%
hide_colorbar() %>%
layout(
margin = m,
showlegend = FALSE,
scene = list(
aspectmode = "manual", aspectratio = list(x=1, y=1.3, z=1),
xaxis = list(title = "STR"),
yaxis = list(title = "PctEL"),
zaxis = list(title = "TestScore"),
camera = list(eye = list(x = -2,y = -0.1, z=0.05),
center = list(x = 0,
y = 0,
z = 0
)
)
)
)
p <- p %>% config(showLink = F, displayModeBar = F);p
}
```
</center>
```{r, echo=F, purl=F, eval=my_output=="latex", results='asis'}
cat('\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}')
```
We observe that the estimated regression plane fits the data reasonably well --- at least with regard to the shape and spatial position of the points. The color of the markers is an indicator for the absolute deviation from the predicted regression plane. Observations that are colored more reddish lie close to the regression plane while the color shifts to blue with growing distance. An anomaly that can be seen from the plot is that there might be heteroskedasticity: we see that the dispersion of regression errors made, i.e., the distance of observations to the regression plane tends to decrease as the share of English learning students increases.
## Measures of Fit in Multiple Regression {#mofimr}
In multiple regression, common summary statistics are $SER$, $R^2$ and the adjusted $R^2$.
Taking the code from Section \@ref(tmrm), simply use `summary(mult.mod)` to obtain the $SER$, $R^2$ and adjusted-$R^2$. For multiple regression models the $SER$ is computed as
$$ SER = s_{\hat u} = \sqrt{s_{\hat u}^2}, $$
where modify the denominator of the premultiplied factor in $s_{\hat u}^2$ in order to accommodate for additional regressors. Thus,
$$ s_{\hat u}^2 = \frac{1}{n-k-1} \, SSR $$
with $k$ denoting the number of regressors *excluding* the intercept.
While `r ttcode("summary()")` computes the $R^2$ just as in the case of a single regressor, it is no reliable measure for multiple regression models. This is due to $R^2$ increasing whenever an additional regressor is added to the model. Adding a regressor decreases the $SSR$ --- at least unless the respective estimated coefficient is exactly zero, which practically never happens (see Chapter 6.4 of the book for a detailed argument). The adjusted $R^2$ takes this into consideration by "punishing" the addition of regressors using a correction factor. So the adjusted $R^2$, or simply $\bar{R}^2$, is a modified version of $R^2$. It is defined as
$$ \bar{R}^2 = 1-\frac{n-1}{n-k-1} \, \frac{SSR}{TSS}. $$
As you may have already suspected, `r ttcode("summary()")` adjusts the formula for $SER$ and it computes $\bar{R}^2$ and of course $R^2$ by default, thereby leaving the decision of which measure to rely on to the user.
You can find both measures at the bottom of the output produced by calling `summary(mult.mod)`.
```{r}
summary(mult.mod)
```
We can also compute the measures by hand using the formulas above. Let us check that the results coincide with the values provided by `r ttcode("summary()")`.
```{r}
# define the components
n <- nrow(CASchools) # number of observations (rows)
k <- 2 # number of regressors
y_mean <- mean(CASchools$score) # mean of avg. test-scores
SSR <- sum(residuals(mult.mod)^2) # sum of squared residuals
TSS <- sum((CASchools$score - y_mean )^2) # total sum of squares
ESS <- sum((fitted(mult.mod) - y_mean)^2) # explained sum of squares
# compute the measures
SER <- sqrt(1/(n-k-1) * SSR) # standard error of the regression
Rsq <- 1 - (SSR / TSS) # R^2
adj_Rsq <- 1 - (n-1)/(n-k-1) * SSR/TSS # adj. R^2
# print the measures to the console
c("SER" = SER, "R2" = Rsq, "Adj.R2" = adj_Rsq)
```
Now, what can we say about the fit of our multiple regression model for test scores with the percentage of English learners as an additional regressor? Does it improve on the simple model including only an intercept and a measure of class size? The answer is yes: compare $\bar{R}^2$ with that obtained for the simple regression model `r ttcode("mod")`.
Including $PctEL$ as a regressor improves the $\bar{R}^2$, which we deem to be more reliable in view of the above discussion. Notice that the difference between $R^2$ and $\bar{R}^2$ is small since $k=2$ and $n$ is large. In short, the fit of (<a href="#mjx-eqn-6.6">6.6</a>) improves vastly on the fit of the simple regression model with $STR$ as the only regressor.<br>
Comparing regression errors we find that the precision of the multiple regression model (<a href="#mjx-eqn-6.6">6.6</a>) improves upon the simple model as adding $PctEL$ lowers the $SER$ from $18.6$ to $14.5$ units of test score.
As already mentioned, $\bar{R}^2$ may be used to quantify how good a model fits the data. However, it is rarely a good idea to maximize these measures by stuffing the model with regressors. You will not find any serious study that does so. Instead, it is more useful to include regressors that improve the estimation of the causal effect of interest which is *not* assessed well by $R^2$. The issue of variable selection is covered in Chapter \@ref(nrf).
## OLS Assumptions in Multiple Regression
In the multiple regression model we extend the three least squares assumptions of the simple regression model (see Chapter \@ref(lrwor)) and add a fourth assumption. These assumptions are presented in Key Concept 6.4. We will not go into the details of assumptions 1-3 since their ideas generalize easy to the case of multiple regressors. We will focus on the fourth assumption. This assumption rules out perfect correlation between regressors.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC6.4">
<h3 class = "right"> Key Concept 6.4 </h3>
<h3 class = "left"> The Least Squares Assumptions in the Multiple Regression Model </h3>
The multiple regression model is given by
$$ Y_i = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki} + u_i \\ , \\ i=1,\\dots,n. $$
The OLS assumptions in the multiple regression model are an extension of the ones made for the simple regression model:
1. Regressors $(X_{1i}, X_{2i}, \\dots, X_{ki}, Y_i) \\ , \\ i=1,\\dots,n$, are drawn such that the i.i.d. assumption holds.
2. $u_i$ is an error term with conditional mean zero given the regressors, i.e.,
$$ E(u_i\\vert X_{1i}, X_{2i}, \\dots, X_{ki}) = 0. $$
3. Large outliers are unlikely, formally $X_{1i},\\dots,X_{ki}$ and $Y_i$ have finite fourth moments.
4. No perfect multicollinearity.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Least Squares Assumptions in the Multiple Regression Model]{6.4}
The multiple regression model is given by
$$ Y_i = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki} + u_i \\ , \\ i=1,\\dots,n. $$
The OLS assumptions in the multiple regression model are an extension of the ones made for the simple regression model:\\newline
\\begin{enumerate}
\\item Regressors $(X_{1i}, X_{2i}, \\dots, X_{ki}, Y_i) \\ , \\ i=1,\\dots,n$, are drawn such that the i.i.d. assumption holds.
\\item $u_i$ is an error term with conditional mean zero given the regressors, i.e.,
$$ E(u_i\\vert X_{1i}, X_{2i}, \\dots, X_{ki}) = 0. $$
\\item Large outliers are unlikely, formally $X_{1i},\\dots,X_{ki}$ and $Y_i$ have finite fourth moments.
\\item No perfect multicollinearity.
\\end{enumerate}
\\end{keyconcepts}
')
```
### Multicollinearity {-}
*Multicollinearity* means that two or more regressors in a multiple regression model are *strongly* correlated. If the correlation between two or more regressors is perfect, that is, one regressor can be written as a linear combination of the other(s), we have *perfect multicollinearity*. While strong multicollinearity in general is unpleasant as it causes the variance of the OLS estimator to be large (we will discuss this in more detail later), the presence of perfect multicollinearity makes it impossible to solve for the OLS estimator, i.e., the model cannot be estimated in the first place.
The next section presents some examples of perfect multicollinearity and demonstrates how `r ttcode("lm()")` deals with them.
#### Examples of Perfect Multicollinearity {-}
How does `r ttcode("R")` react if we try to estimate a model with perfectly correlated regressors?
`r ttcode("lm")` will produce a warning in the first line of the coefficient section of the output (`r ttcode("1 not defined because of singularities")`) and ignore the regressor(s) which is (are) assumed to be a linear combination of the other(s). Consider the following example where we add another variable `r ttcode("FracEL")`, the fraction of English learners, to `r ttcode("CASchools")` where observations are scaled values of the observations for `r ttcode("english")` and use it as a regressor together with `r ttcode("STR")` and `r ttcode("english")` in a multiple regression model. In this example `r ttcode("english")` and `r ttcode("FracEL")` are perfectly collinear. The `r ttcode("R")` code is as follows.
```{r}
# define the fraction of English learners
CASchools$FracEL <- CASchools$english / 100
# estimate the model
mult.mod <- lm(score ~ STR + english + FracEL, data = CASchools)
# obtain a summary of the model
summary(mult.mod)
```
The row `r ttcode("FracEL")` in the coefficients section of the output consists of `r ttcode("NA")` entries since `r ttcode("FracEL")` was excluded from the model.
If we were to compute OLS by hand, we would run into the same problem but no one would be helping us out! The computation simply fails. Why is this? Take the following example:
Assume you want to estimate a simple linear regression model with a constant and a single regressor $X$. As mentioned above, for perfect multicollinearity to be present $X$ has to be a linear combination of the other regressors. Since the only other regressor is a constant (think of the right hand side of the model equation as $\beta_0 \times 1 + \beta_1 X_i + u_i$ so that $\beta_1$ is always multiplied by $1$ for every observation), $X$ has to be constant as well. For $\hat\beta_1$ we have
\[ \hat\beta_1 = \frac{\sum_{i = 1}^n (X_i - \bar{X})(Y_i - \bar{Y})} { \sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\widehat{Cov}(X,Y)}{\widehat{Var}(X)}. \tag{6.7} \]
The variance of the regressor $X$ is in the denominator. Since the variance of a constant is zero, we are not able to compute this fraction and $\hat{\beta}_1$ is undefined.
**Note:** In this special case the denominator in (<a href="#mjx-eqn-6.7">6.7</a>) equals zero, too. Can you show that?
Let us consider two further examples where our selection of regressors induces perfect multicollinearity. First, assume that we intend to analyze the effect of class size on test score by using a dummy variable that identifies classes which are not small ($NS$). We define that a school has the $NS$ attribute when the school's average student-teacher ratio is at least $12$,
$$ NS = \begin{cases} 0, \ \ \ \text{if STR < 12} \\ 1 \ \ \ \text{otherwise.} \end{cases} $$
We add the corresponding column to `r ttcode("CASchools")` and estimate a multiple regression model with covariates `r ttcode("computer")` and `r ttcode("english")`.
```{r}
# if STR smaller 12, NS = 0, else NS = 1
CASchools$NS <- ifelse(CASchools$STR < 12, 0, 1)
# estimate the model
mult.mod <- lm(score ~ computer + english + NS, data = CASchools)
# obtain a model summary
summary(mult.mod)
```
Again, the output of `r ttcode("summary(mult.mod)")` tells us that inclusion of `r ttcode("NS")` in the regression would render the estimation infeasible. What happened here? This is an example where we made a logical mistake when defining the regressor `r ttcode("NS")`: taking a closer look at $NS$, the redefined measure for class size, reveals that there is not a single school with $STR<12$ hence $NS$ equals one for all observations. We can check this by printing the contents of `r ttcode("CASchools$NS")` or by using the function `r ttcode("table()")`, see `?table`.
```{r}
table(CASchools$NS)
```
`r ttcode("CASchools$NS")` is a vector of $420$ ones and our data set includes $420$ observations. This obviously violates assumption 4 of Key Concept 6.4: the observations for the intercept are always $1$,
\begin{align*}
intercept = \, & \lambda \cdot NS.
\end{align*}
\begin{align*}
\begin{pmatrix} 1
\\ \vdots \\ 1
\end{pmatrix} = \, & \lambda \cdot
\begin{pmatrix} 1 \\
\vdots \\ 1
\end{pmatrix} \\
\Leftrightarrow \, & \lambda = 1.
\end{align*}
Since the regressors can be written as a linear combination of each other, we face perfect multicollinearity and `r ttcode("R")` excludes `r ttcode("NS")` from the model. Thus the take-away message is: think carefully about how the regressors in your models relate to each other!
Another example of perfect multicollinearity is known as the *dummy variable trap*. This may occur when multiple dummy variables are used as regressors. A common case for this is when dummies are used to sort the data into mutually exclusive categories. For example, suppose we have spatial information that indicates whether a school is located in the North, West, South or East of the U.S. This allows us to create the dummy variables
\begin{align*}
North_i =&
\begin{cases}
1 \ \ \text{if located in the north} \\
0 \ \ \text{otherwise}
\end{cases} \\
West_i =&
\begin{cases}
1 \ \ \text{if located in the west} \\
0 \ \ \text{otherwise}
\end{cases} \\
South_i =&
\begin{cases}
1 \ \ \text{if located in the south} \\
0 \ \ \text{otherwise}
\end{cases} \\
East_i =&
\begin{cases}
1 \ \ \text{if located in the east} \\
0 \ \ \text{otherwise}.
\end{cases}
\end{align*}
Since the regions are mutually exclusive, for every school $i=1,\dots,n$ we have $$ North_i + West_i + South_i + East_i = 1. $$
We run into problems when trying to estimate a model that includes a constant and *all four* direction dummies in the model, e.g., $$TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_3 \times North_i + \beta_4 \times West_i + \beta_5 \times South_i +\\ \beta_6 \times East_i + u_i \tag{6.8}$$
since then for all observations $i=1,\dots,n$ the constant term is a linear combination of the dummies:
\begin{align}
intercept = \, & \lambda_1 \cdot (North + West + South + East) \\
\begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} = \, & \lambda_1 \cdot \begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} \\ \Leftrightarrow \, & \lambda_1 = 1
\end{align}
and we have perfect multicollinearity. Thus the "dummy variable trap" means not paying attention and falsely including exhaustive dummies *and* a constant in a regression model.
How does `r ttcode("lm()")` handle a regression like (<a href="#mjx-eqn-6.8">6.8</a>)? Let us first generate some artificial categorical data and append a new column named `r ttcode("directions")` to `r ttcode("CASchools")` and see how `r ttcode("lm()")` behaves when asked to estimate the model.
```{r}
# set seed for reproducibility
set.seed(1)
# generate artificial data on location
CASchools$direction <- sample(c("West", "North", "South", "East"),
420,
replace = T)
# estimate the model
mult.mod <- lm(score ~ STR + english + direction, data = CASchools)
# obtain a model summary
summary(mult.mod)
```
Notice that `r ttcode("R")` solves the problem on its own by generating and including the dummies `r ttcode("directionNorth")`, `r ttcode("directionSouth")` and `r ttcode("directionWest")` but omitting `r ttcode("directionEast")`. Of course, the omission of every other dummy instead would achieve the same. Another solution would be to exclude the constant and to include all dummies instead.
Does this mean that the information on schools located in the East is lost? Fortunately, this is not the case: exclusion of `r ttcode("directEast")` just alters the interpretation of coefficient estimates on the remaining dummies from absolute to relative. For example, the coefficient estimate on `r ttcode("directionNorth")` states that, on average, test scores in the North are about $1.61$ points higher than in the East.
A last example considers the case where a perfect linear relationship arises from redundant regressors. Suppose we have a regressor $PctES$, the percentage of English speakers in the school, which is given by
$$ PctES = 100 - PctEL$$
and both $PctES$ and $PctEL$ are included in a regression model. One regressor is redundant since the other one conveys the same information. Since this obviously is a case where the regressors can be written as linear combination, we end up with perfect multicollinearity, again.
Let us do this in `r ttcode("R")`.
```{r}
# Percentage of english speakers
CASchools$PctES <- 100 - CASchools$english
# estimate the model
mult.mod <- lm(score ~ STR + english + PctES, data = CASchools)
# obtain a model summary
summary(mult.mod)
```
Once more, `r ttcode("lm()")` refuses to estimate the full model using OLS and excludes `r ttcode("PctES")`.
For more explanation of perfect multicollinearity and its impact on the OLS estimator in general multiple regression models using matrix notation, refer to Chapter 18.1 of the book.
#### Imperfect Multicollinearity {-}
As opposed to perfect multicollinearity, imperfect multicollinearity is --- to a certain extent --- less of a problem. In fact, imperfect multicollinearity is the reason why we are interested in estimating multiple regression models in the first place: the OLS estimator allows us to *isolate* influences of *correlated* regressors on the dependent variable. If it was not for these dependencies, there would not be a reason to resort to a multiple regression approach and we could simply work with a single-regressor model. However, this is rarely the case in applications. We already know that ignoring dependencies among regressors which influence the outcome variable has an adverse effect on estimation results.
So when and why is imperfect multicollinearity a problem? Suppose you have the regression model
$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \tag{6.9} $$
and you are interested in estimating $\beta_1$, the effect on $Y_i$ of a one unit change in $X_{1i}$, while holding $X_{2i}$ constant. You do not know that the true model indeed includes $X_2$. You follow some reasoning and add $X_2$ as a covariate to the model in order to address a potential omitted variable bias. You are confident that $E(u_i\vert X_{1i}, X_{2i})=0$ and that there is no reason to suspect a violation of the assumptions 2 and 3 made in Key Concept 6.4. If $X_1$ and $X_2$ are highly correlated, OLS struggles to precisely estimate $\beta_1$. That means that although $\hat\beta_1$ is a consistent and unbiased estimator for $\beta_1$, it has a large variance due to $X_2$ being included in the model. If the errors are homoskedastic, this issue can be better understood from the formula for the variance of $\hat\beta_1$ in the model (<a href="#mjx-eqn-6.9">6.9</a>) (see Appendix 6.2 of the book):
$$ \sigma^2_{\hat\beta_1} = \frac{1}{n} \left( \frac{1}{1-\rho^2_{X_1,X_2}} \right) \frac{\sigma^2_u}{\sigma^2_{X_1}}. \tag{6.10} $$
First, if $\rho_{X_1,X_2}=0$, i.e., if there is no correlation between both regressors, including $X_2$ in the model has no influence on the variance of $\hat\beta_1$. Secondly, if $X_1$ and $X_2$ are correlated, $\sigma^2_{\hat\beta_1}$ is inversely proportional to $1-\rho^2_{X_1,X_2}$ so the stronger the correlation between $X_1$ and $X_2$, the smaller is $1-\rho^2_{X_1,X_2}$ and thus the bigger is the variance of $\hat\beta_1$. Thirdly, increasing the sample size helps to reduce the variance of $\hat\beta_1$. Of course, this is not limited to the case with two regressors: in multiple regressions, imperfect multicollinearity inflates the variance of one or more coefficient estimators. It is an empirical question which coefficient estimates are severely affected by this and which are not. When the sample size is small, one often faces the decision whether to accept the consequence of adding a large number of covariates (higher variance) or to use a model with only few regressors (possible omitted variable bias). This is called *bias-variance trade-off*.
In sum, undesirable consequences of imperfect multicollinearity are generally not the result of a logical error made by the researcher (as is often the case for perfect multicollinearity) but are rather a problem that is linked to the data used, the model to be estimated and the research question at hand.
### Simulation Study: Imperfect Multicollinearity {-}
Let us conduct a simulation study to illustrate the issues sketched above.
1. We use (<a href="#mjx-eqn-6.8">6.9</a>) as the data generating process and choose $\beta_0 = 5$, $\beta_1 = 2.5$ and $\beta_2 = 3$ and $u_i$ is an error term distributed as $\mathcal{N}(0,5)$. In the first step, we sample the regressor data from a bivariate normal distribution: $$ X_i = (X_{1i}, X_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right].$$ It is straightforward to see that the correlation between $X_1$ and $X_2$ in the population is rather low:
$$ \rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{2.5}{10} = 0.25. $$
2. Next, we estimate the model (<a href="#mjx-eqn-6.9">6.9</a>) and save the estimates for $\beta_1$ and $\beta_2$. This is repeated $10000$ times with a `for` loop so we end up with a large number of estimates that allow us to describe the distributions of $\hat\beta_1$ and $\hat\beta_2$.
3. We repeat steps 1 and 2 but increase the covariance between $X_1$ and $X_2$ from $2.5$ to $8.5$ such that the correlation between the regressors is high: $$ \rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{8.5}{10} = 0.85. $$
4. In order to assess the effect on the precision of the estimators of increasing the collinearity between $X_1$ and $X_2$ we estimate the variances of $\hat\beta_1$ and $\hat\beta_2$ and compare.
```{r, message=F, warning=F, fig.align='center', cache=T}
# load packages
library(MASS)
library(mvtnorm)
# set number of observations
n <- 50
# initialize vectors of coefficients
coefs1 <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
coefs2 <- coefs1
# set seed
set.seed(1)
# loop sampling and estimation
for (i in 1:10000) {
# for cov(X_1,X_2) = 0.25
X <- rmvnorm(n, c(0, 0), sigma = cbind(c(10, 2.5), c(2.5, 10)))
u <- rnorm(n, sd = 5)
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs1[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
# for cov(X_1,X_2) = 0.85
X <- rmvnorm(n, c(0, 0), sigma = cbind(c(10, 8.5), c(8.5, 10)))
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs2[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
}
# obtain variance estimates
diag(var(coefs1))
diag(var(coefs2))
```
We are interested in the variances which are the diagonal elements. We see that due to the high collinearity, the variances of $\hat\beta_1$ and $\hat\beta_2$ have more than tripled, meaning it is more difficult to precisely estimate the true coefficients.
## The Distribution of the OLS Estimators in Multiple Regression
As in simple linear regression, different samples will produce different values of the OLS estimators in the multiple regression model. Again, this variation leads to uncertainty of those estimators which we seek to describe using their sampling distribution(s). In short, if the assumption made in Key Concept 6.4 hold, the large sample distribution of $\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k$ become multivariate normal such that the individual estimators themselves are also normally distributed. Key Concept 6.5 summarizes the corresponding statements made in Chapter 6.6 of the book. A more technical derivation of these results can be found in Chapter 18 of the book.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC6.5">
<h3 class = "right"> Key Concept 6.5 </h3>
<h3 class = "left"> Large-sample distribution of $\\hat\\beta_0,\\hat\\beta_1,\\dots,\\hat\\beta_k$ </h3>
If the least squares assumptions in the multiple regression model (see Key Concept 6.4) hold, then, in large samples, the OLS estimators $\\hat\\beta_0,\\hat\\beta_1,\\dots,\\hat\\beta_k$ are jointly normally distributed. We also say that their joint distribution is *multivariate* normal. Further, each $\\hat\\beta_j$ is distributed as $\\mathcal{N}(\\beta_j,\\sigma_{\\beta_j}^2)$.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Large-sample distribution of $\\hat\\beta_0,\\hat\\beta_1,\\dots,\\hat\\beta_k$]{6.5}
If the least squares assumptions in the multiple regression model (see Key Concept 6.4) hold, then, in large samples, the OLS estimators $\\hat\\beta_0,\\hat\\beta_1,\\dots,\\hat\\beta_k$ are jointly normally distributed. We also say that their joint distribution is \\textit{multivariate} normal. Further, each $\\hat\\beta_j$ is distributed as $\\mathcal{N}(\\beta_j,\\sigma_{\\beta_j}^2)$.
\\end{keyconcepts}
')
```
Essentially, Key Concept 6.5 states that, if the sample size is large, we can approximate the individual sampling distributions of the coefficient estimators by specific normal distributions and their joint sampling distribution by a multivariate normal distribution.
How can we use `r ttcode("R")` to get an idea of what the joint PDF of the coefficient estimators in multiple regression model looks like? When estimating a model on some data, we end up with a set of point estimates that do not reveal much information on the *joint* density of the estimators. However, with a large number of estimations using repeatedly randomly sampled data from the same population, we can generate a large set of point estimates that allows us to plot an *estimate* of the joint density function.
The approach to do this in `r ttcode("R")` is a follows:
- Generate $10000$ random samples of size $50$ using the DGP $$ Y_i = 5 + 2.5 \cdot X_{1i} + 3 \cdot X_{2i} + u_i $$ where the regressors $X_{1i}$ and $X_{2i}$ are sampled for each observation as $$ X_i = (X_{1i}, X_{2i}) \sim \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right] $$ and $$ u_i \sim \mathcal{N}(0,5) $$ is an error term.
- For each of the $10000$ simulated sets of sample data, we estimate the model $$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i $$ and save the coefficient estimates $\hat\beta_1$ and $\hat\beta_2$.
- We compute a density estimate of the joint distribution of $\hat\beta_1$ and $\hat\beta_2$ in the model above using the function `r ttcode("kde2d()")` from the package `r ttcode("MASS")`, see `?MASS`. This estimate is then plotted using the function `r ttcode("persp()")`.
```{r, echo=T, message=F, warning=F, fig.align='center'}
# load packages
library(MASS)
library(mvtnorm)
# set sample size
n <- 50
# initialize vector of coefficients
coefs <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
# set seed for reproducibility
set.seed(1)
# loop sampling and estimation
for (i in 1:10000) {
X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
u <- rnorm(n, sd = 5)
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs[i,] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
}
# compute density estimate
kde <- kde2d(coefs[, 1], coefs[, 2])
# plot density estimate
persp(kde,
theta = 310,
phi = 30,
xlab = "beta_1",
ylab = "beta_2",
zlab = "Est. Density",
main = "2D Kernel Density Estimate")
```
From the plot above we can see that the density estimate has some similarity to a bivariate normal distribution (see Chapter \@ref(pt)) though it is not very pretty and probably a little rough. Furthermore, there is a correlation between the estimates such that $\rho\neq0$ in (<a href="#mjx-eqn-2.1">2.1</a>). Also, the distribution's shape deviates from the symmetric bell shape of the bivariate standard normal distribution and has an elliptical surface area instead.
```{r}
# estimate the correlation between estimators
cor(coefs[, 1], coefs[, 2])
```
Where does this correlation come from? Notice that, due to the way we generated the data, correlation between the regressors $X_1$ and $X_2$ appears. Correlation between the regressors in a multiple regression model always translates into correlation between the estimators (see Appendix 6.2 of the book). In our case, the positive correlation between $X_1$ and $X_2$ translates to negative correlation between $\hat\beta_1$ and $\hat\beta_2$. To get a better idea of the distribution you can vary the point of view in the subsequent smooth interactive 3D plot of the same density estimate used for plotting with `r ttcode("persp()")`. Here you can see that the shape of the distribution is somewhat stretched due to $\rho=-0.20$ and it is also apparent that both estimators are unbiased since their joint density seems to be centered close to the true parameter vector $(\beta_1,\beta_2) = (2.5,3)$.
<center>
```{r, echo=F, message=F, warning=F}
if(knitr::opts_knit$get("rmarkdown.pandoc.to") == "html") {
library(plotly)
kde <- kde2d(coefs[, 1], coefs[, 2], n = 100)
p <- plot_ly(x = kde$x, y = kde$y, z = kde$z,
type = "surface", showscale = FALSE)
p %>% layout(scene = list(zaxis = list(title = "Est. Density"
),
xaxis = list(title = "hat_beta_1"
),
yaxis = list(title = "hat_beta_2"
)
)
) %>%
config(showLink = F, displayModeBar = F)
}
```
</center>
```{r, echo=F, purl=F, eval=my_output=="latex", results='asis'}
cat('\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}')
```
## Exercises {#exercises-6}
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 1. The Boston Housing Data Set {-}
For the course of this section, you will work with <tt>Boston</tt>, the Boston Housing data set which contains 506 observations on housing values in suburbs of Boston. <tt>Boston</tt> data set comes with the package <tt>MASS</tt>. Both the package <tt>MASS</tt> and <tt>AER</tt> are required for the interactive <tt>R</tt> exercises below, and they are already installed.
**Instructions:**
+ Load both the package and the data set.
+ Get yourself an overview over the data using function(s) known from the previous chapters.
+ Estimate a simple linear regression model that explains the median house value of districts (<tt>medv</tt>) by the percent of households with low socioeconomic status, <tt>lstat</tt>, and a constant. Save the model to <tt>bh_mod</tt>.
+ Print a coefficient summary to the console that reports robust standard errors.
<iframe src="DCL/ex6_1.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hint:**
You only need basic <tt>R</tt> functions here: <tt>library()</tt>, <tt>data()</tt>, <tt>lm()</tt> and <tt>coeftest()</tt>.
</div>') } else {
cat('\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}')
}
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat("
<div class = 'DCexercise'>
#### 2. A Multiple Regression Model of Housing Prices I {-}
Now, let us expand the approach from the previous exercise by adding additional regressors to the model and estimating it again.
As has been discussed in Chapter \\@ref(mofimr), adding regressors to the model improves the fit so the $SER$ decreases and the $R^2$ increases.
The packages <tt>AER</tt> and <tt>MASS</tt> have been loaded. The model object <tt>bh_mod</tt> is available in the environment.
**Instructions:**
+ Regress the median housing value in a district, <tt>medv</tt>, on the average age of the buildings, <tt>age</tt>, the per-capita crime rate, <tt>crim</tt>, the percentage of individuals with low socioeconomic status, <tt>lstat</tt>, and a constant. Put differently, estimate the model $$medv_i = \\beta_0 + \\beta_1 lstat_i + \\beta_2 age_i + \\beta_3 crim_i + u_i.$$
+ Print a coefficient summary to the console that reports robust standard errors for the augmented model.
+ The $R^2$ of the simple regression model is stored in <tt>R2_res</tt>. Save the multiple regression model's $R^2$ to <tt>R2_unres</tt> and check whether the augmented model yields a higher $R^2$. Use <tt><</tt> or <tt>></tt> for the comparison.
<iframe src='DCL/ex6_2.html' frameborder='0' scrolling='no' style='width:100%;height:340px'></iframe>
</div>")
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
<div class = "DCexercise">
#### 3. A Multiple Regression Model of Housing Prices II {-}
The equation below describes estimated model from Exercise 2 (heteroskedasticity-robust standard errors in parentheses).
$$ \\widehat{medv}_i = \\underset{(0.74)}{32.828} \\underset{(0.08)}{-0.994} \\times lstat_i \\underset{(0.03)}{-0.083} \\times crim_i + \\underset{(0.02)}{0.038} \\times age_i$$
This model is saved in <tt>bh_mult_mod</tt> which is available in the working environment.
**Instructions:**
As has been stressed in Chapter \\@ref(mofimr), it is not meaningful to use $R^2$ when comparing regression models with a different number of regressors. Instead, the $\\bar{R}^2$ should be used. $\\bar{R}^2$ adjusts for the circumstance that the $SSR$ reduces when a regressor is added to the model.
+ Use the model object to compute the correction factor $CF = \\frac{n-1}{n-k-1}$ where $n$ is the number of observations and $k$ is the number of regressors, excluding the intercept. Save it to <tt>CF</tt>.
+ Use <tt>summary()</tt> to obtain $R^2$ and $\\bar{R}^2$ for <tt>bh_mult_mod</tt>. It is sufficient if you print both values to the console.
+ Check that $$\\bar{R}^2 = 1 - (1-R^2) \\cdot CF.$$ Use the <tt>==</tt> operator.
<iframe src="DCL/ex6_3.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
</div>')
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
<div class = "DCexercise">
#### 4. A Fully-Fledged Model for Housing Values? {-}
Have a look at the <a href="https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html">description</a> of the variables contained in the <tt>Boston</tt> data set.
Which variable would you expect to have the lowest $p$-value in a multiple regression model which uses *all* variables as regressors to explain <tt>medv</tt>?
**Instructions:**
+ Regress <tt>medv</tt> on all remaining variables that you find in the Boston data set.
+ Obtain a heteroskedasticity-robust summary of the coefficients.
+ The $\\bar{R}^2$ for the model in exercise 3 is $0.5533$. What can you say about the $\\bar{R}^2$ of the large regression model? Does this model improve on the previous one (no code submission needed)?
The packages <tt>AER</tt> and <tt>MASS</tt> as well as the data set <tt>Boston</tt> are loaded to the working environment.
<iframe src="DCL/ex6_4.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hints:**
+ For brevity, use the regression formula <tt>medv ~.</tt> in your call of <tt>lm()</tt>. This is a shortcut that specifies a regression of <tt>medv</tt> on all variables in the data set supplied to the argument <tt>data</tt>.
+ Use <tt>summary</tt> on both models for a comparison of both $\\bar{R}^2$s.
</div>')
```
```{r, echo=F, purl=F, results='asis', eval=my_output == "html"}
cat('
<div class = "DCexercise">
#### 5. Model Selection {-}
Maybe we can improve the model by dropping a variable?
In this exercise, you have to estimate several models, each time dropping one of the explanatory variables used in the large regression model of Exercise 4 and compare the $\\bar{R}^2$.
The full regression model from the previous exercise, <tt>full_mod</tt>, is available in your environment.
**Instructions:**
+ You are completely free in solving this exercise. We recommend the following approach:
1. Start by estimating a model <tt>mod_new</tt>, say, where, e.g., <tt>lstat</tt> is excluded from the explanatory variables.
Next, access the $\\bar{R}^2$ of this model.
2. Compare the $\\bar{R}^2$ of this model to the $\\bar{R}^2$ of the full model (this was about $0.7338$).
3. Repeat Steps 1 and 2 for all explanatory variables used in the full regression model. Save the model with the highest improvement in $\\bar{R}^2$ to <tt>better_mod</tt>.
<iframe src="DCL/ex6_5.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
</div>')
```