forked from mca91/EconometricsWithR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-ch9.Rmd
1107 lines (802 loc) · 52.1 KB
/
09-ch9.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Assessing Studies Based on Multiple Regression {#asbomr}
```{r, echo=F, purl=F, message=FALSE}
# load the AER package
library(AER)
# load the the data set in the workspace
data(CASchools)
# compute STR and append it to CASchools
CASchools$STR <- CASchools$students/CASchools$teachers
# compute TestScore and append it to CASchools
CASchools$score <- (CASchools$read + CASchools$math)/2
# Add HiSTR to CASchools
CASchools$HiSTR <- as.numeric(CASchools$STR >= 20)
# Add HiEL to CASchools
CASchools$HiEL <- as.numeric(CASchools$english >= 10)
# model (2) for California
TestScore_mod2 <- lm(score ~ STR + english + lunch + log(income),
data = CASchools)
```
The majority of Chapter 9 of the book is of a theoretical nature. Therefore this section briefly reviews the concepts of internal and external validity in general and discusses examples of threats to internal and external validity of multiple regression models. We discuss consequences of
- misspecification of the functional form of the regression function
- measurement errors
- missing data and sample selection
- simultaneous causality
as well as sources of inconsistency of OLS standard errors. We also review concerns regarding internal validity and external validity in the context of forecasting using regression models.
The chapter closes with an application in `r ttcode("R")` where we assess whether results found by multiple regression using the `r ttcode("CASchools")` data can be generalized to school districts of another federal state of the United States.
For a more detailed treatment of these topics we encourage you to work through Chapter 9 of the book.
The following packages and their dependencies are needed for reproduction of the code chunks presented throughout this chapter:
+ `r ttcode("AER")`
+ `r ttcode("mvtnorm")`
+ `r ttcode("stargazer")`
```{r, warning=FALSE, message=FALSE, eval=FALSE}
library(AER)
library(mvtnorm)
library(stargazer)
```
## Internal and External Validity
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat("
<div class = 'keyconcept' id='KC9.1'>
<h3 class = 'right'> Key Concept 9.1 </h3>
<h3 class = 'left'> Internal and External Validity </h3>
A statistical analysis has *internal* validity if the statistical inference made about causal effects are valid for the considered population.
An analysis is said to have *external* validity if inferences and conclusion are valid for the studies' population and can be generalized to other populations and settings.
</div>
")
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat("\\begin{keyconcepts}[Internal and External Validity]{9.1}
A statistical analysis has \\textit{internal} validity if the statistical inference made about causal effects are valid for the considered population.\\newline
An analysis is said to have \\textit{external} validity if inferences and conclusion are valid for the studies' population and can be generalized to other populations and settings.
\\end{keyconcepts}
")
```
#### Threats to Internal Validity {-}
There are two conditions for internal validity to exist:
I. The estimator of the causal effect, which is measured by the coefficient(s) of interest, should be unbiased and consistent.
II. Statistical inference is valid, that is, hypothesis tests should have the desired size and confidence intervals should have the desired coverage probability.
In multiple regression, we estimate the model coefficients using OLS. Thus for condition I to be fulfilled we need the OLS estimator to be unbiased and consistent. For the second condition to be valid, the standard errors must be valid such that hypothesis testing and computation of confidence intervals yield results that are trustworthy. Remember that a sufficient condition for conditions 1. and 2. to be fulfilled is that the assumptions of [Key Concept 6.4](https://www.econometrics-with-r.org/6.4-ols-assumptions-in-multiple-regression.html) hold.
#### Threats to External Validity {-}
External validity might be invalid
- if there are differences between the population studied and the population of interest.
- if there are differences in the *settings* of the considered populations, e.g., the legal framework or the time of the investigation.
## Threats to Internal Validity of Multiple Regression Analysis {#ttivomra}
This section treats five sources that cause the OLS estimator in (multiple) regression models to be biased and inconsistent for the causal effect of interest and discusses possible remedies. All five sources imply a violation of the first least squares assumption presented in Key Concept 6.4.
This sections treats:
- omitted variable bias,
- misspecification of the functional form,
- measurement errors,
- missing data and sample selection,
- simultaneous causality bias.
Beside these threats for consistency of the estimator, we also briefly discuss causes of inconsistent estimation of OLS standard errors.
#### Omitted Variable Bias {-}
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC9.2">
<h3 class = "right"> Key Concept 9.2 </h3>
<h3 class = "left"> Omitted Variable Bias: Should I include More Variables in My Regression? </h3>
Inclusion of additional variables reduces the risk of omitted variable bias but may increase the variance of the estimator of the coefficient of interest.
We present some guidelines that help deciding whether to include an additional variable:
1. Specify the coefficient(s) of interest.
2. Identify the most important potential sources of omitted variable bias by using knowledge available *before* estimating the model. You should end up with a baseline specification and a set of regressors that are questionable.
3. Use different model specifications to test whether questionable regressors have coefficients different from zero.
4. Use tables to provide full disclosure of your results, i.e., present different model specifications that both support your argument and enable the reader to see the effect of including questionable regressors.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Omitted Variable Bias: Should I include More Variables in My Regression? ]{9.2}
Inclusion of additional variables reduces the risk of omitted variable bias but may increase the variance of the estimator of the coefficient of interest.\\newline
We present some guidelines that help deciding whether to include an additional variable:\\newline
\\begin{itemize}
\\item Specify the coefficient(s) of interest.
\\item Identify the most important potential sources of omitted variable bias by using knowledge available \\textit{before} estimating the model. You should end up with a base specification and a set of regressors that are questionable.
\\item Use different model specifications to test whether questionable regressors have coefficients different from zero.
\\item Use tables to provide full disclosure of your results, i.e., present different model specifications that both support your argument and enable the reader to see the effect of including questionable regressors.
\\end{itemize}
\\end{keyconcepts}
')
```
By now you should be aware of omitted variable bias and its consequences. Key Concept 9.2 gives some guidelines on how to proceed if there are control variables that possibly allow to reduce omitted variable bias. If including additional variables to mitigate the bias is not an option because there are no adequate controls, there are different approaches to solve the problem:
+ usage of panel data methods (discussed in Chapter \@ref(rwpd)),
+ usage of instrumental variables regression (discussed in Chapter \@ref(ivr)),
+ usage of a randomized control experiment (discussed in Chapter \@ref(eaqe)).
#### Misspecification of the Functional Form of the Regression Function {-}
If the population regression function is nonlinear but the regression function is linear, the functional form of the regression model is misspecified. This leads to a bias of the OLS estimator.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC9.3">
<h3 class = "right"> Key Concept 9.3 </h3>
<h3 class = "left"> Functional Form Misspecification </h3>
A regression suffers from misspecification of the functional form when the functional form of the estimated regression model differs from the functional form of the population regression function. Functional form misspecification leads to biased and inconsistent coefficient estimators. A way to detect functional form misspecification is to plot the estimated regression function and the data. This may also be helpful to choose the correct functional form.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Functional Form Misspecification]{9.3}
We say a regression suffers from misspecification of the functional form when the functional form of the estimated regression model differs from the functional form of the population regression function. Functional form misspecification leads to biased and inconsistent coefficient estimators. A way to detect functional form misspecification is to plot the estimated regression function and the data. This may also be helpful to choose the correct functional form.
\\end{keyconcepts}
')
```
It is easy to come up with examples of misspecification of the functional form: consider the case where the population regression function is $$Y_i = X_i^2,$$ but the model used is $$Y_i = \beta_0 + \beta_1 X_i + u_i.$$ Clearly, the regression function is misspecified here. We now simulate data and visualize this.
```{r, fig.align='center'}
# set seed for reproducibility
set.seed(3)
# simulate data set
X <- runif(100, -5, 5)
Y <- X^2 + rnorm(100)
# estimate the regression function
ms_mod <- lm(Y ~ X)
ms_mod
```
```{r, fig.align='center'}
# plot the data
plot(X, Y,
main = "Misspecification of Functional Form",
pch = 20,
col = "steelblue")
# plot the linear regression line
abline(ms_mod,
col = "red",
lwd = 2)
legend("bottomright",
bg = "transparent",
cex = 0.8,
lwd = 2,
col ="red",
legend = "Linear Regression Line")
```
It is evident that the regression errors are relatively small for observations close to $X=-3$ and $X=3$ but that the errors increase for $X$ values closer to zero and even more for values beyond $-4$ and $4$. Consequences are drastic: the intercept is estimated to be $8.1$ and for the slope parameter, we obtain an estimate obviously very close to zero. This issue does not disappear as the number of observations is increased because OLS is biased *and* inconsistent due to the misspecification of the regression function.
#### Measurement Error and Errors-in-Variables Bias {-}
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC9.4">
<h3 class = "right"> Key Concept 9.4 </h3>
<h3 class = "left"> Errors-in-Variable Bias </h3>
When independent variables are measured imprecisely, we speak of errors-in-variables bias. This bias does not disappear if the sample size is large. If the measurement error has mean zero and is independent of the affected variable, the OLS estimator of the respective coefficient is biased towards zero.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Errors-in-Variable Bias]{9.4}
When independent variables are measured imprecisely, we speak of errors-in-variables bias. This bias does not disappear if the sample size is large. If the measurement error has mean zero and is independent of the affected variable, the OLS estimator of the respective coefficient is biased towards zero.
\\end{keyconcepts}
')
```
Suppose you are incorrectly measuring the single regressor $X_i$ so that there is a measurement error and you observe $\overset{\sim}{X}_i$ instead of $X_i$. Then, instead of estimating the regression model for the population, which is : $$ Y_i = \beta_0 + \beta_1 X_i + u_i $$, we end up estimating the regression model below:
\begin{align*}
Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + \underbrace{\beta_1 (X_i - \overset{\sim}{X}_i) + u_i}_{=v_i}. \\
Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + v_i.
\end{align*}
where $\overset{\sim}{X}_i$ and the error term $v_i$ are correlated. Thus OLS would be biased and inconsistent for the true $\beta_1$ in this example. One can show that direction and strength of the bias depend on the correlation between the observed regressor, $\overset{\sim}{X}_i$, and the measurement error, $w_i =X_i - \overset{\sim}{X}_i$. This correlation in turn depends on the type of the measurement error made.
The classical measurement error model assumes that the measurement error, $w_i$, has zero mean and that it is uncorrelated with the variable, $X_i$, and the error term of the population regression model, $u_i$:
\begin{equation}
\overset{\sim}{X}_i = X_i + w_i, \ \ \rho_{w_i,u_i}=0, \ \ \rho_{w_i,X_i}=0.
\end{equation}
Then it holds that
\begin{equation}
\widehat{\beta}_1 \xrightarrow{p}{\frac{\sigma_{X}^2}{\sigma_{X}^2 + \sigma_{w}^2}} \beta_1, (\#eq:cmembias)
\end{equation}
which implies inconsistency as $\sigma_{X}^2, \sigma_{w}^2 > 0$ such that the fraction in \@ref(eq:cmembias) is smaller than $1$. Note that there are two extreme cases:
1. If there is no measurement error, $\sigma_{w}^2=0$ such that $\widehat{\beta}_1 \xrightarrow{p}{\beta_1}$.
2. If $\sigma_{w}^2 \gg \sigma_{X}^2$ we have $\widehat{\beta}_1 \xrightarrow{p}{0}$. This is the case if the measurement error is so large that there essentially is no information on $X$ in the data that can be used to estimate $\beta_1$.
The most obvious way to deal with errors-in-variables bias is to use an accurately measured $X$. If this is not possible, instrumental variables regression is an option. One might also deal with the issue by using a mathematical model of the measurement error and adjust the estimates appropriately: if it is plausible that the classical measurement error model applies and if there is information that can be used to estimate the ratio in equation \@ref(eq:cmembias), one could compute an estimate that corrects for the downward bias.
For example, consider two bivariate normally distributed random variables $X,Y$. It is a [well known result](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Bivariate_case) that the conditional expectation function of $Y$ given $X$ has the form
\begin{align}
E(Y\vert X) = E(Y) + \rho_{X,Y} \frac{\sigma_{Y}}{\sigma_{X}}\left[X-E(X)\right]. (\#eq:bnormexpfn)
\end{align} Thus for
\begin{align}
(X, Y) \sim \mathcal{N}\left[\begin{pmatrix}50\\ 100\end{pmatrix},\begin{pmatrix}10 & 5 \\ 5 & 10 \end{pmatrix}\right] (\#eq:bvnormd)
\end{align} according to \@ref(eq:bnormexpfn), the population regression function is
\begin{align*}
Y_i =& \, 100 + 0.5 (X_i - 50) \\
=& \, 75 + 0.5 X_i. (\#eq:bnormregfun)
\end{align*}
Now suppose you gather data on $X$ and $Y$, but that you can only measure $\overset{\sim}{X_i} = X_i + w_i$ with $w_i \overset{i.i.d.}{\sim} \mathcal{N}(0,10)$. Since the $w_i$ are independent of the $X_i$, there is no correlation between the $X_i$ and the $w_i$ so that we have a case of the classical measurement error model. We now illustrate this example in `r ttcode("R")` using the package `r ttcode("mvtnorm")` [@R-mvtnorm].
```{r, eval = T}
# set seed
set.seed(1)
# load the package 'mvtnorm' and simulate bivariate normal data
library(mvtnorm)
dat <- data.frame(rmvnorm(1000, c(50, 100),
sigma = cbind(c(10, 5), c(5, 10))))
# set columns names
colnames(dat) <- c("X", "Y")
```
We now estimate a simple linear regression of $Y$ on $X$ using this sample data and run the same regression again but this time we add i.i.d. $\mathcal{N}(0,10)$ errors added to $X$.
```{r, eval = T}
# estimate the model (without measurement error)
noerror_mod <- lm(Y ~ X,
data = dat)
# estimate the model (with measurement error in X)
dat$X <- dat$X + rnorm(n = 1000, sd = sqrt(10))
error_mod <- lm(Y ~ X,
data = dat)
# print estimated coefficients to console
noerror_mod$coefficients
error_mod$coefficients
```
Next, we visualize the results and compare with the population regression function.
```{r, fig.align='center'}
# plot sample data
plot(dat$X, dat$Y,
pch = 20,
col = "steelblue",
xlab = "X",
ylab = "Y")
# add population regression function
abline(coef = c(75, 0.5),
col = "darkgreen",
lwd = 1.5)
# add estimated regression functions
abline(noerror_mod,
col = "purple",
lwd = 1.5)
abline(error_mod,
col = "darkred",
lwd = 1.5)
# add legend
legend("topleft",
bg = "transparent",
cex = 0.8,
lty = 1,
col = c("darkgreen", "purple", "darkred"),
legend = c("Population", "No Errors", "Errors"))
```
In the situation without measurement error, the estimated regression function is close to the population regression function. Things are different when we use the mismeasured regressor $X$: both the estimate for the intercept and the estimate for the coefficient on $X$ differ considerably from results obtained using the "clean" data on $X$. In particular $\widehat{\beta}_1 = 0.255$, so there is a downward bias. We are in the comfortable situation to know $\sigma_X^2$ and $\sigma^2_w$. This allows us to correct for the bias using \@ref(eq:cmembias). Using this information we obtain the biased-corrected estimate $$\frac{\sigma_X^2 + \sigma_w^2}{\sigma_X^2} \cdot \widehat{\beta}_1 = \frac{10+10}{10} \cdot 0.255 = 0.51$$ which is quite close to $\beta_1=0.5$, the true coefficient from the population regression function.
Bear in mind that the above analysis uses a single sample. Thus one may argue that the results are just a coincidence. Can you show the contrary using a simulation study?
#### Missing Data and Sample Selection {-}
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC9.5">
<h3 class = "right"> Key Concept 9.5 </h3>
<h3 class = "left"> Sample Selection Bias </h3>
When the sampling process influences the availability of data and when there is a relation of this sampling process to the dependent variable that goes beyond the dependence on the regressors, we say that there is a sample selection bias. This bias is due to correlation between one or more regressors and the error term. Sample selection implies both bias and inconsistency of the OLS estimator.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Sample Selection Bias]{9.5}
When the sampling process influences the availability of data and when there is a relation of this sampling process to the dependent variable that goes beyond the dependence on the regressors, we say that there is a sample selection bias. This bias is due to correlation between one or more regressors and the error term. Sample selection implies both bias and inconsistency of the OLS estimator.
\\end{keyconcepts}
')
```
There are three cases of sample selection. Only one of them poses a threat to internal validity of a regression study. The three cases are:
1. Data are missing at random.
2. Data are missing based on the value of a regressor.
3. Data are missing due to a selection process which is related to the dependent variable.
Let us jump back to the example of variables $X$ and $Y$ distributed as stated in equation \@ref(eq:bvnormd) and illustrate all three cases using `r ttcode("R")`.
If data are missing at random, this is nothing but losing the observations. For example, losing $50\%$ of the sample would be the same as never having seen the (randomly chosen) half of the sample observed. Therefore, missing data do not introduce an estimation bias and "only" lead to less efficient estimators.
```{r, fig.align='center'}
# set seed
set.seed(1)
# simulate data
dat <- data.frame(rmvnorm(1000, c(50, 100),
sigma = cbind(c(10, 5), c(5, 10))))
colnames(dat) <- c("X", "Y")
# mark 500 randomly selected observations
id <- sample(1:1000,
size = 500)
plot(dat$X[-id],
dat$Y[-id],
col = "steelblue",
pch = 20,
cex = 0.8,
xlab = "X",
ylab = "Y")
points(dat$X[id],
dat$Y[id],
cex = 0.8,
col = "gray",
pch = 20)
# add the population regression function
abline(coef = c(75, 0.5),
col = "darkgreen",
lwd = 1.5)
# add the estimated regression function for the full sample
abline(noerror_mod)
# estimate model case 1 and add the regression line
dat <- dat[-id, ]
c1_mod <- lm(dat$Y ~ dat$X,
data = dat)
abline(c1_mod,
col = "purple")
# add a legend
legend("bottomright",
lty = 1,
bg = "transparent",
cex = 0.8,
col = c("darkgreen", "black", "purple"),
legend = c("Population", "Full sample", "500 obs. randomly selected"))
```
The gray dots represent the $500$ discarded observations. When using the remaining observations, the estimation results deviate only marginally from the results obtained using the full sample.
Selecting data randomly based on the value of a regressor has also the effect of reducing the sample size and does not introduce estimation bias. We will now drop all observations with $X > 45$, estimate the model again and compare.
```{r, fig.align='center'}
# set random seed
set.seed(1)
# simulate data
dat <- data.frame(rmvnorm(1000, c(50, 100),
sigma = cbind(c(10, 5), c(5, 10))))
colnames(dat) <- c("X", "Y")
# mark observations
id <- dat$X >= 45
plot(dat$X[-id],
dat$Y[-id],
col = "steelblue",
cex = 0.8,
pch = 20,
xlab = "X",
ylab = "Y")
points(dat$X[id],
dat$Y[id],
col = "gray",
cex = 0.8,
pch = 20)
# add population regression function
abline(coef = c(75, 0.5),
col = "darkgreen",
lwd = 1.5)
# add estimated regression function for full sample
abline(noerror_mod)
# estimate model case 1, add regression line
id <- which(dat$X >= 45)
dat <- dat[-id, ]
c2_mod <- lm(dat$Y ~ dat$X,
data = dat)
abline(c2_mod,
col = "purple")
# add legend
legend("topleft",
lty = 1,
bg = "transparent",
cex = 0.8,
col = c("darkgreen", "black", "purple"),
legend = c("Population",
"Full sample",
expression(paste("Obs.with ",X <= 45))))
```
Note that although we dropped more than $90\%$ of all observations, the estimated regression function is very close to the line estimated based on the full sample.
In the third case we face sample selection bias. We can illustrate this by using only observations with $X_i<55$ and $Y_i>100$. These observations are easily identified using the function `r ttcode("which()")` and logical operators: `which(dat$X < 55 & dat$Y > 100)`.
```{r, fig.align='center'}
# set random seed
set.seed(1)
# simulate data
dat <- data.frame(rmvnorm(1000, c(50,100),
sigma = cbind(c(10,5), c(5,10))))
colnames(dat) <- c("X","Y")
# mark observations
id <- which(dat$X <= 55 & dat$Y >= 100)
plot(dat$X[-id],
dat$Y[-id],
col = "gray",
cex = 0.8,
pch = 20,
xlab = "X",
ylab = "Y")
points(dat$X[id],
dat$Y[id],
col = "steelblue",
cex = 0.8,
pch = 20)
# add population regression function
abline(coef = c(75, 0.5),
col = "darkgreen",
lwd = 1.5)
# add estimated regression function for full sample
abline(noerror_mod)
# estimate model case 1, add regression line
dat <- dat[id, ]
c3_mod <- lm(dat$Y ~ dat$X,
data = dat)
abline(c3_mod,
col = "purple")
# add legend
legend("bottomright",
lty = 1,
bg = "transparent",
cex = 0.8,
col = c("darkgreen", "black", "purple"),
legend = c("Population",
"Full sample",
expression(paste(X <= 55,"&",Y >= 100))))
```
We see that the selection process leads to biased estimation results.
There are methods that allow to correct for sample selection bias. However, these methods are beyond the scope of the book and are therefore not considered here. The concept of sample selection bias is summarized in Key Concept 9.5.
#### Simultaneous Causality {-}
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC9.6">
<h3 class = "right"> Key Concept 9.6 </h3>
<h3 class = "left"> Simultaneous Causality Bias </h3>
So far we have assumed that the changes in the independent variable $X$ are responsible for the changes in the dependent variable $Y$. When the reverse is also true, we say that there is *simultaneous causality* between $X$ and $Y$. This reverse causality leads to correlation between $X$ and the error in the population regression of interest such that the coefficient on $X$ is estimated with bias.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Simultaneous Causality Bias]{9.6}
So far we have assumed that the changes in the independent variable $X$ are responsible for the changes in the dependent variable $Y$. When the reverse is also true, we say that there is \\textit{simultaneous causality} between $X$ and $Y$. This reverse causality leads to correlation between $X$ and the error in the population regression of interest such that the coefficient on $X$ is estimated with a bias.
\\end{keyconcepts}
')
```
Suppose we are interested in estimating the effect of a $20\%$ increase in cigarette prices on cigarette consumption in the United States using a multiple regression model. This may be investigated using the dataset `r ttcode("CigarettesSW")` which is a part of the `r ttcode("AER")` package. `r ttcode("CigarettesSW")` is a panel data set on cigarette consumption for all 48 continental U.S. federal states from 1985-1995 and provides data on economic indicators and average local prices, taxes and per capita pack consumption.
After loading the data set, we pick observations for the year 1995 and plot logarithms of the per pack price, `r ttcode("price")`, against pack consumption, `r ttcode("packs")`, and estimate a simple linear regression model.
```{r}
# load the data set
library(AER)
data("CigarettesSW")
c1995 <- subset(CigarettesSW,
year == "1995")
# estimate the model
cigcon_mod <- lm(log(packs) ~ log(price),
data = c1995)
cigcon_mod
```
```{r}
# plot the estimated regression line and the data
plot(log(c1995$price), log(c1995$packs),
xlab = "ln(Price)",
ylab = "ln(Consumption)",
main = "Demand for Cigarettes",
pch = 20,
col = "steelblue")
abline(cigcon_mod,
col = "darkred",
lwd = 1.5)
# add legend
legend("topright",
lty=1,
col= "darkred", "Estimated Regression Line")
```
Remember from Chapter \@ref(nrf) that, due to the log-log specification, in the population regression the coefficient on the logarithm of price is interpreted as the price elasticity of consumption. The estimated coefficient suggests that a $1\%$ increase in cigarette prices reduces cigarette consumption by about $1.2\%$, on average. Have we estimated a demand curve? The answer is no: this is a classic example of simultaneous causality, see Key Concept 9.6. The observations are market equilibrium which are determined by both changes in supply and changes in demand. Therefore the price is correlated with the error term and the OLS estimator is biased. We can neither estimate a demand nor a supply curve consistently using this approach.
We will return to this issue in Chapter \@ref(ivr) which treats instrumental variables regression, an approach that allows consistent estimation when there is simultaneous causality.
#### Sources of Inconsistency of OLS Standard Errors {-}
There are two central threats to computation of consistent OLS standard errors:
1. Heteroskedasticity: implications of heteroskedasticiy have been discussed in Chapter \@ref(htaciitslrm). Heteroskedasticity-robust standard errors as computed by the function `r ttcode("vcovHC()")` from the package `r ttcode("sandwich")` produce valid standard errors under heteroskedasticity.
2. Serial correlation: if the population regression error is correlated across observations, we have serial correlation. This often happens in applications where repeated observations are used, e.g., in panel data studies. As for heteroskedasticity, `r ttcode("vcovHC()")` can be used to obtain valid standard errors when there is serial correlation.
Inconsistent standard errors will produce invalid hypothesis tests and wrong confidence intervals. For example, when testing the null that some model coefficient is zero, we cannot trust the outcome anymore because the test may fail to have a size of $5\%$ due to the wrongly computed standard error.
Key Concept 9.7 summarizes all threats to internal validity discussed above.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC9.7">
<h3 class = "right"> Key Concept 9.7 </h3>
<h3 class = "left"> Threats to Internal Validity of a Regression Study </h3>
The five primary threats to internal validity of a multiple regression study are:
1. Omitted variables.
2. Misspecification of functional form.
3. Errors in variables (measurement errors in the regressors).
4. Sample selection.
5. Simultaneous causality.
All these threats lead to failure of the first least squares assumption $$E(u_i\\vert X_{1i},\\dots ,X_{ki}) \\neq 0,$$ so that the OLS estimator is biased *and* inconsistent. <br>
Furthermore, if one does not adjust for heteroskedasticity *and*/*or* serial correlation, incorrect standard errors may be a threat to internal validity of the study.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Threats to Internal Validity of a Regression Study ]{9.7}
The five primary threats to internal validity of a multiple regression study are:\\newline
\\begin{enumerate}
\\item Omitted variables.
\\item Misspecification of functional form.
\\item Errors in variables (measurement errors in the regressors).
\\item Sample selection.
\\item Simultaneous causality.
\\end{enumerate}\\vspace{0.5cm}
All these threats lead to failure of the first least squares assumption $$E(u_i\\vert X_{1i},\\dots ,X_{ki}) \\neq 0,$$ so that the OLS estimator is biased \\textit{and} inconsistent.\\newline
Furthermore, if one does not adjust for heteroskedasticity \\textit{and}/\\textit{or} serial correlation, incorrect standard errors may be a threat to internal validity of the study.
\\end{keyconcepts}
')
```
## Internal and External Validity when the Regression is used for Forecasting
Recall the regression of test scores on the student-teacher ratio ($STR$) performed in Chapter \@ref(lrwor):
```{r}
linear_model <- lm(score ~ STR,
data = CASchools)
linear_model
```
The estimated regression function was
$$ \widehat{TestScore} = 698.9 - 2.28 \times STR.$$
The book discusses the example of a parent moving to a metropolitan area who plans to choose where to live based on the quality of local schools: a school district's average test score is an adequate measure for the quality. However, the parent has information on the student-teacher ratio only such that test scores need to be predicted. Although we have established that there is omitted variable bias in this model due to omission of variables like student learning opportunities outside school, the share of English learners and so on, `r ttcode("linear_model")` may in fact be useful for the parent:
The parent need not care if the coefficient on $STR$ has causal interpretation, she wants $STR$ to explain as much variation in test scores as possible. Therefore, despite the fact that `r ttcode("linear_model")` cannot be used to estimate the *causal effect* of a change in $STR$ on test scores, it can be considered a *reliable predictor* of test scores in general.
Thus, the threats to internal validity as summarized in Key Concept 9.7 are negligible for the parent. This is, as instanced in the book, different for a superintendent who has been tasked to take measures that increase test scores: she requires a more reliable model that does not suffer from the threats listed in Key Concept 9.7.
Consult Chapter 9.3 of the book (Stock and Watson) for the corresponding discussion.
## Example: Test Scores and Class Size {#etsacs}
This section discusses internal and external validity of the results gained from analyzing the California test score data using multiple regression models.
#### External Validity of the Study {-}
External validity of the California test score analysis means that its results can be generalized. Whether this is possible depends on the population and the setting. Following the book we conduct the same analysis using data for fourth graders in $220$ public school districts in Massachusetts in 1998. Like `r ttcode("CASchools")`, the data set `r ttcode("MASchools")` is part of the `r ttcode("AER")` package [@R-AER]. Use the help function (`?MASchools`) to get information on the definitions of all the variables contained.
We start by loading the data set and proceed by computing some summary statistics.
```{r, warning=FALSE, message=FALSE}
# attach the 'MASchools' dataset
data("MASchools")
summary(MASchools)
```
It is fairly easy to replicate key components of Table 9.1 of the book using `r ttcode("R")`. To be consistent with variable names used in the `r ttcode("CASchools")` data set, we do some formatting beforehand.
```{r}
# Customized variables in MASchools
MASchools$score <- MASchools$score4
MASchools$STR <- MASchools$stratio
# Reproduce Table 9.1 of the book
vars <- c("score", "STR", "english", "lunch", "income")
cbind(CA_mean = sapply(CASchools[, vars], mean),
CA_sd = sapply(CASchools[, vars], sd),
MA_mean = sapply(MASchools[, vars], mean),
MA_sd = sapply(MASchools[, vars], sd))
```
The summary statistics reveal that the average test score is higher for school districts in Massachusetts. The test used in Massachusetts is somewhat different from the one used in California (the Massachusetts test score also includes results for the school subject "Science"), therefore a direct comparison of test scores is not appropriate. We also see that, on average, classes are smaller in Massachusetts than in California and that the average district income, average percentage of English learners as well as the average share of students receiving subsidized lunch differ considerably from the averages computed for California. There are also notable differences in the observed dispersion of the variables.
Following the book we examine the relationship between district income and test scores in Massachusetts as we have done before in Chapter \@ref(nrf) for the California data and reproduce Figure 9.2 of the book.
```{r}
# estimate linear model
Linear_model_MA <- lm(score ~ income,
data = MASchools)
Linear_model_MA
# estimate linear-log model
Linearlog_model_MA <- lm(score ~ log(income),
data = MASchools)
Linearlog_model_MA
# estimate Cubic model
cubic_model_MA <- lm(score ~ I(income) + I(income^2) + I(income^3),
data = MASchools)
cubic_model_MA
```
```{r, fig.align='center'}
# plot data
plot(MASchools$income, MASchools$score,
pch = 20,
col = "steelblue",
xlab = "District income",
ylab = "Test score",
xlim = c(0, 50),
ylim = c(620, 780))
# add estimated regression line for the linear model
abline(Linear_model_MA,
lwd = 2)
# add estimated regression function for Linear-log model
order_id <- order(MASchools$income)
lines(MASchools$income[order_id],
fitted(Linearlog_model_MA)[order_id],
col = "darkgreen",
lwd = 2)
# add estimated cubic regression function
lines(x = MASchools$income[order_id],
y = fitted(cubic_model_MA)[order_id],
col = "orange",
lwd = 2)
# add a legend
legend("topleft",
legend = c("Linear", "Linear-Log", "Cubic"),
lty = 1,
col = c("Black", "darkgreen", "orange"))
```
The plot indicates that the cubic specification fits the data best. Interestingly, this is different from the `r ttcode("CASchools")` data where the pattern of nonlinearity is better described by the linear-log specification.
We continue by estimating most of the model specifications used for analysis of the `r ttcode("CASchools")` data set in Chapter \@ref(nrf) and use `r ttcode("stargazer()")` [@R-stargazer] to generate a tabular representation of the regression results.
```{r, eval=FALSE}
# add 'HiEL' to 'MASchools'
MASchools$HiEL <- as.numeric(MASchools$english > median(MASchools$english))
# estimate the model specifications from Table 9.2 of the book
TSMA_mod1 <- lm(score ~ STR,
data = MASchools)
TSMA_mod2 <- lm(score ~ STR + english + lunch + log(income),
data = MASchools)
TSMA_mod3 <- lm(score ~ STR + english + lunch + income + I(income^2)
+ I(income^3),
data = MASchools)
TSMA_mod4 <- lm(score ~ STR + I(STR^2) + I(STR^3) + english + lunch + income
+ I(income^2) + I(income^3),
data = MASchools)
TSMA_mod5 <- lm(score ~ STR + HiEL + I(income^2) + I(income^3) + HiEL:STR
+ lunch + income,
data = MASchools)
TSMA_mod6 <- lm(score ~ STR + I(income^2) + I(income^3)+ lunch
+ income,
data = MASchools)
# gather robust standard errors
rob_se <- list(sqrt(diag(vcovHC(TSMA_mod1, type = "HC1"))),
sqrt(diag(vcovHC(TSMA_mod2, type = "HC1"))),
sqrt(diag(vcovHC(TSMA_mod3, type = "HC1"))),
sqrt(diag(vcovHC(TSMA_mod4, type = "HC1"))),
sqrt(diag(vcovHC(TSMA_mod5, type = "HC1"))),
sqrt(diag(vcovHC(TSMA_mod6, type = "HC1"))))
# generate a table with 'stargazer()'
library(stargazer)
stargazer(TSMA_mod1, TSMA_mod2, TSMA_mod3,
TSMA_mod4, TSMA_mod5, TSMA_mod6,
title = "Regressions Using Massachusetts Test Score Data",
type = "latex",
digits = 3,
header = FALSE,
se = rob_se,
object.names = TRUE,
model.numbers = FALSE,
column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)", "(VI)"))
```
<!--html_preserve-->
```{r, results='asis', echo=F, cache=T, message=FALSE, warning=FALSE, eval=my_output == "html"}
library(TTR)
library(stargazer)
MASchools$HiEL <- as.numeric(MASchools$english > median(MASchools$english))
TSMA_mod1 <- lm(score ~ STR,
data = MASchools)
TSMA_mod2 <- lm(score ~ STR + english + lunch + log(income),
data = MASchools)
TSMA_mod3 <- lm(score ~ STR + english + lunch + income
+ I(income^2) + I(income^3),
data = MASchools)
TSMA_mod4 <- lm(score ~ STR + I(STR^2) + I(STR^3) + english
+lunch + income + I(income^2) + I(income^3),
data = MASchools)
TSMA_mod5 <- lm(score ~ STR + HiEL + HiEL:STR + lunch + income
+ I(income^2) + I(income^3),
data = MASchools)
TSMA_mod6 <- lm(score ~ STR + lunch + income + I(income^2)
+ I(income^3),
data = MASchools)
rob_se <- list(
sqrt(diag(vcovHC(TSMA_mod1, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod2, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod3, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod4, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod5, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod6, type="HC1")))
)
stargazer(TSMA_mod1, TSMA_mod2, TSMA_mod3,
TSMA_mod4, TSMA_mod5, TSMA_mod6,
se = rob_se,
type = "html",
header = FALSE,
model.numbers = FALSE,
dep.var.caption = "Dependent Variable: Score",
column.sep.width = "1pt",
column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)", "(VI)")
)
stargazer_html_title("Regressions Using Massachusetts Test Score Data", "rumtsd")
```
<!--/html_preserve-->
```{r, results='asis', echo=F, cache=T, message=FALSE, warning=FALSE, eval=my_output == "latex"}
library(stargazer)
MASchools$HiEL <- as.numeric(MASchools$english > median(MASchools$english))
TSMA_mod1 <- lm(score ~ STR,
data = MASchools)
TSMA_mod2 <- lm(score ~ STR + english + lunch + log(income),
data = MASchools)
TSMA_mod3 <- lm(score ~ STR + english + lunch + income
+ I(income^2) + I(income^3),
data = MASchools)
TSMA_mod4 <- lm(score ~ STR + I(STR^2) + I(STR^3) + english
+ lunch + income + I(income^2)+ I(income^3),
data = MASchools)
TSMA_mod5 <- lm(score ~ STR + HiEL + HiEL:STR + lunch
+ income + I(income^2) + I(income^3),
data = MASchools)
TSMA_mod6 <- lm(score ~ STR + lunch + income + I(income^2)
+ I(income^3),
data = MASchools)
rob_se <- list(
sqrt(diag(vcovHC(TSMA_mod1, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod2, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod3, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod4, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod5, type="HC1"))),
sqrt(diag(vcovHC(TSMA_mod6, type="HC1")))
)
stargazer(TSMA_mod1, TSMA_mod2, TSMA_mod3,
TSMA_mod4, TSMA_mod5, TSMA_mod6,
digits = 3,
title = "\\label{tab:rumtsd} Regressions Using Massachusetts Test Score Data",
type = "latex",
float.env = "sidewaystable",
column.sep.width = "-7pt",
se = rob_se,
omit.stat = "f",
model.numbers = FALSE,
column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)", "(VI)")
)
```
Next we reproduce the $F$-statistics and $p$-values for testing exclusion of groups of variables.
```{r}
# F-test model (3)
linearHypothesis(TSMA_mod3,
c("I(income^2)=0", "I(income^3)=0"),
vcov. = vcovHC(TSMA_mod3, type = "HC1"))
# F-tests model (4)
linearHypothesis(TSMA_mod4,
c("STR=0", "I(STR^2)=0", "I(STR^3)=0"),
vcov. = vcovHC(TSMA_mod4, type = "HC1"))
linearHypothesis(TSMA_mod4,
c("I(STR^2)=0", "I(STR^3)=0"),
vcov. = vcovHC(TSMA_mod4, type = "HC1"))
linearHypothesis(TSMA_mod4,
c("I(income^2)=0", "I(income^3)=0"),
vcov. = vcovHC(TSMA_mod4, type = "HC1"))
#F-tests model (5)
linearHypothesis(TSMA_mod5,
c("STR=0", "STR:HiEL=0"),
vcov. = vcovHC(TSMA_mod5, type = "HC1"))
linearHypothesis(TSMA_mod5,
c("I(income^2)=0", "I(income^3)=0"),
vcov. = vcovHC(TSMA_mod5, type = "HC1"))
linearHypothesis(TSMA_mod5,
c("HiEL=0", "STR:HiEL=0"),
vcov. = vcovHC(TSMA_mod5, type = "HC1"))
# F-tests model (6)
linearHypothesis(TSMA_mod6,
c("I(income^2)=0", "I(income^3)=0"),
vcov. = vcovHC(TSMA_mod6, type = "HC1"))
```
We see that, in terms of $\bar{R}^2$, specification (3) which uses a cubic to model the relationship between district income and test scores indeed performs better than the linear-log specification (2). Using different $F$-tests on models (4) and (5), we cannot reject the hypothesis that there is no nonlinear relationship between student teacher ratio and test score and also that the share of English learners has an influence on the relationship of interest. Furthermore, regression (6) shows that the percentage of English learners can be omitted as a regressor. Because of the model specifications made in (4) to (6) do not lead to substantially different results than those of regression (3), we choose model (3) as the most suitable specification.
In comparison to the California data, we observe the following results:
1. Controlling for the students' background characteristics in model specification (2) reduces the coefficient of interest (student-teacher ratio) by roughly $60\%$. The estimated coefficients are close to each other.
2. The coefficient on student-teacher ratio is always significantly different from zero at the level of $1\%$ for both data sets. This holds for all considered model specifications in both studies.
3. In both studies, the share of English learners in a school district is of little importance for the estimated impact of a change in the student-teacher ratio on test score.
The biggest difference is that, in contrast to the California results, we do not find evidence of a nonlinear relationship between test scores and the student teacher ratio for the Massachusetts data since the corresponding $F$-tests for model (4) do not reject.
As pointed out in the book, the test scores for California and Massachusetts have different units because the underlying tests are different. Thus the estimated coefficients on the student-teacher ratio in both regressions cannot be compared before standardizing test scores to the same units as $$\frac{Testscore - \overline{TestScore}}{\sigma_{TestScore}}$$ for all observations in both data sets and running the regressions of interest using the standardized data again. One can show that the coefficient on student-teacher ratio in the regression using standardized test scores is the coefficient of the original regression divided by the standard deviation of test scores.
For model (3) of the Massachusetts data, the estimated coefficient on the student-teacher ratio is $-0.64$. A reduction of the student-teacher ratio by two students is predicted to increase test scores by $-2 \cdot (-0.64) = 1.28$ points. Thus we can compute the effect of a reduction of student-teacher ratio by two students on the standardized test scores as follows:
```{r}
TSMA_mod3$coefficients[2] / sd(MASchools$score) * (-2)
```
For Massachusetts, the predicted increase of test scores due to a reduction of the student-teacher ratio by two students is $0.085$ standard deviations of the distribution of the observed distribution of test scores.
Using the linear specification (2) for California, the estimated coefficient on the student-teacher ratio is $-0.73$ so the predicted increase of test scores induced by a reduction of the student-teacher ratio by two students is $-0.73 \cdot (-2) = 1.46$. We use `r ttcode("R")` to compute the predicted change in standard deviation units:
```{r}
TestScore_mod2$coefficients[2] / sd(CASchools$score) * (-2)