-
Notifications
You must be signed in to change notification settings - Fork 272
/
Copy path04-ch4.Rmd
1291 lines (869 loc) · 65.1 KB
/
04-ch4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Linear Regression with One Regressor {#lrwor}
This chapter introduces the basics in linear regression and shows how to perform regression analysis in `r ttcode("R")`. In linear regression, the aim is to model the relationship between a dependent variable $Y$ and one or more explanatory variables denoted by $X_1, X_2, \dots, X_k$. Following the book we will focus on the concept of simple linear regression throughout the whole chapter. In simple linear regression, there is just one explanatory variable $X_1$. <br>
If, for example, a school cuts its class sizes by hiring new teachers, that is, the school lowers $X_1$, the student-teacher ratios of its classes, how would this affect $Y$, the performance of the students involved in a standardized test? With linear regression we can not only examine whether the student-teacher ratio *does have* an impact on the test results but we can also learn about the *direction* and the *strength* of this effect.
The following packages are needed for reproducing the code presented in this chapter:
+ `r ttcode("AER")` - accompanies the Book *Applied Econometrics with R* @kleiber2008 and provides useful functions and data sets.
+ `r ttcode("MASS")` - a collection of functions for applied statistics.
Make sure these are installed before you go ahead and try to replicate the examples. The safest way to do so is by checking whether the following code chunk executes without any errors.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
library(AER)
library(MASS)
```
## Simple Linear Regression
To start with an easy example, consider the following combinations of average test score and the average student-teacher ratio in some fictional school districts.
```{r kable, echo=F, purl=F,, eval=my_output=="html"}
library(knitr)
library(kableExtra)
frame <- data.frame(
TestScore = c(680, 640, 670, 660, 630, 660, 635),
STR = c(15, 17, 19, 20, 22, 23.5, 25)
)
rownames(frame) <- 1:7
t(frame) %>% kable("latex", booktabs = T) %>%
kable_styling(latex_options = "striped")
```
```{r, echo=F, purl=F,, eval=my_output=="latex"}
library(knitr)
library(kableExtra)
frame <- data.frame(
TestScore = c(680, 640, 670, 660, 630, 660, 635),
STR = c(15, 17, 19, 20, 22, 23.5, 25)
)
rownames(frame) <- 1:7
kable(t(frame), "latex", booktabs = T) %>% kable_styling(position = "center")
```
To work with these data in `r ttcode("R")` we begin by generating two vectors: one for the student-teacher ratios (`r ttcode("STR")`) and one for test scores (`r ttcode("TestScore")`), both containing the data from the table above.
```{r}
# Create sample data
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
TestScore <- c(680, 640, 670, 660, 630, 660, 635)
# Print out sample data
STR
TestScore
```
To build simple linear regression model, we hypothesize that the relationship between dependent and independent variable is linear, formally: $$ Y = b \cdot X + a. $$ For now, let us suppose that the function which relates test score and student-teacher ratio
to each other is $$TestScore = 713 - 3 \times STR.$$
It is always a good idea to visualize the data you work with. Here, it is suitable to use `r ttcode("plot()")` to produce a scatterplot with `r ttcode("STR")` on the $x$-axis and `r ttcode("TestScore")` on the $y$-axis. Just call `plot(y_variable ~ x_variable)` whereby `r ttcode("y_variable")` and `r ttcode("x_variable")` are placeholders for the vectors of observations we want to plot. Furthermore, we might want to add a systematic relationship to the plot. To draw a straight line, `r ttcode("R")` provides the function `r ttcode("abline()")`. We just have to call this function with arguments `r ttcode("a")` (representing the intercept)
and `r ttcode("b")` (representing the slope) after executing `r ttcode("plot()")` in order to add the line to our plot.
The following code reproduces Figure 4.1 from the textbook.
```{r , echo=TRUE, fig.align='center', cache=TRUE}
# create a scatterplot of the data
plot(TestScore ~ STR,ylab="Test Score",pch=20)
# add the systematic relationship to the plot
abline(a = 713, b = -3)
```
We find that the line does not touch any of the points although we claimed that it represents the systematic relationship. The reason for this is randomness. Most of the time there are additional influences which imply that there is no bivariate relationship between the two variables.
In order to account for these differences between observed data and the systematic relationship, we extend our model from above by an *error term* $u$ which captures additional random effects. Put differently, $u$ accounts for all the differences between the regression line and the actual observed data. Beside pure randomness, these deviations could also arise from measurement errors or, as will be discussed later, could be the consequence of leaving out other factors that are relevant in explaining the dependent variable.
Which other factors are plausible in our example? For one thing, the test scores might be driven by the teachers' quality and the background of the students. It is also possible that in some classes, the students were lucky on the test days and thus achieved higher scores. For now, we will summarize such influences by an additive component:
$$ TestScore = \beta_0 + \beta_1 \times STR + \text{other factors}. $$
Of course this idea is very general as it can be easily extended to other situations that can be described with a linear model. Hence, the basic linear regression model we will work with is
$$ Y_i = \beta_0 + \beta_1 X_i + u_i. $$
Key Concept 4.1 summarizes the linear regression model and its terminology.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC4.1">
<h3 class = "right"> Key Concept 4.1 </h3>
<h3 class = "left"> Terminology for the Linear Regression Model with a Single Regressor </h3>
<p> The linear regression model is
$$Y_i = \\beta_0 + \\beta_1 X_i + u_i,$$
where
- the index $i$ runs over the observations, $i=1,\\dots,n$;
- $Y_i$ is the *dependent variable*, the *regressand*, or simply the *left-hand variable*;
- $X_i$ is the *independent variable*, the *regressor*, or simply the *right-hand variable*;
- $Y = \\beta_0 + \\beta_1 X$ is the *population regression line* also called the *population regression function*;
- $\\beta_0$ is the *intercept* of the population regression line;
- $\\beta_1$ is the *slope* of the population regression line;
- $u_i$ is the *error term*.
</p>
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Terminology for the Linear Regression Model with a Single Regressor]{4.1}
The linear regression model is $$Y_i = \\beta_0 + \\beta_1 X_1 + u_i,$$
where
\\begin{itemize}
\\item the index $i$ runs over the observations, $i=1,\\dots,n$;
\\item $Y_i$ is the \\textit{dependent variable}, the \\textit{regressand}, or simply the \\textit{left-hand variable};
\\item $X_i$ is the \\textit{independent variable}, the \\textit{regressor}, or simply the \\textit{right-hand variable};
\\item $Y = \\beta_0 + \\beta_1 X$ is the \\textit{population regression line} also called the \\textit{population regression function};
\\item $\\beta_0$ is the \\textit{intercept} of the population regression line;
\\item $\\beta_1$ is the \\textit{slope} of the population regression line;
\\item $u_i$ is the \\textit{error term}.
\\end{itemize}
\\end{keyconcepts}
')
```
## Estimating the Coefficients of the Linear Regression Model
In practice, the intercept $\beta_0$ and slope $\beta_1$ of the population regression line are unknown. Therefore, we must employ data to estimate both unknown parameters. In the following, a real world example will be used to demonstrate how this is achieved. We want to relate test scores to student-teacher ratios measured in Californian schools. The test score is the district-wide average of reading and math scores for fifth graders. Again, the class size is measured as the number of students divided by the number of teachers (the student-teacher ratio). As for the data, the California School data set (`r ttcode("CASchools")`) comes with an `r ttcode("R")` package called `r ttcode("AER")`, an acronym for [Applied Econometrics with R](https://cran.r-project.org/web/packages/AER/AER.pdf) [@R-AER]. After installing the package with `r ttcode('install.packages("AER")')` and attaching it with `r ttcode('library(AER)')` the data set can be loaded using the function `r ttcode("data()")`.
```{r, message=FALSE, warning=FALSE, eval=5:8}
# install the AER package (once)
install.packages("AER")
# load the AER package
library(AER)
# load the the data set in the workspace
data(CASchools)
```
Once a package has been installed it is available for use at further occasions when invoked with `r ttcode("library()")` --- there is no need to run `r ttcode("install.packages()")` again!
It is interesting to know what kind of object we are dealing with.
`r ttcode("class()")` returns the class of an object. Depending on the class of an object some functions (for example `r ttcode("plot()")` and `r ttcode("summary()")`) behave differently.
Let us check the class of the object `r ttcode("CASchools")`.
```{r}
class(CASchools)
```
It turns out that `r ttcode("CASchools")` is of class `r ttcode("data.frame")` which is a convenient format to work with, especially for performing regression analysis.
With help of `r ttcode("head()")` we get a first overview of our data. This function shows only the first 6 rows of the data set which prevents an overcrowded console output.
```{block2, note-text, type='rmdnote'}
Press <tt>ctrl + L</tt> to clear the console. This command deletes any code that has been typed in and executed by you or printed to the console by <tt>R</tt> functions. The good news is that anything else is left untouched. You neither lose defined variables etc. nor the code history. It is still possible to recall previously executed <tt>R</tt> commands using the up and down keys. If you are working in *RStudio*, press <tt>ctrl + Up</tt> on your keyboard (<tt>CMD + Up</tt> on a Mac) to review a list of previously entered commands.
```
```{r}
head(CASchools)
```
We find that the data set consists of plenty of variables and that most of them are numeric.
By the way: an alternative to `r ttcode("class()")` and `r ttcode("head()")` is `r ttcode("str()")` which is deduced from 'structure' and gives a comprehensive overview of the object. Try!
Turning back to `r ttcode("CASchools")`, the two variables we are interested in (i.e., average test score and the student-teacher ratio) are *not* included. However, it is possible to calculate both from the provided data. To obtain the student-teacher ratios, we simply divide the number of students by the number of teachers. The average test score is the arithmetic mean of the test score for reading and the score of the math test. The next code chunk shows how the two variables can be constructed as vectors and how they are appended to `r ttcode("CASchools")`.
```{r}
# compute STR and append it to CASchools
CASchools$STR <- CASchools$students/CASchools$teachers
# compute TestScore and append it to CASchools
CASchools$score <- (CASchools$read + CASchools$math)/2
```
If we ran `r ttcode("head(CASchools)")` again we would find the two variables of interest as additional columns named `r ttcode("STR")` and `r ttcode("score")` (check this!).
Table 4.1 from the textbook summarizes the distribution of test scores and student-teacher ratios. There are several functions which can be used to produce similar results, e.g.,
- `r ttcode("mean()")` (computes the arithmetic mean of the provided numbers),
- `r ttcode("sd()")` (computes the sample standard deviation),
- `r ttcode("quantile()")` (returns a vector of the specified sample quantiles for the data).
The next code chunk shows how to achieve this. First, we compute summary statistics on the columns `r ttcode("STR")` and `r ttcode("score")` of `r ttcode("CASchools")`. In order to get nice output we gather the measures in a `r ttcode("data.frame")` named `r ttcode("DistributionSummary")`.
```{r Table 4.1, results='hold'}
# compute sample averages of STR and score
avg_STR <- mean(CASchools$STR)
avg_score <- mean(CASchools$score)
# compute sample standard deviations of STR and score
sd_STR <- sd(CASchools$STR)
sd_score <- sd(CASchools$score)
# set up a vector of percentiles and compute the quantiles
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)
# gather everything in a data.frame
DistributionSummary <- data.frame(Average = c(avg_STR, avg_score),
StandardDeviation = c(sd_STR, sd_score),
quantile = rbind(quant_STR, quant_score))
# print the summary to the console
DistributionSummary
```
As for the sample data, we use `r ttcode("plot()")`. This allows us to detect characteristics of our data, such as outliers which are harder to discover by looking at mere numbers. This time we add some additional arguments to the call of `r ttcode("plot()")`.
The first argument in our call of `r ttcode("plot()")`, `r ttcode("score ~ STR")`, is again a formula that states variables on the y- and the x-axis. However, this time the two variables are not saved in separate vectors but are columns of `r ttcode("CASchools")`. Therefore, `r ttcode("R")` would not find them without the argument `r ttcode("data")` being correctly specified. `r ttcode("data")` must be in accordance with the name of the `r ttcode("data.frame")` to which the variables belong to, in this case `r ttcode("CASchools")`. Further arguments are used to change the appearance of the plot: `r ttcode("main")` adds a title, `r ttcode("xlab")` and `r ttcode("ylab")` add custom labels to both axes.
```{r, fig.align='center'}
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of Test Score and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)")
```
The plot (Figure 4.2 in the book) shows the scatterplot of all observations on the student-teacher ratio and test score. We see that the points are strongly scattered, and that the variables are negatively correlated. That is, we expect to observe lower test scores in bigger classes.
The function `r ttcode("cor()")` (see `r ttcode("?cor")` for further info) can be used to compute the correlation between two *numeric* vectors.
```{r}
cor(CASchools$STR, CASchools$score)
```
As the scatterplot already suggests, the correlation is negative but rather weak.
The task we are currently facing is to find a line that best fits the data. We could opt for graphical inspection and correlation analysis and then select the best fitting line by eyeballing. However, this would be rather subjective: different observers would draw different regression lines. On this account, we are interested in techniques that are less arbitrary. Such a technique is given by ordinary least squares (OLS) estimation.
### The Ordinary Least Squares Estimator {-}
The OLS estimator chooses the regression coefficients such that the estimated regression line is as "close" as possible to the observed data points. Here, closeness is measured by the sum of the squared mistakes made in predicting $Y$ given $X$. Let $b_0$ and $b_1$ be some estimators of $\beta_0$ and $\beta_1$. Then the sum of squared estimation mistakes can be expressed as
$$ \sum^n_{i = 1} (Y_i - b_0 - b_1 X_i)^2. $$
The OLS estimator in the simple regression model is the pair of estimators for intercept and slope that minimizes the expression above. The derivation of the OLS estimators for both parameters are presented in Appendix 4.1 of the book. The results are summarized in Key Concept 4.2.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC4.2">
<h3 class = "right"> Key Concept 4.2 </h3>
<h3 class = "left"> The OLS Estimator, Predicted Values, and Residuals </h3>
<p> The OLS estimators of the slope $\\beta_1$ and the intercept $\\beta_0$ in the simple linear regression model are
\\begin{align}
\\hat\\beta_1 & = \\frac{ \\sum_{i = 1}^n (X_i - \\overline{X})(Y_i - \\overline{Y}) } { \\sum_{i=1}^n (X_i - \\overline{X})^2}, \\\\
\\\\
\\hat\\beta_0 & = \\overline{Y} - \\hat\\beta_1 \\overline{X}.
\\end{align}
The OLS predicted values $\\widehat{Y}_i$ and residuals $\\hat{u}_i$ are
\\begin{align}
\\widehat{Y}_i & = \\hat\\beta_0 + \\hat\\beta_1 X_i,\\\\
\\\\
\\hat{u}_i & = Y_i - \\widehat{Y}_i.
\\end{align}
The estimated intercept $\\hat{\\beta}_0$, the slope parameter $\\hat{\\beta}_1$ and the residuals $\\left(\\hat{u}_i\\right)$ are computed from a sample of $n$ observations of $X_i$ and $Y_i$,$i$=$1$, $...$,$n$. These are *estimates* of the unknown population intercept $\\left(\\beta_0 \\right)$, slope $\\left(\\beta_1\\right)$, and error term $(u_i)$.
</p>
The formulas presented above may not be very intuitive at first glance. The following interactive application aims to help you understand the mechanics of OLS. You can add observations by clicking into the coordinate system where the data are represented by points. Once two or more observations are available, the application computes a regression line using OLS and some statistics which are displayed in the right panel. The results are updated as you add further observations to the left panel. A double-click resets the application, i.e., all data are removed.
<iframe height="410" width="900" frameborder="0" scrolling="no" src="SimpleRegression.html"></iframe>
</div>')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The OLS Estimator, Predicted Values, and Residuals]{4.2}
The OLS estimators of the slope $\\beta_1$ and the intercept $\\beta_0$ in the simple linear regression model are
\\begin{align*}
\\hat\\beta_1 & = \\frac{ \\sum_{i = 1}^n (X_i - \\overline{X})(Y_i - \\overline{Y}) } { \\sum_{i=1}^n (X_i - \\overline{X})^2}, \\\\
\\hat\\beta_0 & = \\overline{Y} - \\hat\\beta_1 \\overline{X}.
\\intertext{The OLS predicted values $\\widehat{Y}_i$ and residuals $\\hat{u}_i$ are}
\\widehat{Y}_i & = \\hat\\beta_0 + \\hat\\beta_1 X_i,\\\\
\\hat{u}_i & = Y_i - \\widehat{Y}_i.
\\end{align*}
The estimated intercept $\\hat{\\beta}_0$, the slope parameter $\\hat{\\beta}_1$ and the residuals $\\left(\\hat{u}_i\\right)$ are computed from a sample of $n$ observations of $X_i$ and $Y_i$, $i$=$1$,$...$,$n$. These are \\textit{estimates} of the unknown true population intercept $\\left(\\beta_0 \\right)$, slope $\\left(\\beta_1\\right)$, and error term $(u_i)$.
\\end{keyconcepts}
')
```
There are many possible ways to compute $\hat{\beta}_0$ and $\hat{\beta}_1$ in `r ttcode("R")`. For example, we could implement the formulas presented in Key Concept 4.2 with two of `r ttcode("R")`'s most basic functions: `r ttcode("mean()")` and `r ttcode("sum()")`. Before doing so we *attach* the `r ttcode("CASchools")` dataset.
```{r, echo=-1}
rm(STR)
attach(CASchools) # allows to use the variables contained in CASchools directly
# compute beta_1_hat
beta_1 <- sum((STR - mean(STR)) * (score - mean(score))) / sum((STR - mean(STR))^2)
# compute beta_0_hat
beta_0 <- mean(score) - beta_1 * mean(STR)
# print the results to the console
beta_1
beta_0
```
```{block2, attach, type='rmdknit'}
Calling <tt>attach(CASchools)</tt> enables us to address a variable contained in <tt>CASchools</tt> by its name: it is no longer necessary to use the <tt>$</tt> operator in conjunction with the dataset: <tt>R</tt> may evaluate the variable name directly.
<tt>R</tt> uses the object in the user environment if this object shares the name of variable contained in an attached database. However, it is a better practice to always use distinctive names in order to avoid such (seeming) ambivalences!
```
<br>
**Notice that we address variables contained in the attached dataset <tt>CASchools</tt> directly for the rest of this chapter!**
Of course, there are even more manual ways to perform these tasks. With OLS being one of the most widely-used estimation techniques, `r ttcode("R")` of course already contains a built-in function named `r ttcode("lm()")` (**l**inear **m**odel) which can be used to carry out regression analysis.
The first argument of the function to be specified is, similar to `r ttcode("plot()")`, the regression formula with the basic syntax `r ttcode("y ~ x")` where `r ttcode("y")` is the dependent variable and `r ttcode("x")` the explanatory variable. The argument `r ttcode("data")` determines the data set to be used in the regression. We now revisit the example from the book where the relationship between the test scores and the class sizes is analyzed. The following code uses `r ttcode("lm()")` to replicate the results presented in figure 4.3 of the book.
```{r}
# estimate the model and assign the result to linear_model
linear_model <- lm(score ~ STR, data = CASchools)
# print the standard output of the estimated lm object to the console
linear_model
```
Let us add the estimated regression line to the plot. This time we also enlarge the ranges of both axes by setting the arguments `r ttcode("xlim")` and `r ttcode("ylim")`.
```{r, fig.align='center'}
# plot the data
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of Test Score and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)",
xlim = c(10, 30),
ylim = c(600, 720))
# add the regression line
abline(linear_model)
```
Did you notice that this time, we did not pass the intercept and slope parameters to `r ttcode("abline")`? If you call `r ttcode("abline()")` on an object of class `r ttcode("lm")` which only contains a single regressor, `r ttcode("R")` draws the regression line automatically!
## Measures of Fit
After fitting a linear regression model, a natural question is how well the model describes the data. Visually, this amounts to assessing whether the observations are tightly clustered around the regression line. Both the *coefficient of determination* and the *standard error of the regression* measure how well the OLS Regression line fits the data.
### The Coefficient of Determination {-}
$R^2$, the *coefficient of determination*, is the fraction of the sample variance of $Y_i$ that is explained by $X_i$. Mathematically, the $R^2$ can be written as the ratio of the explained sum of squares to the total sum of squares. The *explained sum of squares* ($ESS$) is the sum of squared deviations of the predicted values $\hat{Y_i}$, from the average of the $Y_i$. The *total sum of squares* ($TSS$) is the sum of squared deviations of the $Y_i$ from their average. Thus we have
\begin{align}
ESS & = \sum_{i = 1}^n \left( \hat{Y_i} - \overline{Y} \right)^2, \\
TSS & = \sum_{i = 1}^n \left( Y_i - \overline{Y} \right)^2, \\
R^2 & = \frac{ESS}{TSS}.
\end{align}
Since $TSS = ESS + SSR$ we can also write
$$ R^2 = 1- \frac{SSR}{TSS}, $$
where $SSR$ is the sum of squared residuals, a measure for the errors made when predicting $Y$ by $X$. The $SSR$ is defined as
$$ SSR = \sum_{i=1}^n \hat{u}_i^2. $$
$R^2$ lies between $0$ and $1$. It is easy to see that a perfect fit, i.e., no errors made when fitting the regression line, implies $R^2 = 1$ then we have $SSR=0$. On the contrary, if our estimated regression line does not explain any variation in the $Y_i$, we have $ESS=0$ and consequently $R^2=0$.
### The Standard Error of the Regression {-}
The *Standard Error of the Regression* ($SER$) is an estimator of the standard deviation of the residuals $\hat{u}_i$. As such it measures the magnitude of a typical deviation from the regression line, i.e., the magnitude of a typical residual.
$$ SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2} \ \ \ \text{where} \ \ \ s_{\hat{u} }^2 = \frac{1}{n-2} \sum_{i = 1}^n \hat{u}^2_i = \frac{SSR}{n - 2} $$
Remember that the $u_i$ are *unobserved*. This is why we use their estimated counterparts, the residuals $\hat{u}_i$, instead. See Chapter 4.3 of the book for a more detailed comment on the $SER$.
### Application to the Test Score Data {-}
Both measures of fit can be obtained by using the function `r ttcode("summary()")` with an `r ttcode("lm")` object provided as the only argument. While the function `r ttcode("lm()")` only prints out the estimated coefficients to the console, `r ttcode("summary()")` provides additional predefined information such as the regression's $R^2$ and the $SER$.
```{r}
mod_summary <- summary(linear_model)
mod_summary
```
The $R^2$ in the output is called *Multiple R-squared* and has a value of $0.051$. Hence, $5.1 \%$ of the variance of the dependent variable $score$ is explained by the explanatory variable $STR$. That is, the regression explains little of the variance in $score$, and much of the variation in test scores remains unexplained (cf. Figure 4.3 of the book).
The $SER$ is called *Residual standard error* and equals $18.58$. The unit of the $SER$ is the same as the unit of the dependent variable. That is, on average the deviation of the actual achieved test score and the regression line is $18.58$ points.
Now, let us check whether `r ttcode("summary()")` uses the same definitions for $R^2$ and $SER$ as we do when computing them manually.
```{r}
# compute R^2 manually
SSR <- sum(mod_summary$residuals^2)
TSS <- sum((score - mean(score))^2)
R2 <- 1 - SSR/TSS
# print the value to the console
R2
# compute SER manually
n <- nrow(CASchools)
SER <- sqrt(SSR / (n-2))
# print the value to the console
SER
```
We find that the results coincide. Note that the values provided by `r ttcode("summary()")` are rounded to two decimal places.
## The Least Squares Assumptions {#tlsa}
OLS performs well under a quite broad variety of different circumstances. However, there are some assumptions which need to be satisfied in order to ensure that the estimates are normally distributed in large samples (we discuss this in Chapter \@ref(tsdotoe)).
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC4.3">
<h3 class = "right"> Key Concept 4.3 </h3>
<h3 class = "left"> The Least Squares Assumptions </h3>
<p>
$$Y_i = \\beta_0 + \\beta_1 X_i + u_i \\text{, } i = 1,\\dots,n$$
where
1. The error term $u_i$ has conditional mean zero given $X_i$: $E(u_i|X_i) = 0$.
2. $(X_i,Y_i), i = 1,\\dots,n$ are independent and identically distributed (i.i.d.) draws from their joint distribution.
3. Large outliers are unlikely: $X_i$ and $Y_i$ have nonzero finite fourth moments.
</p>
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Least Squares Assumptions]{4.3}
$$Y_i = \\beta_0 + \\beta_1 X_i + u_i \\text{, } i = 1,\\dots,n$$
where
\\begin{enumerate}
\\item The error term $u_i$ has conditional mean zero given $X_i$: $E(u_i|X_i) = 0$.
\\item $(X_i,Y_i), i = 1,\\dots,n$ are independent and identically distributed (i.i.d.) draws from their joint distribution.
\\item Large outliers are unlikely: $X_i$ and $Y_i$ have nonzero finite fourth moments.
\\end{enumerate}
\\end{keyconcepts}
')
```
### Assumption 1: The Error Term has Conditional Mean of Zero {-}
This means that no matter which value we choose for $X$, the error term $u$ must not show any systematic pattern and must have a mean of $0$.
Consider the case that, unconditionally, $E(u) = 0$, but for low and high values of $X$, the error term tends to be positive and for midrange values of
$X$ the error tends to be negative. We can use R to construct such an example. To do so we generate our own data using `r ttcode("R")`'s built-in random number generators.
We will use the following functions:
* `r ttcode("runif()")` - generates uniformly distributed random numbers.
* `r ttcode("rnorm()")` - generates normally distributed random numbers.
* `r ttcode("predict()")` - does predictions based on the results of model fitting functions like `r ttcode("lm()")`.
* `r ttcode("lines()")` - adds line segments to an existing plot.
We start by creating a vector containing values that are uniformly distributed on the interval $[-5,5]$. This can be done with the function `r ttcode("runif()")`. We also need to simulate the error term. For this we generate normally distributed random numbers with a mean of $0$ and a variance of $1$ using `r ttcode("rnorm()")`. The $Y$ values are obtained as a quadratic function of the $X$ values and the error.
After generating the data we estimate both a simple regression model and a quadratic model that also includes the regressor $X^2$ (this is a multiple regression model, see Chapter \@ref(rmwmr)). Finally, we plot the simulated data and add the estimated regression line of a simple regression model as well as the predictions made with a quadratic model to compare the fit graphically.
```{r, fig.align='center'}
# set a seed to make the results reproducible
set.seed(321)
# simulate the data
X <- runif(50, min = -5, max = 5)
u <- rnorm(50, sd = 1)
# the true relation
Y <- X^2 + 2 * X + u
# estimate a simple regression model
mod_simple <- lm(Y ~ X)
# estimate a quadratic regression model
mod_quadratic <- lm( Y ~ X + I(X^2))
# predict using a quadratic model
prediction <- predict(mod_quadratic, data.frame(X = sort(X)))
# plot the results
plot( Y ~ X, col = "black", pch = 20, xlab = "X", ylab = "Y")
abline( mod_simple, col = "blue",lwd=2)
#red line = incorrect linear regression (this violates the first OLS assumption)
lines( sort(X), prediction,col="red",lwd=2)
legend("topleft",
legend = c("Simple Regression Model",
"Quadratic Model"),
cex = 1,
lty = 1,
col = c("blue","red"))
```
The plot above shows what is meant by $E(u_i|X_i) = 0$ and why it does not hold for the linear model:
Using the quadratic model (represented by the black curve) we see that there are no systematic deviations of the observation from the predicted relation. It is credible that the assumption is not violated when such a model is employed. However, using a simple linear regression model we see that the assumption is probably violated as $E(u_i|X_i)$ varies with the $X_i$.
### Assumption 2: Independently and Identically Distributed Data {-}
Most sampling schemes used when collecting data from populations produce i.i.d.-samples. For example, we can use `r ttcode("R")`'s random number generator to randomly select student IDs from a university's enrollment list and record age $X$ and earnings $Y$ of the corresponding students. This is a typical example of simple random sampling and ensures that all the $(X_i, Y_i)$ are drawn randomly from the same population.
A prominent example where the i.i.d. assumption is not fulfilled is time series data where we have observations on the same unit over time. For example, take $X$ as the number of workers in a production company over time. Due to business transformations, the company cuts jobs periodically by a specific share but there are also some non-deterministic influences that relate to economics, politics etc. Using `r ttcode("R")` we can easily simulate such a process and plot it.
We start the series with a total of 5000 workers and simulate the reduction of employment with an autoregressive process that exhibits a downward movement in the long-run and has normally distributed errors:^[See Chapter \@ref(ittsraf) for more on autoregressive processes and time series analysis in general.]
$$ employment_t = -50 + 0.98 \cdot employment_{t-1} + u_t. $$
```{r, fig.align="center"}
# set seed
set.seed(123)
# generate a date vector
Date <- seq(as.Date("1951/1/1"), as.Date("2000/1/1"), "years")
# initialize the employment vector
X <- c(5000, rep(NA, length(Date)-1))
# generate time series observations with random influences
for (t in 2:length(Date)) {
X[t] <- -50 + 0.98 * X[t-1] + rnorm(n = 1, sd = 200)
}
#plot the results
plot(x = Date,
y = X,
type = "l",
col = "steelblue",
ylab = "Workers",
xlab = "Time",
lwd=2)
```
It is evident that the observations on the number of employees cannot be independent in this example: the level of today's employment is correlated with tomorrow's employment level. Thus, the i.i.d. assumption is violated.
### Assumption 3: Large Outliers are Unlikely {-}
It is easy to come up with situations where extreme observations, i.e., observations that deviate considerably from the usual range of the data, may occur. Such observations are called outliers. Technically speaking, assumption 3 requires that $X$ and $Y$ have a finite kurtosis.^[See Chapter 4.4 of the book.]
Common cases where we want to exclude or (if possible) correct such outliers is when they are apparently typos, conversion errors or measurement errors. Even if it seems like extreme observations have been recorded correctly, it is advisable to exclude them before estimating a model since OLS suffers from *sensitivity to outliers*.
What does this mean? One can show that extreme observations receive heavy weighting in the estimation of the unknown regression coefficients when using OLS. Therefore, outliers can lead to strongly distorted estimates of regression coefficients. To get a better impression of this issue, consider the following application where we have placed some sample data on $X$ and $Y$ which are highly correlated. The relation between $X$ and $Y$ seems to be explained pretty well by the plotted regression line: all of the white data points lie close to the red regression line and we have $R^2=0.92$.
Now go ahead and add a further observation at, say, $(18.2)$. This observation clearly is an outlier. The result is quite striking: the estimated regression line differs greatly from the one we adjudged to fit the data well. The slope is heavily downward biased and $R^2$ decreased to a mere $29\%$! <br>
Double-click inside the coordinate system to reset the app. Feel free to experiment. Choose different coordinates for the outlier or add additional ones.
<iframe height="410" width="900" frameborder="0" scrolling="no" src="Outlier.html"></iframe>
The following code roughly reproduces what is shown in figure 4.5 in the book. As done above we use sample data generated using `r ttcode("R")`'s random number functions `r ttcode("rnorm()")` and `r ttcode("runif()")`. We estimate two simple regression models, one based on the original data set and another using a modified set where one observation is changed to be an outlier and then plot the results. In order to understand the complete code you should be familiar with the function `r ttcode("sort()")` which sorts the entries of a numeric vector in ascending order.
```{r, fig.align='center'}
# set seed
set.seed(123)
# generate the data
X <- sort(runif(10, min = 30, max = 70))
Y <- rnorm(10 , mean = 200, sd = 50)
Y[9] <- 2000
# fit model with outlier
fit <- lm(Y ~ X)
# fit model without outlier
fitWithoutOutlier <- lm(Y[-9] ~ X[-9])
# plot the results
plot(Y ~ X,pch=20)
abline(fit,lwd=2,col="blue")
abline(fitWithoutOutlier, col = "red",lwd=2)
legend("topleft",
legend = c("Model with Outlier",
"Model without Outlier"),
cex = 1,
lty = 1,
col = c("blue","red"))
```
## The Sampling Distribution of the OLS Estimator {#tsdotoe}
Because $\hat{\beta}_0$ and $\hat{\beta}_1$ are computed from a sample, the estimators themselves are random variables with a probability distribution --- the so-called sampling distribution of the estimators --- which describes the values they could take on over different samples. Although the sampling distribution of $\hat\beta_0$ and $\hat\beta_1$ can be complicated when the sample size is small and generally changes with the number of observations, $n$, it is possible, provided the assumptions discussed in the book are valid, to make certain statements about it that hold for all $n$. In particular
$$ E(\hat{\beta}_0) = \beta_0 \ \ \text{and} \ \ E(\hat{\beta}_1) = \beta_1,$$
that is, $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, the true parameters. If the sample is sufficiently large, by the central limit theorem the *joint* sampling distribution of the estimators is well approximated by the bivariate normal distribution (<a href="#mjx-eqn-2.1">2.1</a>). This implies that the marginal distributions are also normal in large samples. Core facts on the large-sample distributions of $\hat\beta_0$ and $\hat\beta_1$ are presented in Key Concept 4.4.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC4.4">
<h3 class = "right"> Key Concept 4.4 </h3>
<h3 class = "left"> Large Sample Distribution of $\\hat\\beta_0$ and $\\hat\\beta_1$ </h3>
<p> If the least squares assumptions in Key Concept 4.3 hold, then in large samples $\\hat\\beta_0$ and $\\hat\\beta_1$ have a joint normal sampling distribution. The large sample normal distribution of $\\hat\\beta_1$ is $\\mathcal{N}(\\beta_1, \\sigma^2_{\\hat\\beta_1})$, where the variance of the distribution, $\\sigma^2_{\\hat\\beta_1}$, is
\\begin{align}
\\sigma^2_{\\hat\\beta_1} = \\frac{1}{n} \\frac{Var \\left[ \\left(X_i - \\mu_X \\right) u_i \\right]} {\\left[ Var \\left(X_i \\right) \\right]^2}. (\\#eq:olsvar1)
\\end{align}
The large sample normal distribution of $\\hat\\beta_0$ is $\\mathcal{N}(\\beta_0, \\sigma^2_{\\hat\\beta_0})$ with
\\begin{align}
\\sigma^2_{\\hat\\beta_0} = \\frac{1}{n} \\frac{Var \\left( H_i u_i \\right)}{ \\left[ E \\left(H_i^2 \\right) \\right]^2 } \\ , \\ \\text{where} \\ \\ H_i = 1 - \\left[ \\frac{\\mu_X} {E \\left( X_i^2\\right)} \\right] X_i. (\\#eq:olsvar2)
\\end{align}
The interactive simulation below continuously generates random samples $(X_i,Y_i)$ of $200$ observations where $E(Y\\vert X) = 100 + 3X$, estimates a simple regression model, stores the estimate of the slope $\\beta_1$ and visualizes the distribution of the $\\widehat{\\beta}_1$s observed so far using a histogram. The idea here is that for a large number of $\\widehat{\\beta}_1$s, the histogram gives a good approximation of the sampling distribution of the estimator. By decreasing the time between two sampling iterations, it becomes clear that the shape of the histogram approaches the characteristic bell shape of a normal distribution centered at the true slope of $3$.
<iframe height="470" width="700" frameborder="0" scrolling="no" src="SmallSampleDIstReg.html"></iframe>
(Double-click on the histogram to restart the simulation.)
</p>
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Large Sample Distribution of $\\hat\\beta_0$ and $\\hat\\beta_1$]{4.4}
If the least squares assumptions in Key Concept 4.3 hold, then in large samples $\\hat\\beta_0$ and $\\hat\\beta_1$ have a joint normal sampling distribution. The large sample normal distribution of $\\hat\\beta_1$ is $\\mathcal{N}(\\beta_1, \\sigma^2_{\\hat\\beta_1})$, where the variance of the distribution, $\\sigma^2_{\\hat\\beta_1}$, is
\\begin{equation}
\\sigma^2_{\\hat\\beta_1} = \\frac{1}{n} \\frac{Var \\left[ \\left(X_i - \\mu_X \\right) u_i \\right]} {\\left[ Var \\left(X_i \\right) \\right]^2}.
\\end{equation}
The large sample normal distribution of $\\hat\\beta_0$ is $\\mathcal{N}(\\beta_0, \\sigma^2_{\\hat\\beta_0})$ with
\\begin{equation}
\\sigma^2_{\\hat\\beta_0} = \\frac{1}{n} \\frac{Var \\left( H_i u_i \\right)}{ \\left[ E \\left(H_i^2 \\right) \\right]^2 } \\ , \\ \\text{where} \\ \\ H_i = 1 - \\left[ \\frac{\\mu_X} {E \\left( X_i^2\\right)} \\right] X_i.
\\end{equation}
The interactive simulation below continuously generates random samples $(X_i,Y_i)$ of $200$ observations where $E(Y\\vert X) = 100 + 3X$, estimates a simple regression model, stores the estimate of the slope $\\beta_1$ and visualizes the distribution of the $\\widehat{\\beta}_1$s observed so far using a histogram. The idea here is that for a large number of $\\widehat{\\beta}_1$s, the histogram gives a good approximation of the sampling distribution of the estimator. By decreasing the time between two sampling iterations, it becomes clear that the shape of the histogram approaches the characteristic bell shape of a normal distribution centered at the true slope of $3$.\\vspace{0.5cm}
\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}
\\end{keyconcepts}
')
```
### Simulation Study 1 {-}
Whether the statements of Key Concept 4.4 really hold can also be verified using `r ttcode("R")`. For this we first build our own population of $100000$ observations in total. To do this we need values for the independent variable $X$, for the error term $u$, and for the parameters $\beta_0$ and $\beta_1$. With these combined in a simple regression model, we compute the dependent variable $Y$. <br> In our example we generate the numbers $X_i$, $i = 1$, ... ,$100000$ by drawing a random sample from a uniform distribution on the interval $[0,20]$. The realizations of the error terms $u_i$ are drawn from a standard normal distribution with parameters $\mu = 0$ and $\sigma^2 = 100$ (note that `r ttcode("rnorm()")` requires $\sigma$ as input for the argument `r ttcode("sd")`, see `r ttcode("?rnorm")`). Furthermore we chose $\beta_0 = -2$ and $\beta_1 = 3.5$ so the true model is
$$ Y_i = -2 + 3.5 \cdot X_i. $$
Finally, we store the results in a data.frame.
```{r}
# simulate data
N <- 100000
X <- runif(N, min = 0, max = 20)
u <- rnorm(N, sd = 10)
# population regression
Y <- -2 + 3.5 * X + u
population <- data.frame(X, Y)
```
From now on we will consider the previously generated data as the true population (which of course would be *unknown* in a real world application, otherwise there would be no reason to draw a random sample in the first place). The knowledge about the true population and the true relationship between $Y$ and $X$ can be used to verify the statements made in Key Concept 4.4.
First, let us calculate the true variances $\sigma^2_{\hat{\beta}_0}$ and $\sigma^2_{\hat{\beta}_1}$ for a randomly drawn sample of size $n = 100$.
```{r, fig.align='center'}
# set sample size
n <- 100
# compute the variance of beta_hat_0
H_i <- 1 - mean(X) / mean(X^2) * X
var_b0 <- var(H_i * u) / (n * mean(H_i^2)^2 )
# compute the variance of hat_beta_1
var_b1 <- var( ( X - mean(X) ) * u ) / (n * var(X)^2)
```
```{r}
# print variances to the console
var_b0
var_b1
```
Now let us assume that we do not know the true values of $\beta_0$ and $\beta_1$ and that it is not possible to observe the whole population. However, we can observe a random sample of $n$ observations. Then, it would not be possible to compute the true parameters but we could obtain estimates of $\beta_0$ and $\beta_1$ from the sample data using OLS. However, we know that these estimates are outcomes of random variables themselves since the observations are randomly sampled from the population. Key Concept 4.4 describes their distributions for large $n$. When drawing a single sample of size $n$ it is not possible to make any statement about these distributions. Things change if we repeat the sampling scheme many times and compute the estimates for each sample: using this procedure we simulate outcomes of the respective distributions.
To achieve this in R, we employ the following approach:
- We assign the number of repetitions, say $10000$, to `r ttcode("reps")` and then initialize a matrix `r ttcode("fit")` where the estimates obtained in each sampling iteration shall be stored row-wise. Thus `r ttcode("fit")` has to be a matrix of dimensions `r ttcode("reps")`$\times2$.
- In the next step we draw `r ttcode("reps")` random samples of size `r ttcode("n")` from the population and obtain the OLS estimates for each sample. The results are stored as row entries in the outcome matrix `r ttcode("fit")`. This is done using a `r ttcode("for()")` loop.
- At last, we estimate variances of both estimators using the sampled outcomes and plot histograms of the latter. We also add a plot of the density functions belonging to the distributions that follow from Key Concept 4.4. The function `r ttcode("bquote()")` is used to obtain math expressions in the titles and labels of both plots. See `r ttcode("?bquote")`.
```{r, cache=T}
# set repetitions and sample size
n <- 100
reps <- 10000
# initialize the matrix of outcomes
fit <- matrix(ncol = 2, nrow = reps)
# loop sampling and estimation of the coefficients
for (i in 1:reps){
sample <- population[sample(1:N, n), ]
fit[i, ] <- lm(Y ~ X, data = sample)$coefficients
}
# compute variance estimates using outcomes
var(fit[, 1])
var(fit[, 2])
```
```{r, fig.align='center'}
# divide plotting area as 1-by-2 array
par(mfrow = c(1, 2))
# plot histograms of beta_0 estimates
hist(fit[, 1],
cex.main = 0.8,
main = bquote(The ~ Distribution ~ of ~ 10000 ~ beta[0] ~ Estimates),
xlab = bquote(hat(beta)[0]),
freq = F)
# add true distribution to plot
curve(dnorm(x,
-2,
sqrt(var_b0)),
add = T,
col = "darkred",lwd=2)
# plot histograms of beta_hat_1
hist(fit[, 2],
cex.main = 0.8,
main = bquote(The ~ Distribution ~ of ~ 10000 ~ beta[1] ~ Estimates),
xlab = bquote(hat(beta)[1]),
freq = F)
# add true distribution to plot
curve(dnorm(x,
3.5,
sqrt(var_b1)),
add = T,
col = "darkred",lwd=2)
```
Our variance estimates support the statements made in Key Concept 4.4, coming close to the theoretical values. The histograms suggest that the distributions of the estimators can be well approximated by the respective theoretical normal distributions stated in Key Concept 4.4.
### Simulation Study 2 {-}
A further result implied by Key Concept 4.4 is that both estimators are consistent, i.e., they converge in probability to the true parameters we are interested in. This is because they are asymptotically unbiased and their variances converge to $0$ as $n$ increases. We can check this by repeating the simulation above for a sequence of increasing sample sizes. This means we no longer assign the sample size but a *vector* of sample sizes: `r ttcode("n <- c(...)")`. <br>
Let us look at the distributions of $\beta_1$. The idea here is to add an additional call for `r ttcode("for()")` to the code. This is done in order to loop over the vector of sample sizes `r ttcode("n")`. For each of the sample sizes we carry out the same simulation as before but plot a density estimate for the outcomes of each iteration over `r ttcode("n")`. Notice that we have to change `r ttcode("n")` to `r ttcode("n[j]")` in the inner loop to ensure that the `r ttcode("j")`$^{th}$ element of `r ttcode("n")` is used. In the simulation, we use sample sizes of $100, 250, 1000$ and $3000$. Consequently we have a total of four distinct simulations using different sample sizes.
```{r, fig.align='center', cache=T,fig.width=8, fig.height=8}
# set seed for reproducibility
set.seed(1)
# set repetitions and the vector of sample sizes
reps <- 1000
n <- c(100, 250, 1000, 3000)
# initialize the matrix of outcomes
fit <- matrix(ncol = 2, nrow = reps)
# divide the plot panel in a 2-by-2 array
par(mfrow = c(2, 2))
# loop sampling and plotting
# outer loop over n
for (j in 1:length(n)) {
# inner loop: sampling and estimating of the coefficients
for (i in 1:reps){
sample <- population[sample(1:N, n[j]), ]
fit[i, ] <- lm(Y ~ X, data = sample)$coefficients
}
# draw density estimates
plot(density(fit[ ,2]), xlim=c(2.5, 4.5),
col = j,
main = paste("n=", n[j]),
xlab = bquote(hat(beta)[1]))
}
```
We find that, as $n$ increases, the distribution of $\hat\beta_1$ concentrates around its mean, i.e., its variance decreases. Put differently, the likelihood of observing estimates close to the true value of $\beta_1 = 3.5$ grows as we increase the sample size. The same behavior can be observed if we analyze the distribution of $\hat\beta_0$ instead.
### Simulation Study 3 {-}
Furthermore, (4.1) reveals that the variance of the OLS estimator for $\beta_1$ decreases as the variance of the $X_i$ increases. In other words, as we increase the amount of information provided by the regressor, that is, increasing $Var(X)$, which is used to estimate $\beta_1$, we become more confident that the estimate is close to the true value (i.e., $Var(\hat\beta_1)$ decreases).<br>
We can visualize this by reproducing Figure 4.6 from the book. To do this, we sample observations $(X_i,Y_i)$, $i=1,\dots,100$ from a bivariate normal distribution with
$$E(X)=E(Y)=5,$$
$$Var(X)=Var(Y)=5,$$
and
$$Cov(X,Y)=4.$$
Formally, this is written down as
$$\begin{pmatrix} X \\ Y \end{pmatrix}\overset{i.i.d.}{\sim} \ \mathcal{N}\left[\begin{pmatrix} 5 \\ 5 \end{pmatrix}, \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix} \right].\tag{4.3} $$
To carry out the random sampling, we make use of the function `r ttcode("mvrnorm()")` from the package `r ttcode("MASS")` [@R-MASS] which allows to draw random samples from multivariate normal distributions, see `?mvtnorm`. Next, we use `r ttcode("subset()")` to split the sample into two subsets such that the first set, `r ttcode("set1")`, consists of observations that fulfill the condition $\lvert X - \overline{X} \rvert > 1$ and the second set, `r ttcode("set2")`, includes the remainder of the sample. We then plot both sets and use different colors to distinguish the observations.
```{r, fig.align = 'center', cache = T}
# load the MASS package
library(MASS)
# set seed for reproducibility
set.seed(4)
# simulate bivariate normal data
bvndata <- mvrnorm(100,
mu = c(5, 5),
Sigma = cbind(c(5, 4), c(4, 5)))
# assign column names / convert to data.frame
colnames(bvndata) <- c("X", "Y")
bvndata <- as.data.frame(bvndata)
# subset the data
set1 <- subset(bvndata, abs(mean(X) - X) > 1)
set2 <- subset(bvndata, abs(mean(X) - X) <= 1)
# plot both data sets
plot(set1,
xlab = "X",
ylab = "Y",
pch = 19)
points(set2,
col = "steelblue",
pch = 19)
legend("topleft",
legend = c("Set1",
"Set2"),
cex = 1,
pch = 19,
col = c("black","steelblue"))
```
It is clear that observations that are close to the sample average of the $X_i$ have less variance than those that are farther away. Now, if we were to draw a line as accurately as possible through either of the two sets it is intuitive that choosing the observations indicated by the black dots, i.e., using the set of observations which has larger variance than the blue ones, would result in a more precise line. Now, let us use OLS to estimate slope and intercept for both sets of observations. We then plot the observations along with both regression lines.
```{r, fig.align='center'}
# estimate both regression lines
lm.set1 <- lm(Y ~ X, data = set1)
lm.set2 <- lm(Y ~ X, data = set2)
# plot observations
plot(set1, xlab = "X", ylab = "Y", pch = 19)
points(set2, col = "steelblue", pch = 19)
# add both lines to the plot
abline(lm.set1, col = "black",lwd=2)
abline(lm.set2, col = "steelblue",lwd=2)
legend("bottomright",
legend = c("Set1",
"Set2"),
cex = 1,
lwd=2,
col = c("black","steelblue"))
```
Evidently, the green regression line does far better in describing data sampled from the bivariate normal distribution stated in (<a href="#mjx-eqn-4.3">4.3</a>) than the red line. This is a nice example for demonstrating why we are interested in a high variance of the regressor $X$: more variance in the $X_i$ means more information from which the precision of the estimation benefits.
## Exercises {#exercises-4}
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 1. Class Sizes and Test Scores {-}
A researcher wants to analyze the relationship between class size (measured by the student-teacher ratio) and the average test score. Therefore he measures both variables in $10$ different classes and ends up with the following results.
<!--html_preserve-->
<table>
<tr>
<td><b>Class Size</b></td>
<td>23</td>
<td>19</td>
<td>30</td>
<td>22</td>
<td>23</td>
<td>29</td>
<td>35</td>
<td>36</td>
<td>33</td>
<td>25</td>
</tr>
<tr>
<td><b>Test Score</b></td>
<td>430</td>
<td>430</td>
<td>333</td>
<td>410</td>
<td>390</td>
<td>377</td>
<td>325</td>
<td>310</td>
<td>328</td>
<td>375</td>
</tr>
</table>
<!--/html_preserve-->
**Instructions:**
+ Create the vectors <tt>cs</tt> (the class size) and <tt>ts</tt> (the test score), containing the observations above.
+ Draw a scatterplot of the results using <tt>plot()</tt>.
<iframe src="DCL/ex4_1.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
</div>')
} else {
cat('\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}')
}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=='html') {
cat('
<div class = "DCexercise">
#### 2. Mean, Variance, Covariance and Correlation {-}
The vectors <tt>cs</tt> and <tt>ts</tt> are available in the working environment (you can check this: type their names into the console and press enter).
**Instructions:**
+ Compute the mean, the sample variance and the sample standard deviation of <tt>ts</tt>.
+ Compute the covariance and the correlation coefficient for <tt>ts</tt> and <tt>cs</tt>.
<iframe src="DCL/ex4_2.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
**Hint:** Use the <tt>R</tt> functions presented in this chapter: <tt>mean()</tt>, <tt>sd()</tt>, <tt>cov()</tt>, <tt>cor()</tt> and <tt>var()</tt>.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 3. Simple Linear Regression {-}
The vectors <tt>cs</tt> and <tt>ts</tt> are available in the working environment.
**Instructions:**
+ The function <tt>lm()</tt> is part of the package <tt>AER</tt>. Attach the package using <tt>library()</tt>.
+ Use <tt>lm()</tt> to estimate the regression model $$TestScore_i = \\beta_0 + \\beta_1 STR_i + u_i.$$ Assign the result to <tt>mod</tt>.
+ Obtain a statistical summary of the model.
<iframe src="DCL/ex4_3.html" frameborder="0" scrolling="no" style="width:100%;height:340px"></iframe>
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 4. The Model Object {-}