-
Notifications
You must be signed in to change notification settings - Fork 12
/
Copy path10-sample-designs-replicate-weights.Rmd
848 lines (600 loc) · 53.8 KB
/
10-sample-designs-replicate-weights.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
# (PART) Real-life data {-}
# Sample designs and replicate weights {#c10-sample-designs-replicate-weights}
```{r}
#| label: samp-styler
#| include: false
knitr::opts_chunk$set(tidy = 'styler')
```
::: {.prereqbox-header}
`r if (knitr:::is_html_output()) '### Prerequisites {- #prereq3}'`
:::
::: {.prereqbox data-latex="{Prerequisites}"}
For this chapter, load the following packages:
```{r}
#| label: samp-setup-libraries
#| error: FALSE
#| warning: FALSE
#| message: FALSE
library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
```
To help explain the different types of sample designs, this chapter uses the `api` and `scd` data that are included in the {survey} package [@lumley2010complex]:
```{r}
#| label: samp-setup-surveydata
data(api)
data(scd)
```
This chapter uses data from the Residential Energy Consumption Survey (RECS), both 2015 and 2020, so we load the RECS data from the {srvyrexploR} package using their object names `recs_2015` and `recs_2020`, respectively [@R-srvyrexploR].
:::
## Introduction
The primary reason for using packages like {survey} and {srvyr} is to account for the sampling design or replicate weights into point and uncertainty estimates [@R-srvyr; @lumley2010complex]. By incorporating the sampling design or replicate weights, these estimates are appropriately calculated.
In this chapter, we introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we show the math behind the estimates, the functions in these packages handle the calculation. To deeply understand the math and the derivation, refer to @pennstate506, @sarndal2003model, @wolter2007introduction, or @fuller2011sampling (these are listed in order of increasing statistical rigorousness).
\index{Survey analysis process|(}The general process for estimation in the {srvyr} package is to:
1. Create a `tbl_svy` object (a survey object) using: `as_survey_design()` or `as_survey_rep()`
2. Subset data (if needed) using `filter()` (subpopulations)
3. Specify domains of analysis using `group_by()`
4. Within `summarize()`, specify variables to calculate, including means, totals, proportions, quantiles, and more
This chapter includes details on the first step: creating the survey object. Once this survey object is created, it can be used in the other steps (detailed in Chapters \@ref(c05-descriptive-analysis) through \@ref(c07-modeling)) to account for the complex survey design.\index{Survey analysis process|)}
## Common sampling designs
A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, we specify the levels of sampling along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many 6th-grade students across the country that record represents. Generally, the weights represent the inverse of the probability of selection, such that the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records.
Some common terminology across the designs are:
- sample size, generally denoted as $n$, is the number of units selected to be sampled
- population size, generally denoted as $N$, is the number of units in the population of interest
- \index{Sampling frame|(}sampling frame, the list of units from which the sample is drawn (see Chapter \@ref(c02-overview-surveys) for more information)\index{Sampling frame|)}
### Simple random sample without replacement
\index{Simple random sampling|(}
The simple random sample (SRS) without replacement is a sampling design in which a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection. Without replacement refers to the fact that once a sampling unit has been selected, it is removed from the sample frame and cannot be selected again.
- Requirements: The sampling frame must include the entire population.
- Advantages: SRS requires no information about the units apart from contact information.
- Disadvantages: The sampling frame may not be available for the entire population.
- Example: Randomly select students in a university from a roster provided by the registrar's office.
#### The math {-}
The estimate for the population mean of variable $y$ is:
$$\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i$$
where $\bar{y}$ represents the sample mean, $n$ is the total number of respondents (or observations), and $y_i$ is each individual value of $y$.
The estimate of the standard error of the mean is:
$$se(\bar{y})=\sqrt{\frac{s^2}{n}\left( 1-\frac{n}{N} \right)}$$ where
$$s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2.$$
\index{Finite population correction|(} \index{fpc|see {Finite population correction}}
and $N$ is the population size. This standard error estimate might look very similar to equations in other statistical applications except for the part on the right side of the equation: $1-\frac{n}{N}$. This is called the finite population correction (FPC) factor. If the size of the frame, $N$, is very large in comparison to the sample, the FPC is negligible, so it is often ignored. A common guideline is if the sample is less than 10% of the population, the FPC is negligible.
To estimate proportions, we define $x_i$ as the indicator if the outcome is observed. That is, $x_i=1$ if the outcome is observed, and $x_i=0$ if the outcome is not observed for respondent $i$. Then the estimated proportion from an SRS design is:
$$\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i $$
and the estimated standard error of the proportion is:
$$se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1-\frac{n}{N}\right)} $$
#### The syntax {-}
\index{Functions in srvyr!as\_survey\_design|(}
If a sample was drawn through SRS and had no nonresponse or other weighting adjustments, we specify this design in R as:
```r
srs1_des <- dat %>%
as_survey_design(fpc = fpcvar)
```
where `dat` is a tibble or data.frame with the survey data, and `fpcvar` is a variable in the data indicating the sampling frame's size (this variable has the same value for all cases in an SRS design). If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and we specify the design as:
```r
srs2_des <- dat %>%
as_survey_design()
```
If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:
```r
srs3_des <- dat %>%
as_survey_design(weights = wtvar,
fpc = fpcvar)
```
where `wtvar` is a variable in the data indicating the weight for each case. Again, the FPC can be omitted if it is unnecessary because the frame is large compared to the sample size.
#### Example {-}
The {survey} package in R provides some example datasets that we use throughout this chapter. One of the example datasets we use is from the Academic Performance Index Program (APIP). The APIP program administered by the California Department of Education, and the {survey} package includes a population file (sample frame) of all schools with at least 100 students and several different samples pulled from that data using different sampling methods. For this first example, we use the `apisrs` dataset, which contains an SRS of 200 schools. For printing purposes, we create a new dataset called `apisrs_slim`, which sorts the data by the school district and school ID and subsets the data to only a few columns. The SRS sample data are illustrated below:
```{r}
#| label: samp-des-apisrs-display
#| message: false
apisrs_slim <-
apisrs %>%
as_tibble() %>%
arrange(dnum, snum) %>%
select(cds, dnum, snum, dname, sname, fpc, pw)
apisrs_slim
```
Table \@ref(tab:apidata) provides details on all the variables in this dataset.
Table: (\#tab:apidata) Overview of Variables in APIP Data
Variable Name | Description
:--: | --------
`cds` | Unique identifier for each school
`dnum` | School district identifier within county
`snum` | School identifier within district
`dname` | District Name
`sname` | School Name
`fpc` | Finite population correction factor
`pw` | Weight
To create the `tbl_survey` object for the SRS data, we specify the design as:
```{r}
#| label: samp-des-apisrs-des
apisrs_des <- apisrs_slim %>%
as_survey_design(weights = pw,
fpc = fpc)
apisrs_des
```
In the printed design object, the design is described as an "Independent Sampling design," which is another term for SRS. The ids are specified as `1`, which means there is no clustering (a topic described in Section \@ref(samp-cluster)), the FPC variable is indicated, and the weights are indicated. We can also look at the summary of the design object (`summary()`) and see the distribution of the probabilities (inverse of the weights) along with the population size and a list of the variables in the dataset.
```{r}
#| label: samp-des-apisrs-summary
summary(apisrs_des)
```
\index{Finite population correction|)}
### Simple random sample with replacement
Similar to the SRS design, the simple random sample with replacement (SRSWR) design randomly selects the sample from the entire sampling frame. However, while SRS removes sampled units before selecting again, the SRSWR instead replaces each sampled unit before drawing again, so units can be selected more than once.
- Requirements: The sampling frame must include the entire population.
- Advantages: SRSWR requires no information about the units apart from contact information.
- Disadvantages:
- The sampling frame may not be available for the entire population.
- Units can be selected more than once, resulting in a smaller realized sample size because receiving duplicate information from a single respondent does not provide additional information.
- For small populations, SRSWR has larger standard errors than SRS designs.
- Example: A professor puts all students' names on paper slips and selects them randomly to ask students questions, but the professor replaces the paper after calling on the student so they can be selected again at any time.
In general for surveys, using an SRS design (without replacement) is preferred as we do not want respondents to answer a survey more than once.
#### The math {-}
The estimate for the population mean of variable $y$ is:
$$\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i$$
and the estimate of the standard error of mean is:
$$se(\bar{y})=\sqrt{\frac{s^2}{n}}$$ where
$$s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2.$$
To calculate the estimated proportion, we define $x_i$ as the indicator that the outcome is observed (as we did with SRS):
$$\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i $$
and the estimated standard error of the proportion is:
$$se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$
#### The syntax {-}
If we had a sample that was drawn through SRSWR and had no nonresponse or other weighting adjustments, in R, we specify this design as:
```r
srswr1_des <- dat %>%
as_survey_design()
```
\index{Finite population correction|(}
where `dat` is a tibble or data.frame containing our survey data. This syntax is the same as an SRS design, except an FPC is not included. This is because when calculating a sample with replacement, the population pool to select from is no longer finite, so a correction is not needed. Therefore, with large populations where the FPC is negligible, the underlying formulas for SRS and SRSWR designs are the same.
\index{Finite population correction|)}
If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:
```r
srswr2_des <- dat %>%
as_survey_design(weights = wtvar)
```
where `wtvar` is the variable for the weight of the data.
#### Example {-}
The {survey} package does not include an example of SRSWR. To illustrate this design, we need to create an example. We use the APIP population data provided by the {survey} package (`apipop`) and select a sample of 200 cases using the `slice_sample()` function from the tidyverse. One of the arguments in the `slice_sample()` function is `replace`. If `replace=TRUE`, then we are conducting an SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset `apisrswr`.
```{r}
#| label: samp-des-apisrs-wr-display
set.seed(409963)
apisrswr <- apipop %>%
as_tibble() %>%
slice_sample(n = 200, replace = TRUE) %>%
select(cds, dnum, snum, dname, sname) %>%
mutate(weight = nrow(apipop) / 200)
head(apisrswr)
```
Because this is an SRS design with replacement, there may be duplicates in the data. It is important to keep the duplicates in the data for proper estimation. For reference, we can view the duplicates in the example data we just created.
```{r}
#| label: samp-des-apisrs-wr-duplicates
apisrswr %>%
group_by(cds) %>%
filter(n() > 1) %>%
arrange(cds)
```
We created a weight variable in this example data, which is the inverse of the probability of selection. We specify the sampling design for `apisrswr` as:
```{r}
#| label: samp-des-apisrswr-des
apisrswr_des <- apisrswr %>%
as_survey_design(weights = weight)
apisrswr_des
summary(apisrswr_des)
```
\index{Finite population correction|(}
In the output above, the design object and the object summary are shown. Both note that the sampling is done "with replacement" because no FPC was specified. The probabilities, which are derived from the weights, are summarized in the summary function output.
\index{Finite population correction|)}\index{Simple random sampling|)}
### Stratified sampling \index{Stratified sampling|(} \index{Strata|(}
Stratified sampling occurs when a population is divided into mutually exclusive subpopulations (strata), and then samples are selected independently within each stratum.
- Requirements: The sampling frame must include the information to divide the population into strata for every unit.
- Advantages:
- This design ensures sample representation in all subpopulations.
- If the strata are correlated with survey outcomes, a stratified sample has smaller standard errors compared to a SRS sample of the same size.
- This results in a more efficient design.
- Disadvantages: Auxiliary data may not exist to divide the sampling frame into strata, or the data may be outdated.
- Examples:
- Example 1: A population of North Carolina residents could be stratified into urban and rural areas, and then an SRS of residents from both rural and urban areas is selected independently. This ensures there are residents from both areas in the sample.
- Example 2: Law enforcement agencies could be stratified into the three primary general-purpose categories in the U.S.: local police, sheriff's departments, and state police. An SRS of agencies from each of the three types is then selected independently to ensure all three types of agencies are represented.
#### The math {-}
Let $\bar{y}_h$ be the sample mean for stratum $h$, $N_h$ be the population size of stratum $h$, $n_h$ be the sample size of stratum $h$, and $H$ be the total number of strata. Then, the estimate for the population mean under stratified SRS sampling is:
$$\bar{y}=\frac{1}{N}\sum_{h=1}^H N_h\bar{y}_h$$
and the estimate of the standard error of $\bar{y}$ is:
$$se(\bar{y})=\sqrt{\frac{1}{N^2} \sum_{h=1}^H N_h^2 \frac{s_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)} $$
where
$$s_h^2=\frac{1}{n_h-1}\sum_{i=1}^{n_h}\left(y_{i,h}-\bar{y}_h\right)^2$$
For estimates of proportions, let $\hat{p}_h$ be the estimated proportion in stratum $h$. Then, the population proportion estimate is:
$$\hat{p}= \frac{1}{N}\sum_{h=1}^H N_h \hat{p}_h$$
The standard error of the proportion is:
$$se(\hat{p}) = \frac{1}{N} \sqrt{ \sum_{h=1}^H N_h^2 \frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1} \left(1-\frac{n_h}{N_h}\right)}$$
#### The syntax {-}
\index{Finite population correction|(}
In addition to the `fpc` and `weights` arguments discussed in the types above, stratified designs require the addition of the `strata` argument. For example, to specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are not too large and are known, we specify the design as:
```r
stsrs1_des <- dat %>%
as_survey_design(fpc = fpcvar,
strata = stratavar)
```
where `fpcvar` is a variable on our data that indicates $N_h$ for each row, and `stratavar` is a variable indicating the stratum for each row. We can omit the FPC if it is not applicable. Additionally, we can indicate the weight variable if it is present where `wtvar` is a variable on our data with a numeric weight.
```r
stsrs2_des <- dat %>%
as_survey_design(weights = wtvar,
strata = stratavar)
```
#### Example {-}
In the example APIP data, `apistrat` is a stratified random sample, stratified by school type (`stype`) with three levels: `E` for elementary school, `M` for middle school, and `H` for high school. As with the SRS example above, we sort and select specific variables for use in printing. The data are illustrated below, including a count of the number of cases per stratum:
```{r}
#| label: samp-des-apistrat-dis
apistrat_slim <-
apistrat %>%
as_tibble() %>%
arrange(dnum, snum) %>%
select(cds, dnum, snum, dname, sname, stype, fpc, pw)
apistrat_slim %>%
count(stype, fpc)
```
The FPC is the same for each case within each stratum. This output also shows that 100 elementary schools, 50 middle schools, and 50 high schools were sampled. It is often common for the number of units sampled from each strata to be different based on the goals of the project, or to mirror the size of each strata in the population. We specify the design as:
```{r}
#| label: samp-des-apistrat-des
apistrat_des <- apistrat_slim %>%
as_survey_design(strata = stype,
weights = pw,
fpc = fpc)
apistrat_des
summary(apistrat_des)
```
When printing the object, it is specified as a "Stratified Independent Sampling design," also known as a stratified SRS, and the strata variable is included. Printing the summary, we see a distribution of probabilities, as we saw with SRS; but we also see the sample and population sizes by stratum. \index{Finite population correction|)} \index{Stratified sampling|)} \index{Strata|)}
### Clustered sampling {#samp-cluster}
\index{Clustered sampling|(} \index{Primary sampling unit|(}
Clustered sampling occurs when a population is divided into mutually exclusive subgroups called clusters or primary sampling units (PSUs). A random selection of PSUs is sampled, and then another level of sampling is done within these clusters. There can be multiple levels of this selection. \index{Data collection|(}Clustered sampling is often used when a list of the entire population is not available or data collection involves interviewers needing direct contact with respondents.
- Requirements: There must be a way to divide the population into clusters. Clusters are commonly structural, such as institutions (e.g., schools, prisons) or geography (e.g., states, counties).
- Advantages:
- Clustered sampling is advantageous when data collection is done in person, so interviewers are sent to specific sampled areas rather than completely at random across a country. \index{Data collection|)}
- With clustered sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students, but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools.
- Disadvantages: Compared to a simple random sample for the same sample size, clustered samples generally have larger standard errors of estimates.
- Examples:
- Example 1: Consider a study needing a sample of 6th-grade students in the United States. No list likely exists of all these students. However, it is more likely to obtain a list of schools that enroll 6th graders, so a study design could select a random sample of schools that enroll 6th graders. The selected schools can then provide a list of students to do a second stage of sampling where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design (the one representing the number of clusters) and is the type of design we discuss in the formulas below.
- Example 2: Consider a study sending interviewers to households for a survey. This is a more complicated example that requires two levels of clustering (two-stage sample design) to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU and then census block groups within counties could be selected as the secondary sampling unit (SSU). Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys, as it reduces the travel necessary for interviewers.
#### The math {-}
Consider a survey where $a$ clusters are sampled from a population of $A$ clusters via SRS. Within each sampled cluster, $i$, there are $B_i$ units in the population, and $b_i$ units are sampled via SRS. Let $\bar{y}_{i}$ be the sample mean of cluster $i$. Then, a ratio estimator of the population mean is:
$$\bar{y}=\frac{\sum_{i=1}^a B_i \bar{y}_{i}}{ \sum_{i=1}^a B_i}$$
Note this is a consistent but biased estimator. Often the population size is not known, so this is a method to estimate a mean without knowing the population size. The estimated standard error of the mean is:
$$se(\bar{y})= \frac{1}{\hat{N}}\sqrt{\left(1-\frac{a}{A}\right)\frac{s_a^2}{a} + \frac{A}{a} \sum_{i=1}^a \left(1-\frac{b_i}{B_i}\right) \frac{s_i^2}{b_i} }$$
where $\hat{N}$ is the estimated population size, $s_a^2$ is the between-cluster variance, and $s_i^2$ is the within-cluster variance.
The formula for the between-cluster variance ($s_a^2$) is:
$$s_a^2=\frac{1}{a-1}\sum_{i=1}^a \left( \hat{y}_i - \frac{\sum_{i=1}^a \hat{y}_{i} }{a}\right)^2$$
where $\hat{y}_i =B_i\bar{y_i}$.
The formula for the within-cluster variance ($s_i^2$) is:
$$s_i^2=\frac{1}{a(b_i-1)} \sum_{j=1}^{b_i} \left(y_{ij}-\bar{y}_i\right)^2$$
where $y_{ij}$ is the outcome for sampled unit $j$ within cluster $i$.
#### The syntax {-}
\index{Finite population correction|(}
Clustered sampling designs require the addition of the `ids` argument, which specifies the cluster level variable(s). To specify a two-stage clustered design without replacement, we specify the design as:
```r
clus2_des <- dat %>%
as_survey_design(weights = wtvar,
ids = c(PSU, SSU),
fpc = c(A, B))
```
where `PSU` and `SSU` are the variables indicating the PSU and SSU identifiers, and `A` and `B` are the variables indicating the population sizes for each level (i.e., `A` is the number of clusters, and `B` is the number of units within each cluster). Note that `A` is the same for all records, and `B` is the same for all records within the same cluster.
If clusters were sampled with replacement or from a very large population, the FPC is unnecessary. Additionally, only the first stage of selection is necessary regardless of whether the units were selected with replacement at any stage. The subsequent stages of selection are ignored in computation as their contribution to the variance is overpowered by the first stage (see @sarndal2003model or @wolter2007introduction for a more in-depth discussion). Therefore, the two design objects specified below yield the same estimates in the end:
\index{Finite population correction|)}
```r
clus2ex1_des <- dat %>%
as_survey_design(weights = wtvar,
ids = c(PSU, SSU))
clus2ex2_des <- dat %>%
as_survey_design(weights = wtvar,
ids = PSU)
```
\index{Strata|(} \index{Primary sampling unit|(}
Note that there is one additional argument that is sometimes necessary, which is `nest = TRUE`. This option relabels cluster IDs to enforce nesting within strata. Sometimes, as an example, there may be a cluster `1` within each stratum, but cluster `1` in stratum `1` is a different cluster than cluster `1` in stratum `2`. These are actually different clusters. This option indicates that repeated numbering does not mean it is the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error is generated.
\index{Strata|)} \index{Primary sampling unit|)}
#### Example {-}
\index{Finite population correction|(}
The `survey` package includes a two-stage cluster sample data, `apiclus2`, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by `dnum`, and schools are identified by `snum`. The variable `fpc1` indicates how many districts there are in California (the total number of PSUs or `A`), and `fpc2` indicates how many schools were in a given district with at least 100 students (the total number of SSUs or `B`). The data include a row for each school. In the data printed below, there are 757 school districts, as indicated by `fpc1`, and there are nine schools in District 731, one school in District 742, two schools in District 768, and so on as indicated by `fpc2`. For illustration purposes, the object `apiclus2_slim` has been created from `apiclus2`, which subsets the data to only the necessary columns and sorts the data.
```{r}
#| label: samp-des-api2clus-dis
apiclus2_slim <-
apiclus2 %>%
as_tibble() %>%
arrange(desc(dnum), snum) %>%
select(cds, dnum, snum, fpc1, fpc2, pw)
apiclus2_slim
```
To specify this design in R, we use the following:
```{r}
#| label: samp-des-api2clus-des
apiclus2_des <- apiclus2_slim %>%
as_survey_design(
ids = c(dnum, snum),
fpc = c(fpc1, fpc2),
weights = pw
)
apiclus2_des
summary(apiclus2_des)
```
The design objects are described as "2 - level Cluster Sampling design," and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities of selection and the population size (number of PSUs) as 757.
\index{Functions in srvyr!as\_survey\_design|)} \index{Finite population correction|)} \index{Clustered sampling|)} \index{Primary sampling unit|)}
## Combining sampling methods {#samp-combo}
\index{Finite population correction|)} \index{Stratified sampling|(} \index{Clustered sampling|(} \index{Primary sampling unit|(} \index{Strata|(}
SRS, stratified, and clustered designs are the backbone of sampling designs, and the features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units), and units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population. Units are sorted by a feature, and then every $k$ units is selected from a random start point so the sample is spread across the population. \index{Burden|(}In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment.\index{Burden|)} To learn more about sampling designs, refer to @valliant2013practical, @cox2011business, @cochran1977sampling, and @deming1991sample.
A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation highlights the variables necessary to specify the design. This is often found in the user guide, methodology report, analysis guide, or technical documentation (see Chapter \@ref(c03-survey-data-documentation) for more details).
\index{Finite population correction|)} \index{Stratified sampling|)} \index{Clustered sampling|)} \index{Primary sampling unit|)} \index{Strata|(}
### Example {-}
\index{Functions in srvyr!as\_survey\_design|(} \index{Finite population correction|)} \index{Stratified sampling|(} \index{Clustered sampling|(} \index{Primary sampling unit|(} \index{Strata|(}
For example, the [2017-2019 National Survey of Family Growth](https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf) had a stratified multi-stage area probability sample:
1. In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS.
2. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection.
3. In the third stage, housing units were selected within the sampled neighborhoods.
4. In the fourth stage, a person was randomly chosen among eligible persons within the selected housing units using unequal probabilities based on the person's age and sex.
The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is `SEST`, the cluster variable is `SECU`, and the weight variable is `WGT2017_2019`. Thus, to specify this design in R, we use the following syntax:
```r
nsfg_des <- nsfgdata %>%
as_survey_design(ids = SECU,
strata = SEST,
weights = WGT2017_2019)
```
\index{Functions in srvyr!as\_survey\_design|)} \index{Finite population correction|)} \index{Stratified sampling|)} \index{Clustered sampling|)} \index{Primary sampling unit|)} \index{Strata|)}
## Replicate weights
\index{Replicate weights|(}
Replicate weights are often included on analysis files instead of, or in addition to, the design variables (strata and PSUs). Replicate weights are used as another method to estimate variability. Often, researchers choose to use replicate weights to avoid publishing design variables (strata or clustering variables) as a measure to reduce the risk of disclosure. There are several types of replicate weights, including balanced repeated replication (BRR), Fay's BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows:
1. Divide the sample into subsample replicates that mirror the design of the sample
2. Calculate weights for each replicate using the same procedures for the full-sample weight (i.e., nonresponse and post-stratification)
3. Calculate estimates for each replicate using the same method as the full-sample estimate
4. Calculate the estimated variance, which is proportional to the variance of the replicate estimates
The different types of replicate weights largely differ between step 1 (how the sample is divided into subsamples) and step 4 (which multiplication factors, scales, are used to multiply the variance). The general format for the standard error is:
$$ \sqrt{\alpha \sum_{r=1}^R \alpha_r (\hat{\theta}_r - \hat{\theta})^2 }$$
where $R$ is the number of replicates, $\alpha$ is a constant that depends on the replication method, $\alpha_r$ is a factor associated with each replicate, $\hat{\theta}$ is the weighted estimate based on the full sample, and $\hat{\theta}_r$ is the weighted estimate of $\theta$ based on the $r^{\text{th}}$ replicate.
To create the design object for surveys with replicate weights, we use `as_survey_rep()` instead of `as_survey_design()`, which we use for the common sampling designs in the sections above.
### Balanced Repeated Replication method
\index{Replicate weights!Balanced repeated replication (BRR)|(} \index{Primary sampling unit|(} \index{Strata|(}
The balanced repeated replication (BRR) method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or --1 with mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had seven strata, the Hadamard matrix would be an $8\times8$ matrix. Additionally, a survey with eight strata would also have an $8\times8$ Hadamard matrix. The columns in the matrix specify the strata, and the rows specify the replicate. In each replicate (row), a +1 means to use the first PSU, and a --1 means to use the second PSU in the estimate. For example, here is a $4\times4$ Hadamard matrix:
$$ \begin{array}{rrrr} +1 &+1 &+1 &+1\\ +1&-1&+1&-1\\ +1&+1&-1&-1\\ +1 &-1&-1&+1 \end{array} $$
In the first replicate (row), all the values are +1; so in each stratum, the first PSU would be used in the estimate. In the second replicate, the first PSU would be used in strata 1 and 3, while the second PSU would be used in strata 2 and 4. In the third replicate, the first PSU would be used in strata 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3. For more information about Hadamard matrices, see @wolter2007introduction. Note that supplied BRR weights from a data provider already incorporate this adjustment, and the {survey} package generates the Hadamard matrix, if necessary, for calculating BRR weights; so an analyst does not need to create or provide the matrix.
\index{Primary sampling unit|)} \index{Strata|)}
#### The math {-}
A weighted estimate for the full sample is calculated as $\hat{\theta}$, and then a weighted estimate for each replicate is calculated as $\hat{\theta}_r$ for $R$ replicates. Using the generic notation above, $\alpha=\frac{1}{R}$ and $\alpha_r=1$ for each $r$. The standard error of the estimate is calculated as follows:
$$se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$
Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for the mean squared error (MSE). If `mse=TRUE`, variances are computed around the point estimate $(\hat{\theta})$; whereas if `mse=FALSE`, variances are computed around the mean of the replicates $(\bar{\theta})$ instead, which looks like this:
$$se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\bar{\theta}\right)^2}$$ where $$\bar{\theta}=\frac{1}{R}\sum_{r=1}^R \hat{\theta}_r$$
The default option for `mse` is to use the global option of "survey.replicates.mse," which is set to `FALSE` initially unless a user changes it. To determine if `mse` should be set to `TRUE` or `FALSE`, read the survey documentation. If there is no indication in the survey documentation for BRR, we recommend setting `mse` to `TRUE`, as this is the default in other software (e.g., SAS, SUDAAN).
#### The syntax {-}
\index{Functions in srvyr!as\_survey\_rep|(} \index{American Community Survey (ACS)|(}
Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, ..., PWGTP80 for the person weights in the American Community Survey (ACS) [@acs-pums-2021] or BRRWT1, BRRWT2, ..., BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) [@recs-2015-micro]. This makes it easy to use some of the [tidy selection](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html) functions in R.
\index{American Community Survey (ACS)|)}
To specify a BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as BRR (`type = BRR`), and whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`). For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated WT1, WT2, ..., WT20, we can use the following syntax (both are equivalent):
```r
brr_des <- dat %>%
as_survey_rep(weights = WT0,
repweights = all_of(str_c("WT", 1:20)),
type = "BRR",
mse = TRUE)
brr_des <- dat %>%
as_survey_rep(weights = WT0,
repweights = num_range("WT", 1:20),
type = "BRR",
mse = TRUE)
```
If a dataset had WT for the main weight and had 20 BRR weights indicated REPWT1, REPWT2, ..., REPWT20, we can use the following syntax (both are equivalent):
```r
brr_des <- dat %>%
as_survey_rep(weights = WT,
repweights = all_of(str_c("REPWT", 1:20)),
type = "BRR",
mse = TRUE)
brr_des <- dat %>%
as_survey_rep(weights = WT,
repweights = starts_with("REPWT"),
type = "BRR",
mse = TRUE)
```
If the replicate weight variables are in the file consecutively, we can also use the following syntax:
```r
brr_des <- dat %>%
as_survey_rep(weights = WT,
repweights = REPWT1:REPWT20,
type = "BRR",
mse = TRUE)
```
Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method is used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter \@ref(c03-survey-data-documentation) for more information on reading documentation). In this case, the replicate weights are not combined, and the option `combined_weights = FALSE` should be indicated, as the default value for this argument is `TRUE`. This specific syntax is shown below:
```r
brr_des <- dat %>%
as_survey_rep(weights = WT,
repweights = starts_with("REPWT"),
type = "BRR",
combined_weights = FALSE,
mse = TRUE)
```
#### Example {-}
\index{Primary sampling unit|(}
The {survey} package includes a data example from section 12.2 of @levy2013sampling. In this fictional data, two out of five ambulance stations were sampled from each of three emergency service areas (ESAs); thus BRR weights are appropriate with two PSUs (stations) sampled in each stratum (ESA). In the code below, we create BRR weights as was done by @levy2013sampling.
```{r}
#| label: samp-des-brr-display
scdbrr <- scd %>%
as_tibble() %>%
mutate(wt = 5 / 2,
rep1 = 2 * c(1, 0, 1, 0, 1, 0),
rep2 = 2 * c(1, 0, 0, 1, 0, 1),
rep3 = 2 * c(0, 1, 1, 0, 0, 1),
rep4 = 2 * c(0, 1, 0, 1, 1, 0))
scdbrr
```
To specify the BRR weights, we use the following syntax:
```{r}
#| label: samp-scdbrr-des
scdbrr_des <- scdbrr %>%
as_survey_rep(type = "BRR",
repweights = starts_with("rep"),
combined_weights = FALSE,
weight = wt)
scdbrr_des
summary(scdbrr_des)
```
Note that `combined_weights` was specified as `FALSE` because these weights are simply specified as 0 and 2 and do not incorporate the overall weight. When printing the object, the type of replication is noted as Balanced Repeated Replicates, and the replicate weights and the weight variable are specified. Additionally, the summary lists the variables included in the data and design object.
\index{Primary sampling unit|)}
### Fay's BRR method
Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR, half of the PSUs have a replicate weight, which is the main weight multiplied by $\rho$, and the other half have the main weight multiplied by $(2-\rho)$, where $0 \le \rho < 1$. Note that when $\rho=0$, this is equivalent to the standard BRR weights, and as $\rho$ becomes closer to 1, this method is more similar to jackknife discussed in Section \@ref(samp-jackknife). To obtain the value of $\rho$, it is necessary to read the survey documentation (see Chapter \@ref(c03-survey-data-documentation)).
#### The math {-}
The standard error estimate for $\hat{\theta}$ is slightly different than the BRR, due to the addition of the multiplier of $\rho$. Using the generic notation above, $\alpha=\frac{1}{R \left(1-\rho\right)^2}$ and $\alpha_r=1 \text{ for all } r$. The standard error is calculated as:
$$se(\hat{\theta})=\sqrt{\frac{1}{R (1-\rho)^2} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$
#### The syntax {-}
The syntax is very similar for BRR and Fay's BRR. To specify a Fay's BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as Fay's BRR (`type = Fay`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and Fay's multiplier (`rho`). For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated as WT1, WT2, ..., WT20, and Fay's multiplier is 0.3, we use the following syntax:
```r
fay_des <- dat %>%
as_survey_rep(weights = WT0,
repweights = num_range("WT", 1:20),
type = "Fay",
mse = TRUE,
rho = 0.3)
```
#### Example {-}
\index{Residential Energy Consumption Survey (RECS)|(}
The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96, and the documentation specifies a Fay's multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total energy cost, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the census region. We use the 2015 RECS data from the {srvyrexploR} package that provides data for this book (see the Prerequisites box at the beginning of this chapter). To specify the design for the `recs_2015` data, we use the following syntax:
```{r}
#| label: samp-des-recs-des
#| eval: TRUE
recs_2015_des <- recs_2015 %>%
as_survey_rep(weights = NWEIGHT,
repweights = BRRWT1:BRRWT96,
type = "Fay",
rho = 0.5,
mse = TRUE,
variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC))
recs_2015_des
summary(recs_2015_des)
```
In specifying the design, the `variables` option was also used to include which variables might be used in analyses. This is optional but can make our object smaller and easier to work with. When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Fay's variance method (rho= 0.5) with 96 replicates and MSE variances`, and the variables are included. No weight or probability summary is included in this output, as we have seen in some other design objects.\index{Replicate weights!Balanced repeated replication (BRR)|)} \index{Residential Energy Consumption Survey (RECS)|)}
### Jackknife method {#samp-jackknife}
\index{Replicate weights!Jackknife|(} \index{Primary sampling unit|(}
There are three jackknife estimators implemented in {srvyr}: jackknife 1 (JK1), jackknife n (JKn), and jackknife 2 (JK2). The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time so the number of replicates is the same as the number of PSUs. If there is no clustering, then the PSU is the ultimate sampling unit (e.g., students).
\index{Strata|(}
The JKn method is used for stratified designs and requires two or more PSUs per stratum. In this case, each replicate is created by deleting one PSU from a single stratum, so the number of replicates is the number of total PSUs across all strata. The JK2 method is a special case of JKn when there are exactly 2 PSUs sampled per stratum. For variance estimation, we also need to specify the scaling constants.
\index{Strata|)}
#### The math {-}
Using the generic notation above, $\alpha=\frac{R-1}{R}$ and $\alpha_r=1 \text{ for all } r$. For the JK1 method, the standard error estimate for $\hat{\theta}$ is calculated as:
$$se(\hat{\theta})=\sqrt{\frac{R-1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$
The JKn method is a bit more complex, but the coefficients are generally provided with restricted and public-use files. For each replicate, one stratum has a PSU removed, and the weights are adjusted by $n_h/(n_h-1)$ where $n_h$ is the number of PSUs in stratum $h$. The coefficients in other strata are set to 1. Denote the coefficient that results from this process for replicate $r$ as $\alpha_r$, then the standard error estimate for $\hat{\theta}$ is calculated as:
$$se(\hat{\theta})=\sqrt{\sum_{r=1}^R \alpha_r \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$
\index{Primary sampling unit|)}
#### The syntax {-}
To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as jackknife 1 (`type = "JK1"`), n (`type = "JKN"`), or 2 (`type = "JK2"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`). For example, if the survey is a jackknife 1 method with a multiplier of $\alpha_r=(R-1)/R=19/20=0.95$, the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, ..., WT20, we use the following syntax:
```r
jk1_des <- dat %>%
as_survey_rep(
weights = WT0,
repweights = num_range("WT", 1:20),
type = "JK1",
mse = TRUE,
scale = 0.95
)
```
For a jackknife n method, we need to specify the multiplier for all replicates. In this case, we use the `rscales` argument to specify each one. The documentation provides details on what the multipliers ($\alpha_r$) are, and they may be the same for all replicates. For example, consider a case where $\alpha_r=0.1$ for all replicates, and the dataset had WT0 for the main weight and had 20 replicate weights indicated as WT1, WT2, ..., WT20. We specify the type as `type = "JKN"`, and the multiplier as `rscales=rep(0.1,20)`:
```r
jkn_des <- dat %>%
as_survey_rep(
weights = WT0,
repweights = num_range("WT", 1:20),
type = "JKN",
mse = TRUE,
rscales = rep(0.1, 20)
)
```
#### Example {-}
\index{Residential Energy Consumption Survey (RECS)|(}
The 2020 RECS [@recs-2020-micro] uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of $(R-1)/R=59/60$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the census region. We use the 2020 RECS data from the {srvyrexploR} package that provides data for this book (see the Prerequisites box at the beginning of this chapter).
To specify this design, we use the following syntax:
```{r}
#| label: samp-des-recs2020-des
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
type = "JK1",
scale = 59/60,
mse = TRUE,
variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC)
)
recs_des
summary(recs_des)
```
```{r}
#| label: samp-des-recs2020-des-save
#| echo: false
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
type = "JK1",
scale = 59/60,
mse = TRUE
)
```
When printing the design object or looking at the summary, the replicate weight type is reiterated as `Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances`, and the variables are included. No weight or probability summary is included. \index{Replicate weights!Jackknife|)} \index{Residential Energy Consumption Survey (RECS)|)}
### Bootstrap method
\index{Replicate weights!Bootstrap|(} \index{Primary sampling unit|(}
In bootstrap resampling, replicates are created by selecting random samples of the PSUs with replacement (SRSWR). If there are $A$ PSUs in the sample, then each replicate is created by selecting a random sample of $A$ PSUs with replacement. Each replicate is created independently, and the weights for each replicate are adjusted to reflect the population, generally using the same method as how the analysis weight was adjusted.
\index{Primary sampling unit|)}
#### The math {-}
A weighted estimate for the full sample is calculated as $\hat{\theta}$, and then a weighted estimate for each replicate is calculated as $\hat{\theta}_r$ for $R$ replicates. Then the standard error of the estimate is calculated as follows:
$$se(\hat{\theta})=\sqrt{\alpha \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$
where $\alpha$ is the scaling constant. Note that the scaling constant ($\alpha$) is provided in the survey documentation, as there are many types of bootstrap methods that generate custom scaling constants.
#### The syntax {-}
To specify a bootstrap method, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as bootstrap (`type = "bootstrap"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`). For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated WT1, WT2, ..., WT20, and a multiplier of $\alpha=.02$, we use the following syntax:
```r
bs_des <- dat %>%
as_survey_rep(
weights = WT0,
repweights = num_range("WT", 1:20),
type = "bootstrap",
mse = TRUE,
scale = .02
)
```
\index{Functions in srvyr!as\_survey\_rep|)}
#### Example {-}
\index{Primary sampling unit|(}
Returning to the APIP example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with 50 replicate weights^[We provide the code here to replicate this example but are not focusing on the creation of the weights, as that is outside the scope of this book. We recommend referencing @wolter2007introduction for more information on creating bootstrap weights.].
```{r}
#| label: samp-des-genbs
apiclus1_slim <-
apiclus1 %>%
as_tibble() %>%
arrange(dnum) %>%
select(cds, dnum, fpc, pw)
set.seed(662152)
apibw <-
bootweights(psu = apiclus1_slim$dnum,
strata = rep(1, nrow(apiclus1_slim)),
fpc = apiclus1_slim$fpc,
replicates = 50)
bwmata <-
apibw$repweights$weights[apibw$repweights$index,] * apiclus1_slim$pw
apiclus1_slim <- bwmata %>%
as.data.frame() %>%
set_names(str_c("pw", 1:50)) %>%
cbind(apiclus1_slim) %>%
as_tibble() %>%
select(cds, dnum, fpc, pw, everything())
apiclus1_slim
```
The output of `apiclus1_slim` includes the same variables we have seen in other APIP examples (see Table \@ref(tab:apidata)), but now it additionally includes bootstrap weights `pw1`, ..., `pw50`. When creating the survey design object, we use the bootstrap weights as the replicate weights. Additionally, with replicate weights we need to include the scale ($\alpha$). For this example, we created:
$$\alpha=\frac{A}{(A-1)(R-1)}=\frac{15}{(15-1)*(50-1)}=0.02186589$$
where $A$ is the average number of PSUs per stratum, and $R$ is the number of replicates. There is only 1 stratum and the number of clusters/PSUs is 15 so $A=15$. Using this information, we specify the design object as:
\index{Primary sampling unit|)}
```{r}
#| label: samp-des-bsexamp
api1_bs_des <- apiclus1_slim %>%
as_survey_rep(weights = pw,
repweights = pw1:pw50,
type = "bootstrap",
scale = 0.02186589,
mse = TRUE)
api1_bs_des
summary(api1_bs_des)
```
As with other replicate design objects, when printing the object or looking at the summary, the replicate weights are provided along with the data variables. \index{Replicate weights|)} \index{Replicate weights!Bootstrap|)}
## Exercises
For this chapter, the exercises entail reading public documentation to determine how to specify the survey design. While reading the documentation, be on the lookout for description of the weights and the survey design variables or replicate weights.
1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS includes a wide variety of health topics for adults including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to receiving care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description [@nhis-svy-des]. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). We have imported the data and the variable containing the data as: `nhis_adult_data`. How would we specify the design using either `as_survey_design()` or `as_survey_rep()`?
2. The General Social Survey (GSS) is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook provides examples of setting up syntax in SAS and Stata but not R [@gss-codebook]. We have imported the data and the variable containing the data as: `gss_data`. How would we specify the design in R using either `as_survey_design()` or `as_survey_rep()`?