---
title: "Optimal subsampling applied to imbalanced learning"
subtitle: "Model selection"
author: "Patrick Altmeyer"
date: "`r format(Sys.Date(), '%d %B, %Y')`"
output:
  bookdown::html_document2:
    code_folding: hide
    number_sections: true
    toc: true
    toc_float: true
bibliography: bib.bib
---
```{r setup, include=F}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE, message = F, warning = F)
```
**Note**: *By default most code chunks are hidden. "Code" buttons can be used to unhide the code.*
This note investigates whether and how systematic subsampling can be applied to imbalanced learning. The structure is as follows: to set the stage for the remainder of the analysis, the first [section](#bias) briefly introduces the bias-variance trade-off. The following [section](#subsampling) then introduces various subsampling methods. Section \@ref(lin-reg) illustrates the improvements associated with non-uniform subsampling. The final two sections apply the developed ideas to binary classification problems with imbalanced training data. Section \@ref(concl) concludes.
# Bias-variance tradeoff {#bias}
```{r}
source("R/gauss_kernel.R")
source("R/sinusoidal.R")
source("R/sim_sinusoidal.R")
source("R/regularized_ls.R")
# Generate data: ----
n <- 25
p <- 24
a <- 0
b <- 1
x <- seq(a,b,length.out = n)
y <- sinusoidal(x)
v <- 0.3
```
To set the stage for the remainder of this note we will briefly revisit the bias-variance trade-off in this section. In particular we will illustrate the effect of varying the sample size $n$. Readers familiar with this topic may choose to skip this section.
As in @bishop2006pattern we consider synthetic data generated by the sinusoidal function $f(x)=\sin(2\pi x)$. To simulate random samples of $\mathbf{y}$ we sample $n$ input values from $\mathbf{X} \sim \text{unif}(0,1)$ and introduce a random noise component $\varepsilon \sim \mathcal{N}(0,`r v`)$. Figure \@ref(fig:p-sim) shows $\mathbf{y}$ along with random draws $\mathbf{y}^*_n$.
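The helper functions sourced at the top of this section are not displayed in the text. A minimal sketch of what `sinusoidal()` and `sim_sinusoidal()` might look like is given below; the argument names `a`, `b` and the way the noise enters are assumptions on my part, and the actual `R/sinusoidal.R` and `R/sim_sinusoidal.R` may differ in detail.
```{r, eval=F}
# Hypothetical sketch of the sourced helpers -- the actual R/ files may differ.
sinusoidal <- function(x) sin(2 * pi * x)

sim_sinusoidal <- function(n, a = 0, b = 1, sigma = 0.3) {
  u <- runif(n, min = a, max = b)                # uniform input draws
  x_star <- u + rnorm(n, mean = 0, sd = sigma)   # noisy inputs (as in the steps below)
  y_star <- sinusoidal(x_star)                   # sinusoidal response
  list(x_star = x_star, y_star = y_star)
}
```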
```{r p-sim, fig.cap="Sinusoidal function and random draws."}
# True data: ----
library(ggplot2)
library(data.table)
dt_true <- data.table(y,x)
pl <- ggplot(data=dt_true, aes(x=x, y=y)) +
geom_line()
# Simulate data: ----
n_draws <- 3
dt_star <- rbindlist(
lapply(
1:n_draws,
function(x) {
simulated <- sim_sinusoidal(n=n, sigma = v)
data.table(y=simulated$y_star, x=simulated$x_star, n=1:n, draw=x)
}
)
)
pl +
geom_point(
data = dt_star,
aes(x=x, y=y, colour=factor(draw))
) +
scale_colour_discrete(
name="Draw:"
)
```
```{r param-setup}
lambda <- c(
exp(2.6),
exp(-0.31),
exp(-2.4)
)
s <- 0.1
n_draws <- 100
mu <- seq(a,b,length.out = p)
```
Following @bishop2006pattern we will use a Gaussian linear model with Gaussian kernels $\exp(-\frac{(x_k-\mu_p)^{2}}{2s^2})$ as
$$
\begin{equation}
\begin{aligned}
&& \mathbf{y}|\mathbf{X}& \sim \mathcal{N} \left( \sum_{j=0}^{p} \phi_j(x)\beta_j, \ v \mathbb{I}_n \right) \\
\end{aligned}
(\#eq:model)
\end{equation}
$$
with $v=`r v`$ to estimate $\hat{\mathbf{y}}_k$ from random draws $\mathbf{X}_k$. We fix the number of kernels $p=`r p`$ (and hence the number of features $M=p+1=`r p+1`$) as well as the spatial scale $s=`r s`$. To vary the complexity of the model we use a form of regularized least-squares (*Ridge regression*) and let the regularization parameter $\lambda$ vary
$$
\begin{equation}
\begin{aligned}
&& \hat\beta&=(\lambda I + \Phi^T \Phi)^{-1}\Phi^Ty \\
\end{aligned}
(\#eq:reg-ls)
\end{equation}
$$
where high values of $\lambda$ in \@ref(eq:reg-ls) shrink parameter values towards zero. (Note that a choice $\lambda=0$ corresponds to the OLS estimator which is defined as long as $p \le n$.)
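The sourced `regularized_ls()` is likewise not shown in the text. A minimal sketch that solves \@ref(eq:reg-ls) directly is given below; the actual `R/regularized_ls.R` may be implemented differently (e.g. via a QR decomposition of an augmented system).
```{r, eval=F}
# Hypothetical sketch of R/regularized_ls.R -- solves (lambda*I + Phi'Phi) beta = Phi'y.
regularized_ls <- function(Phi, y, lambda) {
  p <- ncol(Phi)
  solve(lambda * diag(p) + crossprod(Phi), crossprod(Phi, y))
}
```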
As in @bishop2006pattern we proceed as follows for each choice of $\lambda$ and each sample draw to illustrate the bias-variance trade-off:
1. Draw $N=`r n`$ times from $\mathbf{u}_k \sim \text{unif}(`r a`,`r b`)$.
2. Let $\mathbf{X}_k^*=\mathbf{u}_k+\varepsilon_k$ with $\varepsilon \sim \mathcal{N}(0, `r v`)$.
3. Compute $\mathbf{y}_k^*=\sin(2\pi \mathbf{X}^*_k)$.
4. Extract features $\Phi_k$ from $\mathbf{X}_k^*$ and estimate the parameter vector $\beta_k^*(\Phi_k,\mathbf{y}^*_k,\lambda)$ through regularized least-squares.
5. Predict $\hat{\mathbf{y}}_k^*=\Phi \beta_k^*$.
```{r}
Phi <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
dt <- rbindlist(
lapply( # loop - draw K times
1:n_draws,
function(k) {
# Draw:
simulated <- sim_sinusoidal(n=n, sigma = v)
y_k <- simulated$y_star
x_k <- simulated$x_star
rbindlist(
lapply( # loop over regularization parameter
1:length(lambda),
function(t) {
# Regularization parameter:
lambda_t <- lambda[t]
# Extract features:
Phi_k <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x_k, mu=mu_p, s = s)
}
)
)
beta_hat <- regularized_ls(Phi_k,y_k,lambda_t) # fit model on (y,x)
y_hat <- c(Phi %*% beta_hat) # predict from model
dt <- data.table(value=y_hat,draw=k,lambda=lambda_t,n=1:n,x=x)
return(dt)
}
)
)
}
)
)
dt[,facet_group:="single"]
dt[,colour_group:="estimates"]
# Expected values:
dt_exp = dt[,.(value=mean(value)),by=.(lambda,n,x)]
dt_exp[,facet_group:="aggregate"]
dt_exp[,colour_group:="estimates"]
dt_exp[,draw:=1] # just for aesthetics
# True values:
library(reshape)
dt_true = data.table(expand.grid.df(data.frame(value=y,x=x),data.frame(lambda=lambda)))
dt_true[,facet_group:="aggregate"]
dt_true[,colour_group:="true"]
dt_true[,draw:=2] # just for aesthetics
# Plot data:
dt_plot = rbind(
dt,
dt_exp,
dt_true,
fill=T
)
```
Applying the above procedure we can construct the familiar picture that demonstrates how increased model complexity increases variance while reducing bias (Figure \@ref(fig:plot-bias-var)). Recall that for the mean-squared error (MSE) we have
$$
\begin{equation}
\begin{aligned}
&& \mathbb{E} \left( (\hat{f}_n(x)-f(x))^2 \right)
&= \text{var} (\hat{f}_n(x)) + \left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2 \\
\end{aligned}
(\#eq:mse)
\end{equation}
$$
where the first term on the right-hand side corresponds to the variance of our prediction and the second term to its (squared) bias. In Figure \@ref(fig:plot-bias-var), as model complexity increases (i.e. as $\lambda$ decreases), the variance component of the MSE increases while the bias term diminishes. A similar pattern would have been observed if, instead of using regularization, we had used OLS and let the number of Gaussian kernels (and hence the number of features $p$) vary, where higher values of $p$ correspond to increased model complexity.
```{r plot-bias-var, fig.cap="Bias-variance trade-off"}
dt_plot[,log_lambda := log(lambda)]
ggplot(data=dt_plot[draw<=25], aes(y=value, x=x, colour=colour_group, group=draw)) +
geom_line() +
facet_grid(
rows = vars(log_lambda),
cols = vars(facet_group)
) +
scale_color_discrete(
name="Type:"
) +
labs(
y = "f(x)"
)
```
```{r sim-change-n}
n <- 100
m <- c(0.10, 0.5, 0.90)
x <- seq(a,b,length.out = n)
y <- sinusoidal(x)
lambda <- exp(-0.31)
Phi <- cbind(
1,
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
dt <- rbindlist(
lapply( # loop - draw K times
1:n_draws,
function(k) {
rbindlist(
lapply( # loop over sample sizes:
1:length(m),
function(t) {
n_sample <- m[t] * n
# Draw:
simulated <- sim_sinusoidal(n=n_sample, sigma = v)
y_k <- simulated$y_star
x_k <- simulated$x_star
# Extract features:
Phi_k <- cbind(
1,
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x_k, mu=mu_p, s = s)
}
)
)
beta_hat <- regularized_ls(Phi_k,y_k,lambda) # fit model on (y,x)
y_hat <- c(Phi %*% beta_hat) # predict from model
dt <- data.table(value=y_hat,draw=k,sample_size=n_sample,n=1:n,x=x)
return(dt)
}
)
)
}
)
)
dt[,facet_group:="single"]
dt[,colour_group:="estimates"]
# Expected values:
dt_exp = dt[,.(value=mean(value)),by=.(sample_size,n,x)]
dt_exp[,facet_group:="aggregate"]
dt_exp[,colour_group:="estimates"]
dt_exp[,draw:=1] # just for aesthetics
# True values:
library(reshape)
dt_true = data.table(expand.grid.df(data.frame(value=y,x=x),data.frame(sample_size=m*n)))
dt_true[,facet_group:="aggregate"]
dt_true[,colour_group:="true"]
dt_true[,draw:=2] # just for aesthetics
# Plot data:
dt_plot = rbind(
dt,
dt_exp,
dt_true,
fill=T
)
```
The focus of this note is instead on varying the sample size $n$. It should not be surprising that both the variance and bias component of the MSE decrease as the sample size $n$ increases (Figure \@ref(fig:plot-bias-var-n)). But in today's world $n$ can potentially be very large, so much so that even computing simple linear models can be hard. Suppose for example you wanted to use patient data that is generated in real-time as a global pandemic unfolds to predict the trajectory of said pandemic. Or consider the vast quantities of potentially useful user-generated data that online service providers have access to. In the remainder of this note we will investigate how systematic subsampling can help improve model accuracy in these situations.
```{r plot-bias-var-n, fig.cap="Bias-variance trade-off for varying sample sizes"}
ggplot(data=dt_plot[draw<=25], aes(y=value, x=x, colour=colour_group, group=draw)) +
geom_line() +
facet_grid(
rows = vars(sample_size),
cols = vars(facet_group)
) +
scale_color_discrete(
name="Type:"
) +
labs(
y = "f(x)"
)
```
# Subsampling methods {#subsampling}
```{r}
source("R/UNIF.R")
source("R/BLEV.R")
source("R/OPT.R")
source("R/PL.R")
source("R/wls_qr.R")
source("R/rorthogonal.R")
source("R/sim_subsampling.R")
n <- 1000
m <- 10
```
The case for subsampling generally involves $n >> p$, so very large values of $n$. In such cases we may be interested in estimating $\hat\beta_m$ instead of $\hat\beta_n$ where $p\le m<<n$ with $m$ freely chosen by us. In practice we may want to do this to avoid high computational costs associated with large $n$ as discussed above. The basic algorithm for estimating $\hat\beta_m$ is simple:
1. Subsample with replacement from the data with some sampling probability $\{\pi_i\}^n_{i=1}$.
2. Estimate the least-squares estimator $\hat\beta_m$ using the subsample.
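Ignoring for the moment how the probabilities are chosen, a minimal sketch of this two-step procedure might look as follows (`subsample_fit` is a hypothetical helper, not one of the sourced functions):
```{r, eval=F}
# Hypothetical sketch of the basic subsample-then-fit loop:
# draw m row indices with replacement using probabilities pi, then fit OLS.
subsample_fit <- function(X, y, m, pi = rep(1/nrow(X), nrow(X))) {
  idx <- sample(nrow(X), size = m, replace = TRUE, prob = pi)
  qr.solve(X[idx, , drop = FALSE], y[idx])
}
```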
But there are at least two questions about this algorithm: firstly, how do we choose $\mathbf{X}_m=({\mathbf{X}^{(1)}}^T,...,{\mathbf{X}^{(m)}}^T)^T$? Secondly, how should we construct $\hat\beta_m$? With respect to the former, a better idea than just randomly selecting $\mathbf{X}_m$ might be to choose observations with high influence. We will look at a few of the different subsampling methods investigated and proposed in @zhu2015optimal, which differ primarily in their choice of subsampling probabilities $\{\pi_i\}^n_{i=1}$:
1. Uniform subsampling (UNIF): $\{\pi_i\}^n_{i=1}=1/n$.
2. Basic leveraging (BLEV): $\{\pi_i\}^n_{i=1}=h_{ii}/ \text{tr}(\mathbf{H})=h_{ii}/p$ where $\mathbf{H}$ is the *hat matrix*.
3. Optimal (OPT) and predictor-length sampling (PL): involving $||\mathbf{X}_i||/ \sum_{j=1}^{n}||\mathbf{X}_j||$ where $||\mathbf{X}||$ denotes the $L_2$ norm of $\mathbf{X}$.
The two methods involving predictor lengths are proposed by the authors, with OPT shown to be optimal (more on this below). PL subsampling is shown to scale very well and to be a good approximation of optimal subsampling, provided the leverage scores $h_{ii}$ are fairly homogeneous.
With respect to the second question, @zhu2015optimal investigate both ordinary least-squares (OLS) and weighted least-squares (WLS), where the weights are determined by the subsampling probabilities $\{\pi_i\}^n_{i=1}$. The authors present empirical evidence that OLS is more efficient than WLS in the sense that the mean-squared error (MSE) for predicting $\mathbf{X} \beta$ is lower for OLS. They also note, however, that OLS is not consistent for non-uniform subsampling methods, meaning that the bias cannot be controlled. Given Equation \@ref(eq:mse), the fact that OLS is nonetheless more efficient than WLS implies that the higher variance associated with WLS dominates the relatively higher bias of OLS. This is in fact consistent with the theoretical results presented in @zhu2015optimal (more on this below).
Next we will briefly run through different estimation and subsampling methods in some more detail and see how they can be implemented in R. In the following [section](#lin-reg) we will then look at how the different approaches perform empirically.
## OLS and WLS
Both OLS and WLS are implemented here using QR decomposition. For OLS this is very easily done in R: given some feature matrix `X` and a corresponding outcome variable `y` we can use `qr.solve(X, y)` to compute $\hat\beta$. For WLS we need to first weigh observations by their corresponding subsampling probabilities. Following @zhu2015optimal we can construct a weighting matrix $\Phi= \text{diag}\{\pi_i\}^m_{i=1}$ and compute the weighted least-squares estimator as follows (see the [appendix](#app-wls) for the derivation):
$$
\begin{equation}
\begin{aligned}
&& \hat\beta_m^{WLS}&= \left( \mathbf{X}^T \Phi^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^T\Phi^{-1}\mathbf{y}\\
\end{aligned}
(\#eq:wls)
\end{equation}
$$
In R weighted least-squares can be implemented as follows
```{r class.source = "fold-show", code=readLines("R/wls_qr.R"), echo=T}
```
where, in order to implement the algorithm proposed in @zhu2015optimal, the weights we need to supply as function arguments are $w=\{1/\pi_i\}^m_{i=1}$. This follows from the following property of diagonal matrices:
$$
\begin{aligned}
&& \Phi^{-1}&= \begin{pmatrix}
\frac{1}{\phi_{11}} & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & \frac{1}{\phi_{mm}}
\end{pmatrix}
\\
\end{aligned}
$$
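For reference, a minimal sketch of a QR-based weighted least-squares routine is shown below: it assumes per-observation weights `w` (here $w_i=1/\pi_i$) and simply rescales the rows of `X` and `y` by $\sqrt{w_i}$ before calling `qr.solve()`. The actual `R/wls_qr.R` displayed above may differ in its details.
```{r, eval=F}
# Hypothetical sketch -- rescaling rows by sqrt(w) turns WLS into OLS on (Xw, yw),
# so that qr.solve() returns (X'WX)^(-1) X'Wy with W = diag(w).
wls_qr_sketch <- function(X, y, w) {
  Xw <- X * sqrt(w)   # multiplies row i of X by sqrt(w[i]) (column-wise recycling)
  yw <- y * sqrt(w)
  qr.solve(Xw, yw)
}
```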
## Uniform leveraging (UNIF)
A simple function for uniform leveraging in R is shown in the code chunk below. Note that to streamline the comparison of the different methods in the following [section](#lin-reg) the function takes an unused argument `weighted=F` which for the other subsampling methods can be used to determine whether OLS or WLS should be used. Of course, with uniform leveraging the weights are all identical and hence $\hat\beta^{OLS}=\hat\beta^{WLS}$ so the argument is passed to but not evaluated in `UNIF`.
```{r class.source = "fold-show", code=readLines("R/UNIF.R"), echo=T}
```
## Basic leveraging (BLEV)
The `UNIF` function can be extended easily to the case with basic leveraging (see code below). Note that in this case the `weighted` argument is evaluated.
```{r class.source = "fold-show", code=readLines("R/BLEV.R"), echo=T}
```
### A note on computing leverage scores
Recall that for the *hat matrix* we have
$$
\begin{equation}
\begin{aligned}
&& \mathbf{H}&=\Phi (\Phi^T \Phi)^{-1}\Phi^T \\
\end{aligned}
(\#eq:hat-mat)
\end{equation}
$$
where the diagonal elements $h_{ii}$ correspond to the leverage scores we're after. Following @zhu2015optimal we will use (compact) singular value decomposition to obtain $\mathbf{H}$ rather than computing \@ref(eq:hat-mat) directly. This has the benefit that there exist exceptionally stable numerical algorithms to compute SVD. To see how and why we can use SVD to obtain $\mathbf{H}$ see [appendix](#app-svd).
Clearly to get $h_{ii}$ we first need to compute $\mathbf{H}$ which in terms of computational costs is of order $\mathcal{O}(np^2)=\max(\mathcal{O}(np^2),\mathcal{O}(p^3))$. The fact that we use all $n$ rows of $\Phi$ to compute leverage scores even though we explicitly stated our goal was to only use $m$ observations may rightly seem like a bit of a paradox. This is why fast algorithms that approximate leverage scores have been proposed. We will not look at them specifically here mainly because the PL method proposed by @zhu2015optimal does not depend on leverage scores and promises to be computationally even more efficient.
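As a concrete illustration, with the compact SVD $\Phi=\mathbf{U}\mathbf{D}\mathbf{V}^T$ we have $\mathbf{H}=\mathbf{U}\mathbf{U}^T$ (see [appendix](#app-svd)), so the leverage scores are just the row sums of the squared left singular vectors. A minimal sketch (the function name is mine, not one of the sourced files):
```{r, eval=F}
# Leverage scores h_ii via compact SVD: H = U U', hence h_ii = sum_j U[i, j]^2.
leverage_scores <- function(X) {
  U <- svd(X)$u
  rowSums(U^2)
}

# Sanity check against the direct (more expensive) computation:
# X <- matrix(rnorm(50 * 4), nrow = 50)
# max(abs(leverage_scores(X) - diag(X %*% solve(crossprod(X)) %*% t(X))))
```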
## Predictor-length sampling (PL)
The basic characteristic of PL subsampling - choosing $\{\pi_i\}^n_{i=1}= ||\mathbf{X}_i||/ \sum_{j=1}^{n}||\mathbf{X}_j||$ - was already introduced above. Again it is very easy to modify the subsampling functions from above to this case:
```{r class.source = "fold-show", code=readLines("R/PL.R")}
```
### A note on optimal subsampling (OPT)
In fact, PL subsampling is an approximate version of optimal subsampling (OPT). @zhu2015optimal show that asymptotically we have:
$$
\begin{equation}
\begin{aligned}
&&\text{plim} \left( \text{var} (\hat{f}_n(x)) \right) > \text{plim} \left(\left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2 \right) \\
\end{aligned}
(\#eq:plim)
\end{equation}
$$
Given this result minimizing the MSE (Equation \@ref(eq:mse)) with respect to subsampling probabilities $\{\pi_i\}^n_{i=1}$ corresponds to minimizing $\text{var} (\hat{f}_n(x))$. They further show that this minimization problem has the following closed-form solution:
$$
\begin{equation}
\begin{aligned}
&& \pi_i&= \frac{\sqrt{(1-h_{ii})}||\mathbf{X}_i||}{\sum_{j=1}^n\sqrt{(1-h_{jj})}||\mathbf{X}_j||}\\
\end{aligned}
(\#eq:opt)
\end{equation}
$$
This still has computational costs of order $\mathcal{O}(np^2)$. But it should now be clear why PL subsampling is optimal conditional on leverage scores being homogeneous (see [appendix](#app-pl)). PL subsampling is associated with computational costs of order $\mathcal{O}(np)$, so a potentially massive improvement. The code for optimal subsampling is shown below:
```{r class.source = "fold-show", code=readLines("R/OPT.R")}
```
### A note on computing predictor lengths
Computing the Euclidean norms $||\mathbf{X}_i||$ in R can be done explicitly by looping over the rows of $\mathbf{X}$ and computing the norm in each iteration. It turns out that this is computationally very expensive. A much more efficient way of computing the vector of predictor lengths is as follows
$$
\begin{equation}
\begin{aligned}
&& \mathbf{pl}&=\sqrt{\mathbf{X}^2 \mathbf{1}} \\
\end{aligned}
(\#eq:norm)
\end{equation}
$$
where $\mathbf{X}^2$ indicates *elements squared*, the square root is also taken *element-wise* and $\mathbf{1}$ is a $(p \times 1)$ vector of ones. A performance benchmark of the two approaches is shown in Figure \@ref(fig:mbm-norm) below.
```{r mbm-norm, fig.cap="Benchmark of Euclidean norm computations."}
n <- 200
p <- 100
library(microbenchmark)
X <- matrix(rnorm(n*p),nrow=n)
check_for_equality <- function(values) {
tol <- 1e-12
error <- max(abs(values[[1]] - values[[2]]))
error < tol
}
mb <- microbenchmark(
"Loop" = {sapply(1:nrow(X), function(i) norm(as.matrix(X[i,]), type="f"))},
"Matrix product" = {sqrt(X**2 %*% rep(1,ncol(X)))},
check = check_for_equality
)
autoplot(mb)
```
## Comparison of methods
```{r}
n <- 1000
p <- 100
```
As discussed in @zhu2015optimal both OPT and PL subsampling tend to inflate subsampling probabilities of observations with low leverage scores and shrink those of high-leverage observations relative to BLEV. They show explicitly that this always holds for orthogonal design matrices. As a quick sense-check of the functions introduced above we can generate a random orthogonal design matrix $\mathbf{X}$ and plot subsampling probabilities with OPT and PL against those obtained with BLEV. Figure \@ref(fig:comp-methods) illustrates this relationship nicely. The design matrix $\mathbf{X}$ $(n \times p)$ with $n=`r n`$ and $p=`r p`$ was generated using SVD:
```{r class.source = "fold-show", code=readLines("R/rorthogonal.R")}
```
```{r comp-methods, fig.cap="Comparison of subsampling probabilities.", fig.height=3, fig.width=6}
X <- rorthogonal(n,p)
# Subsampling methods:
methods <- list(
"UNIF" = UNIF,
"BLEV" = BLEV,
"OPT" = OPT,
"PL" = PL
)
smpl_probs <- rbindlist(
lapply(
names(methods)[2:4],
function(i) {
method <- i
prob <- methods[[method]](X, y=NULL, m=NULL, prob_only=T)
return(data.table(prob=prob, method=method))
}
)
)
dt_plot <- rbindlist(
lapply(
names(methods)[3:4],
function(i) {
other_method <- i
x <- smpl_probs[method=="BLEV"]$prob
y <- smpl_probs[method==other_method]$prob
data.table(x=x, y=y, y_var=other_method)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_grid(
cols=vars(y_var)
) +
labs(
x="BLEV",
y=""
)
```
# Linear regression model {#lin-reg}
## A review of @zhu2015optimal
To illustrate the improvements associated with the methods proposed in @zhu2015optimal, we will briefly replicate their main empirical findings here. To evaluate the performance of the different methods we will proceed as follows:
**Empirical exercise**
1. Generate synthetic data $\mathbf{X}$ of dimension $(n \times p)$ with $n>>p$.
2. Set the true model parameter to $\beta=(\mathbf{1}^T_{\lceil 0.6p \rceil}, 0.1 \cdot \mathbf{1}^T_{\lfloor 0.4p \rfloor})^T$.
3. Model the outcome variable as $\mathbf{y}=\mathbf{X}\beta+\epsilon$ where $\epsilon \sim \mathcal{N}(\mathbf{0},\sigma^2 \mathbf{I}_n)$ and $\sigma=10$.
4. Estimate the full-sample OLS estimator $\hat\beta_n$ (a benchmark estimator of sorts in this setting).
5. Use one of the subsampling methods to estimate iteratively $\{\hat\beta^{(b)}_m\}^B_{b=1}$. Note that all subsampling methods are stochastic so $\hat\beta_m$ varies across iterations.
6. Evaluate the average model performance of $\hat\beta_m$ under the mean-squared error criterion: $MSE= \frac{1}{B} \sum_{b=1}^{B} MSE^{(b)}$ where $MSE^{(b)}$ corresponds to the in-sample estimator of the mean-squared error of the $b$-th iteration.
This exercise - and the ones that follow - are computationally expensive. For them to be feasible we have to refrain from bias-correcting the in-sample estimator of the MSE (see the [appendix](#app-session) for session info). We will use a simple [wrapper function](R/sim_subsampling.R) to implement the empirical exercise in R.
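As a rough illustration of the kind of loop that wrapper implements, the sketch below repeatedly subsamples, refits, and decomposes the MSE of the fitted values against the full-sample OLS fit $\mathbf{X}\hat\beta_n$. This is a sketch under my own assumptions about the estimator interface, the benchmark and the output labels; the actual `R/sim_subsampling.R` may differ in all of these respects.
```{r, eval=F}
# Hypothetical sketch of a sim_subsampling-style wrapper -- the actual
# R/sim_subsampling.R may differ in its interface, benchmark and output names.
sim_subsampling_sketch <- function(X, y, m, estimator, weighted = FALSE, B = 100) {
  beta_n <- qr.solve(X, y)                  # full-sample OLS benchmark
  f_n <- c(X %*% beta_n)
  fits <- sapply(1:B, function(b) {
    beta_m <- estimator(X, y, m = m, weighted = weighted)  # assumed interface
    c(X %*% beta_m)
  })
  f_bar <- rowMeans(fits)                   # expected prediction across draws
  variance <- mean(apply(fits, 1, var))     # average pointwise variance
  sq_bias <- mean((f_bar - f_n)^2)          # average squared bias vs. benchmark
  data.table(
    variables = c("mse", "squared_bias", "variance"),
    value = c(variance + sq_bias, sq_bias, variance)
  )
}
```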
```{r, include=F}
n <- 1000
p <- 3
```
As in @zhu2015optimal we will generate the design matrix $\mathbf{X}$ from 5 different distributions: 1) Gaussian (GA) with $\mathcal{N}(\mathbf{0},\Sigma)$; 2) Mixed-Gaussian (MG) with $0.5\mathcal{N}(\mathbf{0},\Sigma)+0.5\mathcal{N}(\mathbf{0},25\Sigma)$; 3) Log-Gaussian (LN) with $\log\mathcal{N}(\mathbf{0},\Sigma)$; 4) T-distribution with 1 degree of freedom (T1) and $\Sigma$; 5) T-distribution as in 4) but truncated at $[-p,p]$. All parameters are chosen in the same way as in @zhu2015optimal with the exception of $n=`r n`$ and $p=`r p`$, which are chosen significantly smaller in order to reduce computational costs. The corresponding densities of the 5 data sets are shown in Figure \@ref(fig:dens) in the [appendix](#app-dens).
We will run the empirical exercise for each data set and each subsampling method introduced above. Figure \@ref(fig:plot-smpl-prob) shows the logarithms of the sampling probabilities corresponding to the different subsampling methods (UNIF not shown for obvious reasons). The plots look very similar to the ones in @zhu2015optimal and are shown here primarily to reassure ourselves that we have implemented their ideas correctly. One interesting observation is worth pointing out however: note how the distributions for OPT and PL have lower standard deviations compared to BLEV. This should not be altogether surprising since we already saw above that for orthogonal design matrices the former methods inflate small leverage scores while shrinking high scores. But it is interesting to see that the same appears to hold for design matrices that are explicitly not orthogonal given our choice of $\Sigma$.
```{r zhu-data}
set.seed(1)
library(expm)
matrix_grid <- expand.grid(i=1:p,j=1:p)
Sigma <- matrix(rep(0,p^2),p,p)
for (x in 1:nrow(matrix_grid)) {
i <- matrix_grid$i[x]
j <- matrix_grid$j[x]
Sigma[i,j] <- 2 * (0.8)^(abs(i-j))
}
# 1.) Design matrix (as in Zhu et al): ----
GA <- matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Gaussian mixture:
gaus_mix <- list(
gaus_1 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma)),
gaus_2 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm((25 * t(Sigma)))
)
MG <- matrix(rep(0,n*p),n,p)
for (i in 1:nrow(MG)) {
x <- sample(1:2,1)
MG[i,] <- gaus_mix[[x]][i,]
}
# Log-Gaussian:
LN <- exp(GA)
# T-distribution:
T1 <- matrix(rt(n*p,1), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Truncated T:
TT <- T1
TT[TT>p] <- p
TT[TT<(-p)] <- -p
data_sets <- list(
GA = list(X = GA),
MG = list(X = MG),
LN = list(X = LN),
TT = list(X = TT),
T1 = list(X = T1)
)
# 2.) Outcome:
data_sets <- lapply(
data_sets,
function(i) {
X <- i[["X"]]
beta <- c(rep(1,ceiling(0.6*p)),rep(0.1,floor(0.4*p)))
eps <- rnorm(n=n,mean=0,sd=10)
y <- X %*% beta + eps
list(X=X, y=y)
}
)
saveRDS(data_sets, "data/synthetic_data.rds")
```
```{r smpl-prob, eval=F, include=F}
pgrid <- expand.grid(method = names(methods)[2:4], data=names(data_sets))
smpl_probs <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
method <- as.character(pgrid$method[i])
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- data_sets[[data_set]]$y
prob <- methods[[method]](X, y, m=NULL, prob_only=T)
return(data.table(prob=prob, method=method, data=data_set))
}
)
)
saveRDS(smpl_probs, "outputs/smpl_probs.rds")
```
```{r, eval=F, fig.width=10, fig.height=4, fig.cap="Logarithm of sampling probabilities."}
smpl_probs <- readRDS("outputs/smpl_probs.rds")
pgrid <- expand.grid(method = names(methods)[3:4], data=names(data_sets))
dt_plot <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
other_method <- as.character(pgrid$method[i])
data_set <- pgrid$data[i]
x <- smpl_probs[method=="BLEV" & data==data_set]$prob
y <- smpl_probs[method==other_method & data==data_set]$prob
data.table(x=x, y=y, y_var=other_method, data=data_set)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_wrap(
y_var ~ data,
scales = "free",
ncol=length(data_sets)
) +
labs(
x="BLEV",
y=""
)
```
```{r plot-smpl-prob, fig.height=3.5, fig.width=10, fig.cap="Sampling probabilities for different subsampling methods."}
smpl_probs <- readRDS("outputs/smpl_probs.rds")
ggplot(
data = smpl_probs,
aes(x=data, y=log(prob))
) +
geom_boxplot() +
facet_grid(
col=vars(method)
)
```
```{r}
# Simulation parameters:
m <- c()
counter <- 1
while(length(m)<7 & max(m/n)<.1) {
temp <- max(p,(2^(counter)*1e-3) * n)
m <- unique(c(m, temp))
counter <- counter + 1
}
weighted <- c(T,F)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
```
```{r,eval=F}
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- data_sets[[data_set]]$y
output <- sim_subsampling(X, y, m, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("OLS","WLS")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_zhu.rds")
```
```{r}
grid_search <- readRDS("outputs/grid_search_zhu.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m/n, y=log_value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling:"
) +
labs(
x="m/n",
y="Logarithm",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
```
Figures \@ref(fig:wls-zhu) and \@ref(fig:ols-zhu) show the resulting MSE, squared bias and variance for the different subsampling methods and data sets using weighted least-squares and ordinary least-squares, respectively. The subsampling size increases along the horizontal axis. The figures are interactive to allow readers to zoom in, etc.
For the data sets that are also shown in @zhu2015optimal we find the same overall pattern: PL and OPT outperform other methods when using weighted least-squares, while BLEV outperforms other methods when using unweighted/ordinary least-squares.
For Gaussian data (GA) the differences between the methods are minimal since data points are homogeneous. A similar picture emerges when running the method comparison for the sinusoidal data introduced above (see [appendix](#app-sin)). In fact, @zhu2015optimal recommend relying on uniform subsampling when the data is Gaussian. Another interesting observation is that for t-distributed data (T1) the non-uniform subsampling methods significantly outperform uniform subsampling. This is despite the fact that in the case of T1 data the conditions used to establish asymptotic consistency of the non-uniform subsampling methods in @zhu2015optimal are not fulfilled: in particular, the fourth moment is not finite (in fact it is not defined).
```{r wls-zhu, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted least-squares."}
library(plotly)
layout_ggplotly <- function(gg, x = -0.075, y = -0.025, x_legend=1.05, y_legend=0.95, mar= list(l=50,r=150,b=50)){
# The 1 and 2 goes into the list that contains the options for the x and y axis labels respectively
gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
gg[['x']][['layout']][['annotations']][[11]][['x']] <- x_legend
gg[['x']][['layout']][['legend']][['y']] <- y_legend
gg[['x']][['layout']][['legend']][['x']] <- x_legend
gg %>% layout(margin = mar)
}
gg <- ggplotly(p_list$WLS)
layout_ggplotly(gg)
```
```{r ols-zhu, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary least-squares."}
gg <- ggplotly(p_list$OLS)
layout_ggplotly(gg)
```
## Computational performance
```{r}
n <- 200
p <- 100
```
We have already seen above that theoretically speaking both BLEV and OPT subsampling are computationally more expensive than PL subsampling (with UNIF subsampling the least expensive). It should be obvious that in light of their computational costs $\mathcal{O}(np^2)$ the former two methods do not scale well in higher-dimensional problems (higher $p$). @zhu2015optimal demonstrate this through empirical exercises to an extent that is beyond the scope of this note. Instead we will just quickly benchmark the different functions for non-uniform subsampling introduced above: `BLEV`, `OPT`, `PL` for $n=`r n`$, $p=`r p`$. We are only interested in how long it takes to compute subsampling probabilities, and since for `UNIF` all subsampling probabilities are simply $\pi_i=1/n$ we neglect this here. Figure \@ref(fig:mbm-methods) benchmarks the three non-uniform subsampling methods. Evidently PL subsampling is computationally much less costly.
```{r mbm-methods, fig.cap="Benchmark of computational performance of different methods."}
library(microbenchmark)
X <- matrix(rnorm(n*p),nrow=n)
y <- rnorm(n)
mb <- microbenchmark(
"BLEV" = {BLEV(X,y,prob_only = T)},
"OPT" = {OPT(X,y,prob_only = T)},
"PL" = {PL(X,y,prob_only = T)}
)
autoplot(mb)
```
# Classification problems
```{r}
source("R/undersampling.R")
source("R/sim_undersampling.R")
source("R/logit_irls.R")
```
In binary classification problems we are often faced with the issue of imbalanced training data: one of the two classes is under-represented relative to the other. This generally makes classifiers less sensitive to the minority class, which is often the class we want to predict [@branco2015survey]. Suppose for example we wanted to predict the probability of death for patients who suffer from COVID-19. The case-fatality rate for the virus is significantly lower than 10%, so any data we could obtain on patients would inevitably be imbalanced: the domain is skewed towards the class we are not interested in predicting.
A common and straightforward way to deal with this issue is to randomly over- or under-sample the training data. Let $y_i\in\{0,1\}$ for all $i=1,...,n$. We are interested in modelling $p_i=P(y_i=1|x_i)$ but our data is imbalanced: $n_{y=0}>>n_{y=1}$ where $n=n_{y=0}+n_{y=1}$. Then random over- and under-sampling works as follows:
1. *Random oversampling*: draw $y_i$ from the minority class $\mathbf{y}_{n_{y=1}}$ with probability $\{\pi_i\}^{n_{y=1}}_{i=1}=1/n_{y=1}$ and append $y_i$ to $\mathbf{y}_{n_{y=1}}$ so that $n_{y=1} \leftarrow n_{y=1}+1$ until $n_{y=0}=n_{y=1}$.
2. *Random undersampling*: draw $y_i$ from the majority class $\mathbf{y}_{n_{y=0}}$ with probability $\{\pi_i\}^{n_{y=0}}_{i=1}=1/n_{y=0}$ and remove $y_i$ from $\mathbf{y}_{n_{y=0}}$ so that $n_{y=0} \leftarrow n_{y=0}-1$ until $n_{y=0}=n_{y=1}$.
In a way both of these methods correspond to the uniform subsampling (UNIF) discussed above. Random oversampling may lead to overfitting. Conversely, random undersampling may come at the cost of eliminating observations with valuable information. With respect to the latter, we have already shown that more systematic subsampling approaches generally outperform uniform subsampling in linear regression problems. It would be interesting to see if we can apply similar ideas to classification with imbalanced data. How exactly we can go about doing this is straightforward to see:
3. *Systematic undersampling*: draw $n_{y=1}$ times from the majority class $\mathbf{y}_{n_{y=0}}$ with probabilities $\{\pi_i\}^{n_{y=0}}_{i=1}$ defined by optimal subsampling methods and throw out the remainder.
To remain in the subsampling framework we will focus on undersampling here. It should be noted that many systematic approaches to undersampling already exist. In their extensive survey @branco2015survey mention undersampling methods based on distance-criteria, condensed nearest neighbours as well as active learning methods. Optimal subsampling is not mentioned in the survey.
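To make the idea of systematic undersampling concrete, the sketch below reduces the majority class down to the size of the minority class, either uniformly (`prob = NULL`) or with supplied probabilities. This is a hypothetical helper for illustration only; the code actually used in the exercises lives in `R/undersampling.R`.
```{r, eval=F}
# Hypothetical sketch: undersample the majority class (y == 0) down to the
# size of the minority class; prob = NULL corresponds to random undersampling.
undersample_majority <- function(X, y, prob = NULL) {
  idx_min <- which(y == 1)
  idx_maj <- which(y == 0)
  keep <- sample(idx_maj, size = length(idx_min), prob = prob)
  idx <- c(idx_min, keep)
  list(X = X[idx, , drop = FALSE], y = y[idx])
}
```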
## Optimal subsampling for classification problems
We cannot apply the methods used for subsampling with linear regression directly to classification problems, although we will see that certain ideas still apply. @wang2018optimal explore optimal subsampling for large sample logistic regression - a paper that is very much related to @zhu2015optimal. We will not cover the details here as much as we did for @zhu2015optimal and instead jump straight to one of the main results. @wang2018optimal propose a two-step algorithm that in the first step estimates a logistic regression with uniform subsampling using $m_0$ observations. In the second step the estimated coefficient $\beta^{(UNIF)}_{m_0}$ from the first step is used to compute subsampling probabilities as follows:
$$
\begin{equation}
\begin{aligned}
&& \pi_i&= \frac{|y_i-p_i(\beta^{(UNIF)}_{m_0})| \ ||\mathbf{X}_i||}{\sum_{j=1}^n|y_j-p_j(\beta^{(UNIF)}_{m_0})| \ ||\mathbf{X}_j||}\\
\end{aligned}
(\#eq:subs_mVc)
\end{equation}
$$
The probabilities are then used to subsample $m$ observations again. Finally, $\hat\beta_{m_0+m}$ is estimated from the combined subsamples with weighted logistic regression, where the weights once again correspond to inverse sampling probabilities. The authors refer to the method corresponding to \@ref(eq:subs_mVc) as *mVc* since it corresponds to optimal subsampling with variance targeting. They also propose a method which targets the MSE directly, but it comes at a higher computational cost: $\mathcal{O}(np^2)$ vs. $\mathcal{O}(np)$ for *mVc*. The similarities to the case of linear regression should be obvious. We will only consider *mVc* here and see if it can be applied to undersampling. The code that implements this can be found [here](R/undersampling.R) - it uses object-oriented programming so it could be interesting for some readers. @wang2018optimal never run the model with ordinary as opposed to weighted logit, possibly because ordinary logit is not sensible in this context. But for the sake of completeness we'll once again consider both cases here.
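Since the undersampling code is only linked, a sketch of how the *mVc* probabilities in \@ref(eq:subs_mVc) could be computed is shown below. The pilot size `m0`, the use of `glm()` without an intercept and the function name are my own illustrative assumptions, not necessarily how `R/undersampling.R` proceeds.
```{r, eval=F}
# Hypothetical sketch of mVc-style subsampling probabilities.
mvc_probabilities <- function(X, y, m0 = 200) {
  n <- nrow(X)
  # Step 1: pilot logit fit on a uniform subsample of size m0.
  pilot <- sample(n, size = m0)
  fit0 <- glm(y[pilot] ~ X[pilot, , drop = FALSE] - 1, family = "binomial")
  # Step 2: predicted probabilities under the pilot coefficients.
  p_hat <- 1 / (1 + exp(-c(X %*% coef(fit0))))
  # mVc weights: |y_i - p_i| * ||X_i||, normalised to sum to one.
  w <- abs(y - p_hat) * sqrt(rowSums(X^2))
  w / sum(w)
}
```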
## Synthetic data {#class-syn}
```{r}
n <- length(data_sets$GA$y)
prob <- 0.05
```
In this section we will use the same synthetic data sets as above for our design matrix $\mathbf{X}$. But while above we sampled $y_i |x_i\sim \mathcal{N}(\beta^T x_i,\sigma^2)$, here we will simulate a single draw from $\mathbf{y} |\mathbf{X} \sim \text{Bernoulli}(p)$, where, in order to create imbalance, we let $p<0.5$ vary. To model the probability of $y=1$ we will use logistic regression:
$$
\begin{equation}
\begin{aligned}
&& \mathbf{p}&= \frac{ \exp( \mathbf{X} \beta )}{1 + \exp(\mathbf{X} \beta)}
\end{aligned}
(\#eq:log-reg)
\end{equation}
$$
Equation \@ref(eq:log-reg) is not estimated directly but rather derived from linear predictions
$$
\begin{equation}
\begin{aligned}
\log \left( \frac{\mathbf{p}}{1-\mathbf{p}} \right)&= \mathbf{X} \beta \\
\end{aligned}
(\#eq:lin-pred)
\end{equation}
$$
where $\beta$ can be estimated through iteratively reweighted least-squares (IRLS), which is a simple implementation of Newton's method (see for example @wasserman2013all; a complete derivation can also be found in the [appendix](#irls)):
$$
\begin{equation}
\begin{aligned}
&& \beta_{s+1}&= \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W}z\\
\text{where}&& z&= \mathbf{X}\beta_{s} + \mathbf{W}^{-1} (\mathbf{y}-\mathbf{p}) \\
&& \mathbf{W}&= \text{diag}\{p_i(\beta_{s})(1-p_i(\beta_{s}))\}_{i=1}^n \\
\end{aligned}
(\#eq:irls)
\end{equation}
$$
In R this can be implemented from scratch as below. For the empirical exercises we will rely on `glm([formula], family="binomial")` which scales much better to higher dimensional problems than my custom function and also implements weighted logit.
```{r class.source = "fold-show", code=readLines("R/logit_irls.R")}
```
When thinking about repeating the empirical exercise from Section \@ref(lin-reg) above, we are faced with the question of how to compute the MSE and its bias-variance decomposition. One idea could be to compare the linear predictions in \@ref(eq:lin-pred) in the same way as we did before, since after all these are fitted values from a linear model. The problem with that idea is that rebalancing the data through undersampling affects the intercept term $\beta_0$ (potentially quite a bit if the data was originally very imbalanced). This is not surprising since $\beta_0$ measures the log odds of $y_i$ given that all predictors are zero ($\mathbf{X}_i=\mathbf{0}$). Hence comparing linear predictors $\mathbf{X}\beta_n$ and $\mathbf{X}\beta_m$ is not a good idea in this setting (perhaps ever?). Fortunately though we can just work with the predicted probabilities in \@ref(eq:log-reg) directly and decompose the MSE in the same way as before (see @manning2008introduction, p. 310):
$$
\begin{equation}
\begin{aligned}
&& \mathbb{E} \left( (\hat{\mathbf{p}}-\mathbf{p})^2 \right)
&= \text{var} (\hat{\mathbf{p}}) + \left( \mathbb{E} \left( \hat{\mathbf{p}} \right) - \mathbf{p} \right)^2 \\
\end{aligned}
(\#eq:mse-prob)
\end{equation}
$$
Hence the empirical exercise in this section is very similar to the one above and can be summarized as follows:
**Empirical exercise**
1. Generate synthetic data $\mathbf{X}$ of dimension $(n \times p)$ with $n>>p$.
2. Randomly generate the binary outcome variable as $\mathbf{y} | \mathbf{X} \sim \text{Bernoulli}(p)$ where, as above, the minority-class share $p<0.5$ is varied.
3. Estimate the full-sample logit estimator $\hat\beta_n$.
4. Use one of the undersampling methods to estimate iteratively $\{\hat\beta^{(b)}_m\}^B_{b=1}$.
5. Evaluate the average model performance of $\hat\beta_m$ under the mean-squared error criterion: $MSE= \frac{1}{B} \sum_{b=1}^{B} MSE^{(b)}$ where $MSE^{(b)}$ corresponds to the in-sample estimator of the mean-squared error of the $b$-th iteration.
There is a decisive difference between this exercise and the one we ran for linear regression models above: here we are mechanically reducing the number of observations of the majority class and with non-uniform subsampling we are selective about what observations we keep. This is very different to simply subsampling without imposing a structure on the subsample. Intuitively it may well be that in this context non-uniform subsampling is prone to produce estimates of $\hat\beta_m$ that are very different from $\hat\beta_n$. This effect could be expected to appear in the squared bias term of the MSE.
The resulting mean-squared error decompositions are shown in Figures \@ref(fig:wls-logit) and \@ref(fig:ols-logit) for weighted and ordinary logit. A first interesting and reassuring observation is that for both ordinary and weighted logit the variance component is generally reduced more with *mVc* subsampling (although this is much less pronounced for weighted logit). A second observation is that the squared bias term can evidently not be controlled with *mVc*, which leads to the overall mean-squared errors generally being higher than with random undersampling. This could be evidence that selective subsampling in fact introduces bias as speculated above, but this is not clear and would have to be explored more thoroughly.
*It could also very well be that the algorithm in @wang2018optimal was not implemented correctly here and the results are therefore incorrect. Unfortunately I have not had time to fully replicate that second paper as well.*
```{r}
# Simulation parameters:
data_sets <- tryCatch(
data_sets,
error = function(e) {
readRDS("data/synthetic_data.rds")
}
)
n <- 1000
m <- 2^(3:7)*1e-3
weighted <- c(T,F)
methods <- list(
"UNIF" = UNIF,
"mVc" = mVc
)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
```
```{r, eval=F}
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- sample(c(1,0),size=n,rep=T,prob = c(m,1-m))
vars <- undersampler(X,y) # undersampler instance
output <- sim_undersampling(vars, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("Logit","Weighted Logit")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_logit.rds")
```
```{r}
grid_search <- readRDS("outputs/grid_search_logit.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m, y=value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling:"
) +
labs(
x="Proportion of sample",
y="Value",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
```
```{r wls-logit, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted logistic regression. MSE and its components correspond to the predicted probabilities of the logistic regression."}
gg <- ggplotly(p_list$`Weighted Logit`)
layout_ggplotly(gg, y=-0.05)
```
```{r ols-logit, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary logistic regression. MSE and its components correspond to the predicted probabilities of the logistic regression."}
gg <- ggplotly(p_list$Logit)
layout_ggplotly(gg, y=-0.05)
```
## Real data example
Until now we have merely looked at approximations of $\hat\beta_n$ for logistic regression with imbalanced data. But ultimately the goal in classification problems is to accurately predict $\mathbf{y}$ from test data. To investigate this point we will next turn to real data.
```{r, eval=F}
dt <- fread("data/latestdata.csv")
dt <- dt[,.(ID, age, sex, latitude, longitude, outcome)]
for(col in names(dt)) {
set(dt, i=which(dt[[col]]==""), j=col, value=NA)
}
dt <- na.omit(dt)
dt[,patient_id:=.I][,ID:=NULL]
# Preprocess outcome variable:
death_states <- unique(dt$outcome)[grepl("die",unique(dt$outcome), ignore.case = T) |
grepl("dea",unique(dt$outcome), ignore.case = T) |
grepl("decea",unique(dt$outcome), ignore.case = T)]
dt[,y:=ifelse(outcome %in% death_states,1,0)][,outcome:=NULL]
# Preprocess features:
dt[,age:=mean(as.numeric(unlist(strsplit(age,"-")))),by=patient_id]