---
title: "Optimal subsampling applied to imbalanced learning"
subtitle: "Model selection"
author: "Patrick Altmeyer"
date: "`r format(Sys.Date(), '%d %B, %Y')`"
output:
  bookdown::html_document2:
    code_folding: hide
    number_sections: true
    toc: true
    toc_float: true
bibliography: bib.bib
---
```{r setup, include=F}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE, message = F, warning = F)
```
**Note**: *By default most code chunks are hidden. "Code" buttons can be used to unhide the code.*
This note investigates whether and how systematic subsampling can be applied to imbalanced learning. The structure is as follows: to set the stage for the remainder of the analysis, the first [section](#bias) briefly introduces the bias-variance trade-off. The following [section](#subsampling) then introduces various subsampling methods. Section \@ref(lin-reg) illustrates the improvements associated with non-uniform subsampling. The final two sections apply the developed ideas to binary classification problems with imbalanced training data. Section \@ref(concl) concludes.
# Bias-variance tradeoff {#bias}
```{r}
source("R/gauss_kernel.R")
source("R/sinusoidal.R")
source("R/sim_sinusoidal.R")
source("R/regularized_ls.R")
# Generate data: ----
n <- 25
p <- 24
a <- 0
b <- 1
x <- seq(a,b,length.out = n)
y <- sinusoidal(x)
v <- 0.3
```
To set the stage for the remainder of this note we will briefly revisit the bias-variance trade-off in this section. In particular we will illustrate the effect of varying the sample size $n$. Readers familiar with this topic may choose to skip this section.
As in @bishop2006pattern we consider synthetic data generated by the sinusoidal function $f(x)=\sin(2\pi x)$. To simulate random samples of $\mathbf{y}$ we sample $n$ input values from $\mathbf{X} \sim \text{unif}(0,1)$ and introduce a random noise component $\varepsilon \sim \mathcal{N}(0,`r v`)$. Figure \@ref(fig:p-sim) shows $\mathbf{y}$ along with random draws $\mathbf{y}^*_n$.
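The helper functions sourced at the top of this section are not displayed in the text. A minimal sketch of what `sinusoidal()` and `sim_sinusoidal()` might look like is given below; the argument names `a`, `b` and the way the noise enters are assumptions on my part, and the actual `R/sinusoidal.R` and `R/sim_sinusoidal.R` may differ in detail.
```{r, eval=F}
# Hypothetical sketch of the sourced helpers -- the actual R/ files may differ.
sinusoidal <- function(x) sin(2 * pi * x)

sim_sinusoidal <- function(n, a = 0, b = 1, sigma = 0.3) {
  u <- runif(n, min = a, max = b)                # uniform input draws
  x_star <- u + rnorm(n, mean = 0, sd = sigma)   # noisy inputs (as in the steps below)
  y_star <- sinusoidal(x_star)                   # sinusoidal response
  list(x_star = x_star, y_star = y_star)
}
```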
```{r p-sim, fig.cap="Sinusoidal function and random draws."}
# True data: ----
library(ggplot2)
library(data.table)
dt_true <- data.table(y,x)
pl <- ggplot(data=dt_true, aes(x=x, y=y)) +
geom_line()
# Simulate data: ----
n_draws <- 3
dt_star <- rbindlist(
lapply(
1:n_draws,
function(x) {
simulated <- sim_sinusoidal(n=n, sigma = v)
data.table(y=simulated$y_star, x=simulated$x_star, n=1:n, draw=x)
}
)
)
pl +
geom_point(
data = dt_star,
aes(x=x, y=y, colour=factor(draw))
) +
scale_colour_discrete(
name="Draw:"
)
```
```{r param-setup}
lambda <- c(
exp(2.6),
exp(-0.31),
exp(-2.4)
)
s <- 0.1
n_draws <- 100
mu <- seq(a,b,length.out = p)
```
Following @bishop2006pattern we will use a Gaussian linear model with Gaussian kernels $\exp(-\frac{(x_k-\mu_p)^{2}}{2s^2})$ as
$$
\begin{equation}
\begin{aligned}
&& \mathbf{y}|\mathbf{X}& \sim \mathcal{N} \left( \sum_{j=0}^{p} \phi_j(x)\beta_j, \ v \mathbb{I}_n \right) \\
\end{aligned}
(\#eq:model)
\end{equation}
$$
with $v=`r v`$ to estimate $\hat{\mathbf{y}}_k$ from random draws $\mathbf{X}_k$. We fix the number of kernels $p=`r p`$ (and hence the number of features $M=p+1=`r p+1`$) as well as the spatial scale $s=`r s`$. To vary the complexity of the model we use a form of regularized least-squares (*Ridge regression*) and let the regularization parameter $\lambda$ vary
$$
\begin{equation}
\begin{aligned}
&& \hat\beta&=(\lambda I + \Phi^T \Phi)^{-1}\Phi^Ty \\
\end{aligned}
(\#eq:reg-ls)
\end{equation}
$$
where high values of $\lambda$ in \@ref(eq:reg-ls) shrink parameter values towards zero. (Note that a choice $\lambda=0$ corresponds to the OLS estimator which is defined as long as $p \le n$.)
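The sourced `regularized_ls()` is likewise not shown in the text. A minimal sketch that solves \@ref(eq:reg-ls) directly is given below; the actual `R/regularized_ls.R` may be implemented differently (e.g. via a QR decomposition of an augmented system).
```{r, eval=F}
# Hypothetical sketch of R/regularized_ls.R -- solves (lambda*I + Phi'Phi) beta = Phi'y.
regularized_ls <- function(Phi, y, lambda) {
  p <- ncol(Phi)
  solve(lambda * diag(p) + crossprod(Phi), crossprod(Phi, y))
}
```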
As in @bishop2006pattern we proceed as follows for each choice of $\lambda$ and each sample draw to illustrate the bias-variance trade-off:
1. Draw $N=`r n`$ times from $\mathbf{u}_k \sim \text{unif}(`r a`,`r b`)$.
2. Let $\mathbf{X}_k^*=\mathbf{u}_k+\varepsilon_k$ with $\varepsilon \sim \mathcal{N}(0, `r v`)$.
3. Compute $\mathbf{y}_k^*=\sin(2\pi \mathbf{X}^*_k)$.
4. Extract features $\Phi_k$ from $\mathbf{X}_k^*$ and estimate the parameter vector $\beta_k^*(\Phi_k,\mathbf{y}^*_k,\lambda)$ through regularized least-squares.
5. Predict $\hat{\mathbf{y}}_k^*=\Phi \beta_k^*$.
```{r}
Phi <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
dt <- rbindlist(
lapply( # loop - draw K times
1:n_draws,
function(k) {
# Draw:
simulated <- sim_sinusoidal(n=n, sigma = v)
y_k <- simulated$y_star
x_k <- simulated$x_star
rbindlist(
lapply( # loop over regularization parameter
1:length(lambda),
function(t) {
# Regularization parameter:
lambda_t <- lambda[t]
# Extract features:
Phi_k <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x_k, mu=mu_p, s = s)
}
)
)
beta_hat <- regularized_ls(Phi_k,y_k,lambda_t) # fit model on (y,x)
y_hat <- c(Phi %*% beta_hat) # predict from model
dt <- data.table(value=y_hat,draw=k,lambda=lambda_t,n=1:n,x=x)
return(dt)
}
)
)
}
)
)
dt[,facet_group:="single"]
dt[,colour_group:="estimates"]
# Expected values:
dt_exp = dt[,.(value=mean(value)),by=.(lambda,n,x)]
dt_exp[,facet_group:="aggregate"]
dt_exp[,colour_group:="estimates"]
dt_exp[,draw:=1] # just for aesthetics
# True values:
library(reshape)
dt_true = data.table(expand.grid.df(data.frame(value=y,x=x),data.frame(lambda=lambda)))
dt_true[,facet_group:="aggregate"]
dt_true[,colour_group:="true"]
dt_true[,draw:=2] # just for aesthetics
# Plot data:
dt_plot = rbind(
dt,
dt_exp,
dt_true,
fill=T
)
```
Applying the above procedure we can construct the familiar picture that demonstrates how increased model complexity increases variance while reducing bias (Figure \@ref(fig:plot-bias-var)). Recall that for the mean-squared error (MSE) we have
$$
\begin{equation}
\begin{aligned}
&& \mathbb{E} \left( (\hat{f}_n(x)-f(x))^2 \right)
&= \text{var} (\hat{f}_n(x)) + \left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2 \\
\end{aligned}
(\#eq:mse)
\end{equation}
$$
where the first term on the right-hand side corresponds to the variance of our prediction and the second term to its (squared) bias. In Figure \@ref(fig:plot-bias-var), as model complexity increases (i.e. as $\lambda$ decreases), the variance component of the MSE increases while the bias term diminishes. A similar pattern would have been observed if, instead of using regularization, we had used OLS and let the number of Gaussian kernels (and hence the number of features $p$) vary, where higher values of $p$ correspond to increased model complexity.
```{r plot-bias-var, fig.cap="Bias-variance trade-off"}
dt_plot[,log_lambda := log(lambda)]
ggplot(data=dt_plot[draw<=25], aes(y=value, x=x, colour=colour_group, group=draw)) +
geom_line() +
facet_grid(
rows = vars(log_lambda),
cols = vars(facet_group)
) +
scale_color_discrete(
name="Type:"
) +
labs(
y = "f(x)"
)
```
```{r sim-change-n}
n <- 100
m <- c(0.10, 0.5, 0.90)
x <- seq(a,b,length.out = n)
y <- sinusoidal(x)
lambda <- exp(-0.31)
Phi <- cbind(
1,
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
dt <- rbindlist(
lapply( # loop - draw K times
1:n_draws,
function(k) {
rbindlist(
lapply( # loop over sample sizes:
1:length(m),
function(t) {
n_sample <- m[t] * n
# Draw:
simulated <- sim_sinusoidal(n=n_sample, sigma = v)
y_k <- simulated$y_star
x_k <- simulated$x_star
# Extract features:
Phi_k <- cbind(
1,
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x_k, mu=mu_p, s = s)
}
)
)
beta_hat <- regularized_ls(Phi_k,y_k,lambda) # fit model on (y,x)
y_hat <- c(Phi %*% beta_hat) # predict from model
dt <- data.table(value=y_hat,draw=k,sample_size=n_sample,n=1:n,x=x)
return(dt)
}
)
)
}
)
)
dt[,facet_group:="single"]
dt[,colour_group:="estimates"]
# Expected values:
dt_exp = dt[,.(value=mean(value)),by=.(sample_size,n,x)]
dt_exp[,facet_group:="aggregate"]
dt_exp[,colour_group:="estimates"]
dt_exp[,draw:=1] # just for aesthetics
# True values:
library(reshape)
dt_true = data.table(expand.grid.df(data.frame(value=y,x=x),data.frame(sample_size=m*n)))
dt_true[,facet_group:="aggregate"]
dt_true[,colour_group:="true"]
dt_true[,draw:=2] # just for aesthetics
# Plot data:
dt_plot = rbind(
dt,
dt_exp,
dt_true,
fill=T
)
```
The focus of this note is instead on varying the sample size $n$. It should not be surprising that both the variance and bias component of the MSE decrease as the sample size $n$ increases (Figure \@ref(fig:plot-bias-var-n)). But in today's world $n$ can potentially be very large, so much so that even computing simple linear models can be hard. Suppose for example you wanted to use patient data that is generated in real-time as a global pandemic unfolds to predict the trajectory of said pandemic. Or consider the vast quantities of potentially useful user-generated data that online service providers have access to. In the remainder of this note we will investigate how systematic subsampling can help improve model accuracy in these situations.
```{r plot-bias-var-n, fig.cap="Bias-variance trade-off for varying sample sizes"}
ggplot(data=dt_plot[draw<=25], aes(y=value, x=x, colour=colour_group, group=draw)) +
geom_line() +
facet_grid(
rows = vars(sample_size),
cols = vars(facet_group)
) +
scale_color_discrete(
name="Type:"
) +
labs(
y = "f(x)"
)
```
# Subsampling methods {#subsampling}
```{r}
source("R/UNIF.R")
source("R/BLEV.R")
source("R/OPT.R")
source("R/PL.R")
source("R/wls_qr.R")
source("R/rorthogonal.R")
source("R/sim_subsampling.R")
n <- 1000
m <- 10
```
The case for subsampling generally involves $n >> p$, so very large values of $n$. In such cases we may be interested in estimating $\hat\beta_m$ instead of $\hat\beta_n$ where $p\le m<<n$ with $m$ freely chosen by us. In practice we may want to do this to avoid high computational costs associated with large $n$ as discussed above. The basic algorithm for estimating $\hat\beta_m$ is simple:
1. Subsample with replacement from the data with some sampling probability $\{\pi_i\}^n_{i=1}$.
2. Estimate the least-squares estimator $\hat\beta_m$ using the subsample.
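Ignoring for the moment how the probabilities are chosen, a minimal sketch of this two-step procedure might look as follows (`subsample_fit` is a hypothetical helper, not one of the sourced functions):
```{r, eval=F}
# Hypothetical sketch of the basic subsample-then-fit loop:
# draw m row indices with replacement using probabilities pi, then fit OLS.
subsample_fit <- function(X, y, m, pi = rep(1/nrow(X), nrow(X))) {
  idx <- sample(nrow(X), size = m, replace = TRUE, prob = pi)
  qr.solve(X[idx, , drop = FALSE], y[idx])
}
```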
But there are at least two questions about this algorithm: firstly, how do we choose $\mathbf{X}_m=({\mathbf{X}^{(1)}}^T,...,{\mathbf{X}^{(m)}}^T)^T$? Secondly, how should we construct $\hat\beta_m$? With respect to the former, a better idea than just randomly selecting $\mathbf{X}_m$ might be to choose observations with high influence. We will look at a few of the different subsampling methods investigated and proposed in @zhu2015optimal, which differ primarily in their choice of subsampling probabilities $\{\pi_i\}^n_{i=1}$:
1. Uniform subsampling (UNIF): $\{\pi_i\}^n_{i=1}=1/n$.
2. Basic leveraging (BLEV): $\{\pi_i\}^n_{i=1}=h_{ii}/ \text{tr}(\mathbf{H})=h_{ii}/p$ where $\mathbf{H}$ is the *hat matrix*.
3. Optimal (OPT) and predictor-length sampling (PL): involving $||\mathbf{X}_i||/ \sum_{j=1}^{n}||\mathbf{X}_j||$ where $||\mathbf{X}||$ denotes the $L_2$ norm of $\mathbf{X}$.
The two methods involving predictor lengths are proposed by the authors, with OPT shown to be optimal (more on this below). PL subsampling is shown to scale very well and to be a good approximation of optimal subsampling, provided the leverage scores $h_{ii}$ are fairly homogeneous.
With respect to the second question, @zhu2015optimal investigate both ordinary least-squares (OLS) and weighted least-squares (WLS), where the weights are determined by the subsampling probabilities $\{\pi_i\}^n_{i=1}$. The authors present empirical evidence that OLS is more efficient than WLS in the sense that the mean-squared error (MSE) for predicting $\mathbf{X} \beta$ is lower for OLS. They also note, however, that OLS is not consistent for non-uniform subsampling methods, meaning that the bias cannot be controlled. Given Equation \@ref(eq:mse), the fact that OLS is nonetheless more efficient than WLS implies that the higher variance associated with WLS dominates the relatively higher bias of OLS. This is in fact consistent with the theoretical results presented in @zhu2015optimal (more on this below).
Next we will briefly run through different estimation and subsampling methods in some more detail and see how they can be implemented in R. In the following [section](#lin-reg) we will then look at how the different approaches perform empirically.
## OLS and WLS
Both OLS and WLS are implemented here using QR decomposition. For OLS this is very easily done in R: given some feature matrix `X` and a corresponding outcome variable `y` we can use `qr.solve(X, y)` to compute $\hat\beta$. For WLS we need to first weigh observations by their corresponding subsampling probabilities. Following @zhu2015optimal we can construct a weighting matrix $\Phi= \text{diag}\{\pi_i\}^m_{i=1}$ and compute the weighted least-squares estimator as follows (see the [appendix](#app-wls) for the derivation):
$$
\begin{equation}
\begin{aligned}
&& \hat\beta_m^{WLS}&= \left( \mathbf{X}^T \Phi^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^T\Phi^{-1}\mathbf{y}\\
\end{aligned}
(\#eq:wls)
\end{equation}
$$
In R weighted least-squares can be implemented as follows
```{r class.source = "fold-show", code=readLines("R/wls_qr.R"), echo=T}
```
where, in order to implement the algorithm proposed in @zhu2015optimal, the weights we need to supply as function arguments are $w=\{1/\pi_i\}^m_{i=1}$. This follows from the following property of diagonal matrices:
$$
\begin{aligned}
&& \Phi^{-1}&= \begin{pmatrix}
\frac{1}{\phi_{11}} & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & \frac{1}{\phi_{mm}}
\end{pmatrix}
\\
\end{aligned}
$$
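For reference, a minimal sketch of a QR-based weighted least-squares routine is shown below: it assumes per-observation weights `w` (here $w_i=1/\pi_i$) and simply rescales the rows of `X` and `y` by $\sqrt{w_i}$ before calling `qr.solve()`. The actual `R/wls_qr.R` displayed above may differ in its details.
```{r, eval=F}
# Hypothetical sketch -- rescaling rows by sqrt(w) turns WLS into OLS on (Xw, yw),
# so that qr.solve() returns (X'WX)^(-1) X'Wy with W = diag(w).
wls_qr_sketch <- function(X, y, w) {
  Xw <- X * sqrt(w)   # multiplies row i of X by sqrt(w[i]) (column-wise recycling)
  yw <- y * sqrt(w)
  qr.solve(Xw, yw)
}
```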
## Uniform leveraging (UNIF)
A simple function for uniform leveraging in R is shown in the code chunk below. Note that to streamline the comparison of the different methods in the following [section](#lin-reg) the function takes an unused argument `weighted=F` which for the other subsampling methods can be used to determine whether OLS or WLS should be used. Of course, with uniform leveraging the weights are all identical and hence $\hat\beta^{OLS}=\hat\beta^{WLS}$ so the argument is passed to but not evaluated in `UNIF`.
```{r class.source = "fold-show", code=readLines("R/UNIF.R"), echo=T}
```
## Basic leveraging (BLEV)
The `UNIF` function can be extended easily to the case with basic leveraging (see code below). Note that in this case the `weighted` argument is evaluated.
```{r class.source = "fold-show", code=readLines("R/BLEV.R"), echo=T}
```
### A note on computing leverage scores
Recall that for the *hat matrix* we have
$$
\begin{equation}
\begin{aligned}
&& \mathbf{H}&=\Phi (\Phi^T \Phi)^{-1}\Phi^T \\
\end{aligned}
(\#eq:hat-mat)
\end{equation}
$$
where the diagonal elements $h_{ii}$ correspond to the leverage scores we're after. Following @zhu2015optimal we will use (compact) singular value decomposition to obtain $\mathbf{H}$ rather than computing \@ref(eq:hat-mat) directly. This has the benefit that there exist exceptionally stable numerical algorithms to compute SVD. To see how and why we can use SVD to obtain $\mathbf{H}$ see [appendix](#app-svd).
Clearly to get $h_{ii}$ we first need to compute $\mathbf{H}$ which in terms of computational costs is of order $\mathcal{O}(np^2)=\max(\mathcal{O}(np^2),\mathcal{O}(p^3))$. The fact that we use all $n$ rows of $\Phi$ to compute leverage scores even though we explicitly stated our goal was to only use $m$ observations may rightly seem like a bit of a paradox. This is why fast algorithms that approximate leverage scores have been proposed. We will not look at them specifically here mainly because the PL method proposed by @zhu2015optimal does not depend on leverage scores and promises to be computationally even more efficient.
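As a concrete illustration, with the compact SVD $\Phi=\mathbf{U}\mathbf{D}\mathbf{V}^T$ we have $\mathbf{H}=\mathbf{U}\mathbf{U}^T$ (see [appendix](#app-svd)), so the leverage scores are just the row sums of the squared left singular vectors. A minimal sketch (the function name is mine, not one of the sourced files):
```{r, eval=F}
# Leverage scores h_ii via compact SVD: H = U U', hence h_ii = sum_j U[i, j]^2.
leverage_scores <- function(X) {
  U <- svd(X)$u
  rowSums(U^2)
}

# Sanity check against the direct (more expensive) computation:
# X <- matrix(rnorm(50 * 4), nrow = 50)
# max(abs(leverage_scores(X) - diag(X %*% solve(crossprod(X)) %*% t(X))))
```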
## Predictor-length sampling (PL)
The basic characteristic of PL subsampling - choosing $\{\pi_i\}^n_{i=1}= ||\mathbf{X}_i||/ \sum_{j=1}^{n}||\mathbf{X}_j||$ - was already introduced above. Again it is very easy to modify the subsampling functions from above to this case:
```{r class.source = "fold-show", code=readLines("R/PL.R")}
```
### A note on optimal subsampling (OPT)
In fact, PL subsampling is an approximate version of optimal subsampling (OPT). @zhu2015optimal show that asymptotically we have:
$$
\begin{equation}
\begin{aligned}
&&\text{plim} \left( \text{var} (\hat{f}_n(x)) \right) > \text{plim} \left(\left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2 \right) \\
\end{aligned}
(\#eq:plim)
\end{equation}
$$
Given this result minimizing the MSE (Equation \@ref(eq:mse)) with respect to subsampling probabilities $\{\pi_i\}^n_{i=1}$ corresponds to minimizing $\text{var} (\hat{f}_n(x))$. They further show that this minimization problem has the following closed-form solution:
$$
\begin{equation}
\begin{aligned}
&& \pi_i&= \frac{\sqrt{(1-h_{ii})}||\mathbf{X}_i||}{\sum_{j=1}^n\sqrt{(1-h_{jj})}||\mathbf{X}_j||}\\
\end{aligned}
(\#eq:opt)
\end{equation}
$$
This still has computational costs of order $\mathcal{O}(np^2)$. But it should now be clear why PL subsampling is optimal conditional on leverage scores being homogeneous (see [appendix](#app-pl)). PL subsampling is associated with computational costs of order $\mathcal{O}(np)$, so a potentially massive improvement. The code for optimal subsampling is shown below:
```{r class.source = "fold-show", code=readLines("R/OPT.R")}
```
### A note on computing predictor lengths
Computing the Euclidean norms $||\mathbf{X}_i||$ in R can be done explicitly by looping over the rows of $\mathbf{X}$ and computing the norm in each iteration. It turns out that this is computationally very expensive. A much more efficient way of computing the vector of predictor lengths is as follows
$$
\begin{equation}
\begin{aligned}
&& \mathbf{pl}&=\sqrt{\mathbf{X}^2 \mathbf{1}} \\
\end{aligned}
(\#eq:norm)
\end{equation}
$$
where $\mathbf{X}^2$ indicates *elements squared*, the square root is also taken *element-wise* and $\mathbf{1}$ is a $(p \times 1)$ vector of ones. A performance benchmark of the two approaches is shown in Figure \@ref(fig:mbm-norm) below.
```{r mbm-norm, fig.cap="Benchmark of Euclidean norm computations."}
n <- 200
p <- 100
library(microbenchmark)
X <- matrix(rnorm(n*p),nrow=n)
check_for_equality <- function(values) {
tol <- 1e-12
error <- max(abs(values[[1]] - values[[2]]))
error < tol
}
mb <- microbenchmark(
"Loop" = {sapply(1:nrow(X), function(i) norm(as.matrix(X[i,]), type="f"))},
"Matrix product" = {sqrt(X**2 %*% rep(1,ncol(X)))},
check = check_for_equality
)
autoplot(mb)
```
## Comparison of methods
```{r}
n <- 1000
p <- 100
```
As discussed in @zhu2015optimal both OPT and PL subsampling tend to inflate subsampling probabilities of observations with low leverage scores and shrink those of high-leverage observations relative to BLEV. They show explicitly that this always holds for orthogonal design matrices. As a quick sense-check of the functions introduced above we can generate a random orthogonal design matrix $\mathbf{X}$ and plot subsampling probabilities with OPT and PL against those obtained with BLEV. Figure \@ref(fig:comp-methods) illustrates this relationship nicely. The design matrix $\mathbf{X}$ $(n \times p)$ with $n=`r n`$ and $p=`r p`$ was generated using SVD:
```{r class.source = "fold-show", code=readLines("R/rorthogonal.R")}
```
```{r comp-methods, fig.cap="Comparison of subsampling probabilities.", fig.height=3, fig.width=6}
X <- rorthogonal(n,p)
# Subsampling methods:
methods <- list(
"UNIF" = UNIF,
"BLEV" = BLEV,
"OPT" = OPT,
"PL" = PL
)
smpl_probs <- rbindlist(
lapply(
names(methods)[2:4],
function(i) {
method <- i
prob <- methods[[method]](X, y=NULL, m=NULL, prob_only=T)
return(data.table(prob=prob, method=method))
}
)
)
dt_plot <- rbindlist(
lapply(
names(methods)[3:4],
function(i) {
other_method <- i
x <- smpl_probs[method=="BLEV"]$prob
y <- smpl_probs[method==other_method]$prob
data.table(x=x, y=y, y_var=other_method)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_grid(
cols=vars(y_var)
) +
labs(
x="BLEV",
y=""
)
```
# Linear regression model {#lin-reg}
## A review of @zhu2015optimal
To illustrate the improvements associated with the methods proposed in @zhu2015optimal, we will briefly replicate their main empirical findings here. To evaluate the performance of the different methods we will proceed as follows:
**Empirical exercise**
1. Generate synthetic data $\mathbf{X}$ of dimension $(n \times p)$ with $n>>p$.
2. Set the true model parameter to $\beta=(\mathbf{1}^T_{\lceil 0.6p \rceil}, 0.1 \cdot \mathbf{1}^T_{\lfloor 0.4p \rfloor})^T$.
3. Model the outcome variable as $\mathbf{y}=\mathbf{X}\beta+\epsilon$ where $\epsilon \sim \mathcal{N}(\mathbf{0},\sigma^2 \mathbf{I}_n)$ and $\sigma=10$.
4. Estimate the full-sample OLS estimator $\hat\beta_n$ (a benchmark estimator of sorts in this setting).
5. Use one of the subsampling methods to estimate iteratively $\{\hat\beta^{(b)}_m\}^B_{b=1}$. Note that all subsampling methods are stochastic so $\hat\beta_m$ varies across iterations.
6. Evaluate the average model performance of $\hat\beta_m$ under the mean-squared error criterion: $MSE= \frac{1}{B} \sum_{b=1}^{B} MSE^{(b)}$ where $MSE^{(b)}$ corresponds to the in-sample estimator of the mean-squared error of the $b$-th iteration.
This exercise - and the ones that follow - are computationally expensive. For them to be feasible we have to refrain from bias-correcting the in-sample estimator of the MSE (see the [appendix](#app-session) for session info). We will use a simple [wrapper function](R/sim_subsampling.R) to implement the empirical exercise in R.
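As a rough illustration of the kind of loop that wrapper implements, the sketch below repeatedly subsamples, refits, and decomposes the MSE of the fitted values against the full-sample OLS fit $\mathbf{X}\hat\beta_n$. This is a sketch under my own assumptions about the estimator interface, the benchmark and the output labels; the actual `R/sim_subsampling.R` may differ in all of these respects.
```{r, eval=F}
# Hypothetical sketch of a sim_subsampling-style wrapper -- the actual
# R/sim_subsampling.R may differ in its interface, benchmark and output names.
sim_subsampling_sketch <- function(X, y, m, estimator, weighted = FALSE, B = 100) {
  beta_n <- qr.solve(X, y)                  # full-sample OLS benchmark
  f_n <- c(X %*% beta_n)
  fits <- sapply(1:B, function(b) {
    beta_m <- estimator(X, y, m = m, weighted = weighted)  # assumed interface
    c(X %*% beta_m)
  })
  f_bar <- rowMeans(fits)                   # expected prediction across draws
  variance <- mean(apply(fits, 1, var))     # average pointwise variance
  sq_bias <- mean((f_bar - f_n)^2)          # average squared bias vs. benchmark
  data.table(
    variables = c("mse", "squared_bias", "variance"),
    value = c(variance + sq_bias, sq_bias, variance)
  )
}
```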
```{r, include=F}
n <- 1000
p <- 3
```
As in @zhu2015optimal we will generate the design matrix $\mathbf{X}$ from 5 different distributions: 1) Gaussian (GA) with $\mathcal{N}(\mathbf{0},\Sigma)$; 2) Mixed-Gaussian (MG) with $0.5\mathcal{N}(\mathbf{0},\Sigma)+0.5\mathcal{N}(\mathbf{0},25\Sigma)$; 3) Log-Gaussian (LN) with $\log\mathcal{N}(\mathbf{0},\Sigma)$; 4) T-distribution with 1 degree of freedom (T1) and $\Sigma$; 5) T-distribution as in 4) but truncated at $[-p,p]$. All parameters are chosen in the same way as in @zhu2015optimal with the exception of $n=`r n`$ and $p=`r p`$, which are chosen significantly smaller in order to reduce computational costs. The corresponding densities of the 5 data sets are shown in Figure \@ref(fig:dens) in the [appendix](#app-dens).
We will run the empirical exercise for each data set and each subsampling method introduced above. Figure \@ref(fig:plot-smpl-prob) shows the logarithms of the sampling probabilities corresponding to the different subsampling methods (UNIF not shown for obvious reasons). The plots look very similar to the ones in @zhu2015optimal and are shown here primarily to reassure ourselves that we have implemented their ideas correctly. One interesting observation is worth pointing out however: note how the distributions for OPT and PL have lower standard deviations compared to BLEV. This should not be altogether surprising since we already saw above that for orthogonal design matrices the former methods inflate small leverage scores while shrinking high scores. But it is interesting to see that the same appears to hold for design matrices that are explicitly not orthogonal given our choice of $\Sigma$.
```{r zhu-data}
set.seed(1)
library(expm)
matrix_grid <- expand.grid(i=1:p,j=1:p)
Sigma <- matrix(rep(0,p^2),p,p)
for (x in 1:nrow(matrix_grid)) {
i <- matrix_grid$i[x]
j <- matrix_grid$j[x]
Sigma[i,j] <- 2 * (0.8)^(abs(i-j))
}
# 1.) Design matrix (as in Zhu et al): ----
GA <- matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Gaussian mixture:
gaus_mix <- list(
gaus_1 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma)),
gaus_2 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm((25 * t(Sigma)))
)
MG <- matrix(rep(0,n*p),n,p)
for (i in 1:nrow(MG)) {
x <- sample(1:2,1)
MG[i,] <- gaus_mix[[x]][i,]
}
# Log-Gaussian:
LN <- exp(GA)
# T-distribution:
T1 <- matrix(rt(n*p,1), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Truncated T:
TT <- T1
TT[TT>p] <- p
TT[TT<(-p)] <- -p
data_sets <- list(
GA = list(X = GA),
MG = list(X = MG),
LN = list(X = LN),
TT = list(X = TT),
T1 = list(X = T1)
)
# 2.) Outcome:
data_sets <- lapply(
data_sets,
function(i) {
X <- i[["X"]]
beta <- c(rep(1,ceiling(0.6*p)),rep(0.1,floor(0.4*p)))
eps <- rnorm(n=n,mean=0,sd=10)
y <- X %*% beta + eps
list(X=X, y=y)
}
)
saveRDS(data_sets, "data/synthetic_data.rds")
```
```{r smpl-prob, eval=F, include=F}
pgrid <- expand.grid(method = names(methods)[2:4], data=names(data_sets))
smpl_probs <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
method <- as.character(pgrid$method[i])
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- data_sets[[data_set]]$y
prob <- methods[[method]](X, y, m=NULL, prob_only=T)
return(data.table(prob=prob, method=method, data=data_set))
}
)
)
saveRDS(smpl_probs, "outputs/smpl_probs.rds")
```
```{r, eval=F, fig.width=10, fig.height=4, fig.cap="Logarithm of sampling probabilities."}
smpl_probs <- readRDS("outputs/smpl_probs.rds")
pgrid <- expand.grid(method = names(methods)[3:4], data=names(data_sets))
dt_plot <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
other_method <- as.character(pgrid$method[i])
data_set <- pgrid$data[i]
x <- smpl_probs[method=="BLEV" & data==data_set]$prob
y <- smpl_probs[method==other_method & data==data_set]$prob
data.table(x=x, y=y, y_var=other_method, data=data_set)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_wrap(
y_var ~ data,
scales = "free",
ncol=length(data_sets)
) +
labs(
x="BLEV",
y=""
)
```
```{r plot-smpl-prob, fig.height=3.5, fig.width=10, fig.cap="Sampling probabilities for different subsampling methods."}
smpl_probs <- readRDS("outputs/smpl_probs.rds")
ggplot(
data = smpl_probs,
aes(x=data, y=log(prob))
) +
geom_boxplot() +
facet_grid(
col=vars(method)
)
```
```{r}
# Simulation parameters:
m <- c()
counter <- 1
while(length(m)<7 & max(m/n)<.1) {
temp <- max(p,(2^(counter)*1e-3) * n)
m <- unique(c(m, temp))
counter <- counter + 1
}
weighted <- c(T,F)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
```
```{r,eval=F}
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- data_sets[[data_set]]$y
output <- sim_subsampling(X, y, m, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("OLS","WLS")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_zhu.rds")
```
```{r}
grid_search <- readRDS("outputs/grid_search_zhu.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m/n, y=log_value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling:"
) +
labs(
x="m/n",
y="Logarithm",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
```
Figures \@ref(fig:wls-zhu) and \@ref(fig:ols-zhu) show the resulting MSE, squared bias and variance for the different subsampling methods and data sets using weighted least-squares and ordinary least-squares, respectively. The subsampling size increases along the horizontal axis. The figures are interactive to allow readers to zoom in, etc.
For the data sets that are also shown in @zhu2015optimal we find the same overall pattern: PL and OPT outperform other methods when using weighted least-squares, while BLEV outperforms other methods when using unweighted/ordinary least-squares.
For Gaussian data (GA) the differences between the methods are minimal since data points are homogeneous. A similar picture emerges when running the method comparison for the sinusoidal data introduced above (see [appendix](#app-sin)). In fact, @zhu2015optimal recommend relying on uniform subsampling when the data is Gaussian. Another interesting observation is that for t-distributed data (T1) the non-uniform subsampling methods significantly outperform uniform subsampling. This is despite the fact that in the case of T1 data the conditions used to establish asymptotic consistency of the non-uniform subsampling methods in @zhu2015optimal are not fulfilled: in particular, the fourth moment is not finite (in fact it is not defined).
```{r wls-zhu, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted least-squares."}
library(plotly)
layout_ggplotly <- function(gg, x = -0.075, y = -0.025, x_legend=1.05, y_legend=0.95, mar= list(l=50,r=150,b=50)){
# The 1 and 2 goes into the list that contains the options for the x and y axis labels respectively
gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
gg[['x']][['layout']][['annotations']][[11]][['x']] <- x_legend
gg[['x']][['layout']][['legend']][['y']] <- y_legend
gg[['x']][['layout']][['legend']][['x']] <- x_legend
gg %>% layout(margin = mar)
}
gg <- ggplotly(p_list$WLS)
layout_ggplotly(gg)
```
```{r ols-zhu, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary least-squares."}
gg <- ggplotly(p_list$OLS)
layout_ggplotly(gg)
```
## Computational performance
```{r}
n <- 200
p <- 100
```
We have already seen above that theoretically speaking both BLEV and OPT subsampling are computationally more expensive than PL subsampling (with UNIF subsampling the least expensive). It should be obvious that in light of their computational costs $\mathcal{O}(np^2)$ the former two methods do not scale well in higher-dimensional problems (higher $p$). @zhu2015optimal demonstrate this through empirical exercises to an extent that is beyond the scope of this note. Instead we will just quickly benchmark the different functions for non-uniform subsampling introduced above: `BLEV`, `OPT`, `PL` for $n=`r n`$, $p=`r p`$. We are only interested in how long it takes to compute subsampling probabilities, and since for `UNIF` all subsampling probabilities are simply $\pi_i=1/n$ we neglect this here. Figure \@ref(fig:mbm-methods) benchmarks the three non-uniform subsampling methods. Evidently PL subsampling is computationally much less costly.
```{r mbm-methods, fig.cap="Benchmark of computational performance of different methods."}
library(microbenchmark)
X <- matrix(rnorm(n*p),nrow=n)
y <- rnorm(n)
mb <- microbenchmark(
"BLEV" = {BLEV(X,y,prob_only = T)},
"OPT" = {OPT(X,y,prob_only = T)},
"PL" = {PL(X,y,prob_only = T)}
)
autoplot(mb)
```
# Classification problems
```{r}
source("R/undersampling.R")
source("R/sim_undersampling.R")
source("R/logit_irls.R")
```
In binary classification problems we are often faced with the issue of imbalanced training data: one of the two classes is under-represented relative to the other. This generally makes classifiers less sensitive to the minority class, which is often the class we want to predict [@branco2015survey]. Suppose for example we wanted to predict the probability of death for patients who suffer from COVID-19. The case-fatality rate for the virus is significantly lower than 10%, so any data we could obtain on patients would inevitably be imbalanced: the domain is skewed towards the class we are not interested in predicting.
A common and straightforward way to deal with this issue is to randomly over- or under-sample the training data. Let $y_i\in\{0,1\}$ for all $i=1,...,n$. We are interested in modelling $p_i=P(y_i=1|x_i)$ but our data is imbalanced: $n_{y=0}>>n_{y=1}$ where $n=n_{y=0}+n_{y=1}$. Then random over- and under-sampling works as follows:
1. *Random oversampling*: draw $y_i$ from the minority class $\mathbf{y}_{n_{y=1}}$ with probability $\{\pi_i\}^{n_{y=1}}_{i=1}=1/n_{y=1}$ and append $y_i$ to $\mathbf{y}_{n_{y=1}}$ so that $n_{y=1} \leftarrow n_{y=1}+1$ until $n_{y=0}=n_{y=1}$.
2. *Random undersampling*: draw $y_i$ from the majority class $\mathbf{y}_{n_{y=0}}$ with probability $\{\pi_i\}^{n_{y=0}}_{i=1}=1/n_{y=0}$ and remove $y_i$ from $\mathbf{y}_{n_{y=0}}$ so that $n_{y=0} \leftarrow n_{y=0}-1$ until $n_{y=0}=n_{y=1}$.
In a way both of these methods correspond to the uniform subsampling (UNIF) discussed above. Random oversampling may lead to overfitting. Conversely, random undersampling may come at the cost of eliminating observations with valuable information. With respect to the latter, we have already shown that more systematic subsampling approaches generally outperform uniform subsampling in linear regression problems. It would be interesting to see if we can apply similar ideas to classification with imbalanced data. How exactly we can go about doing this is straightforward to see:
3. *Systematic undersampling*: draw $n_{y=1}$ times from the majority class $\mathbf{y}_{n_{y=0}}$ with probabilities $\{\pi_i\}^{n_{y=0}}_{i=1}$ defined by optimal subsampling methods and throw out the remainder.
To remain in the subsampling framework we will focus on undersampling here. It should be noted that many systematic approaches to undersampling already exist. In their extensive survey @branco2015survey mention undersampling methods based on distance-criteria, condensed nearest neighbours as well as active learning methods. Optimal subsampling is not mentioned in the survey.
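To make the idea of systematic undersampling concrete, the sketch below reduces the majority class down to the size of the minority class, either uniformly (`prob = NULL`) or with supplied probabilities. This is a hypothetical helper for illustration only; the code actually used in the exercises lives in `R/undersampling.R`.
```{r, eval=F}
# Hypothetical sketch: undersample the majority class (y == 0) down to the
# size of the minority class; prob = NULL corresponds to random undersampling.
undersample_majority <- function(X, y, prob = NULL) {
  idx_min <- which(y == 1)
  idx_maj <- which(y == 0)
  keep <- sample(idx_maj, size = length(idx_min), prob = prob)
  idx <- c(idx_min, keep)
  list(X = X[idx, , drop = FALSE], y = y[idx])
}
```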
## Optimal subsampling for classification problems
We cannot apply the methods used for subsampling with linear regression directly to classification problems, although we will see that certain ideas still apply. @wang2018optimal explore optimal subsampling for large sample logistic regression - a paper that is very much related to @zhu2015optimal. We will not cover the details here as much as we did for @zhu2015optimal and instead jump straight to one of the main results. @wang2018optimal propose a two-step algorithm that in the first step estimates a logistic regression with uniform subsampling using $m_0$ observations. In the second step the estimated coefficient $\beta^{(UNIF)}_{m_0}$ from the first step is used to compute subsampling probabilities as follows:
$$
\begin{equation}
\begin{aligned}
&& \pi_i&= \frac{|y_i-p_i(\beta^{(UNIF)}_{m_0})| \ ||\mathbf{X}_i||}{\sum_{j=1}^n|y_j-p_j(\beta^{(UNIF)}_{m_0})| \ ||\mathbf{X}_j||}\\
\end{aligned}
(\#eq:subs_mVc)
\end{equation}
$$
The probabilities are then used to subsample $m$ observations again. Finally, $\hat\beta_{m_0+m}$ is estimated from the combined subsamples with weighted logistic regression, where the weights once again correspond to inverse sampling probabilities. The authors refer to the method corresponding to \@ref(eq:subs_mVc) as *mVc* since it corresponds to optimal subsampling with variance targeting. They also propose a method which targets the MSE directly, but it comes at a higher computational cost: $\mathcal{O}(np^2)$ vs. $\mathcal{O}(np)$ for *mVc*. The similarities to the case of linear regression should be obvious. We will only consider *mVc* here and see if it can be applied to undersampling. The code that implements this can be found [here](R/undersampling.R) - it uses object-oriented programming so it could be interesting for some readers. @wang2018optimal never run the model with ordinary as opposed to weighted logit, possibly because ordinary logit is not sensible in this context. But for the sake of completeness we'll once again consider both cases here.
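Since the undersampling code is only linked, a sketch of how the *mVc* probabilities in \@ref(eq:subs_mVc) could be computed is shown below. The pilot size `m0`, the use of `glm()` without an intercept and the function name are my own illustrative assumptions, not necessarily how `R/undersampling.R` proceeds.
```{r, eval=F}
# Hypothetical sketch of mVc-style subsampling probabilities.
mvc_probabilities <- function(X, y, m0 = 200) {
  n <- nrow(X)
  # Step 1: pilot logit fit on a uniform subsample of size m0.
  pilot <- sample(n, size = m0)
  fit0 <- glm(y[pilot] ~ X[pilot, , drop = FALSE] - 1, family = "binomial")
  # Step 2: predicted probabilities under the pilot coefficients.
  p_hat <- 1 / (1 + exp(-c(X %*% coef(fit0))))
  # mVc weights: |y_i - p_i| * ||X_i||, normalised to sum to one.
  w <- abs(y - p_hat) * sqrt(rowSums(X^2))
  w / sum(w)
}
```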
## Synthetic data {#class-syn}
```{r}
n <- length(data_sets$GA$y)
prob <- 0.05
```
In this section we will use the same synthetic data sets as above for our design matrix $\mathbf{X}$. But while above we sampled $y_i |x_i\sim \mathcal{N}(\beta^T x_i,\sigma^2)$, here we will simulate a single draw from $\mathbf{y} |\mathbf{X} \sim \text{Bernoulli}(p)$, where, in order to create imbalance, we let $p<0.5$ vary. To model the probability of $y=1$ we will use logistic regression:
$$
\begin{equation}
\begin{aligned}
&& \mathbf{p}&= \frac{ \exp( \mathbf{X} \beta )}{1 + \exp(\mathbf{X} \beta)}
\end{aligned}
(\#eq:log-reg)
\end{equation}
$$
Equation \@ref(eq:log-reg) is not estimated directly but rather derived from linear predictions
$$
\begin{equation}
\begin{aligned}
\log \left( \frac{\mathbf{p}}{1-\mathbf{p}} \right)&= \mathbf{X} \beta \\
\end{aligned}
(\#eq:lin-pred)
\end{equation}
$$
where $\beta$ can be estimated through iteratively reweighted least-squares (IRLS), which is a simple implementation of Newton's method (see for example @wasserman2013all; a complete derivation can also be found in the [appendix](#irls)):
$$
\begin{equation}
\begin{aligned}
&& \beta_{s+1}&= \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W}z\\
\text{where}&& z&= \mathbf{X}\beta_{s} + \mathbf{W}^{-1} (\mathbf{y}-\mathbf{p}) \\
&& \mathbf{W}&= \text{diag}\{p_i(\beta_{s})(1-p_i(\beta_{s}))\}_{i=1}^n \\
\end{aligned}
(\#eq:irls)
\end{equation}
$$
In R this can be implemented from scratch as below. For the empirical exercises we will rely on `glm([formula], family="binomial")` which scales much better to higher dimensional problems than my custom function and also implements weighted logit.
```{r class.source = "fold-show", code=readLines("R/logit_irls.R")}
```
When thinking about repeating the empirical exercise from Section \@ref(lin-reg) above, we are faced with the question of how to compute the MSE and its bias-variance decomposition. One idea could be to compare the linear predictions in \@ref(eq:lin-pred) in the same way as we did before, since after all these are fitted values from a linear model. The problem with that idea is that rebalancing the data through undersampling affects the intercept term $\beta_0$ (potentially quite a bit if the data was originally very imbalanced). This is not surprising since $\beta_0$ measures the log odds of $y_i$ given that all predictors are zero ($\mathbf{X}_i=\mathbf{0}$). Hence comparing linear predictors $\mathbf{X}\beta_n$ and $\mathbf{X}\beta_m$ is not a good idea in this setting (perhaps ever?). Fortunately though we can just work with the predicted probabilities in \@ref(eq:log-reg) directly and decompose the MSE in the same way as before (see @manning2008introduction, p. 310):
$$
\begin{equation}
\begin{aligned}
&& \mathbb{E} \left( (\hat{\mathbf{p}}-\mathbf{p})^2 \right)
&= \text{var} (\hat{\mathbf{p}}) + \left( \mathbb{E} \left( \hat{\mathbf{p}} \right) - \mathbf{p} \right)^2 \\
\end{aligned}
(\#eq:mse-prob)
\end{equation}
$$
Hence the empirical exercise in this section is very similar to the one above and can be summarized as follows:
**Empirical exercise**
1. Generate synthetic data $\mathbf{X}$ of dimension $(n \times p)$ with $n>>p$.
2. Randomly generate the binary outcome variable as $\mathbf{y} | \mathbf{X} \sim \text{Bernoulli}(p)$ where, as above, the minority-class share $p<0.5$ is varied.
3. Estimate the full-sample logit estimator $\hat\beta_n$.
4. Use one of the undersampling methods to estimate iteratively $\{\hat\beta^{(b)}_m\}^B_{b=1}$.
5. Evaluate the average model performance of $\hat\beta_m$ under the mean-squared error criterion: $MSE= \frac{1}{B} \sum_{b=1}^{B} MSE^{(b)}$ where $MSE^{(b)}$ corresponds to the in-sample estimator of the mean-squared error of the $b$-th iteration.
There is a decisive difference between this exercise and the one we ran for linear regression models above: here we are mechanically reducing the number of observations of the majority class and with non-uniform subsampling we are selective about what observations we keep. This is very different to simply subsampling without imposing a structure on the subsample. Intuitively it may well be that in this context non-uniform subsampling is prone to produce estimates of $\hat\beta_m$ that are very different from $\hat\beta_n$. This effect could be expected to appear in the squared bias term of the MSE.
The resulting mean-squared error decompositions are shown in Figures \@ref(fig:wls-logit) and \@ref(fig:ols-logit) for weighted and ordinary logit. A first interesting and reassuring observation is that for both ordinary and weighted logit the variance component is generally reduced more with *mVc* subsampling (although this is much less pronounced for weighted logit). A second observation is that the squared bias term can evidently not be controlled with *mVc*, which leads to the overall mean-squared errors generally being higher than with random undersampling. This could be evidence that selective subsampling in fact introduces bias as speculated above, but this is not clear and would have to be explored more thoroughly.
*It could also very well be that the algorithm in @wang2018optimal was not implemented correctly here and the results are therefore incorrect. Unfortunately I have not had time to fully replicate that second paper as well.*
```{r}
# Simulation parameters:
data_sets <- tryCatch(
data_sets,
error = function(e) {
readRDS("data/synthetic_data.rds")
}
)
n <- 1000
m <- 2^(3:7)*1e-3
weighted <- c(T,F)
methods <- list(
"UNIF" = UNIF,
"mVc" = mVc
)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
```
```{r, eval=F}
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- sample(c(1,0),size=n,rep=T,prob = c(m,1-m))
vars <- undersampler(X,y) # undersampler instance
output <- sim_undersampling(vars, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("Logit","Weighted Logit")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_logit.rds")
```
```{r}
grid_search <- readRDS("outputs/grid_search_logit.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m, y=value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling:"
) +
labs(
x="Proportion of sample",
y="Value",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
```
```{r wls-logit, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted logistic regression. MSE and its components correspond to the predicted probabilities of the logistic regression."}
gg <- ggplotly(p_list$`Weighted Logit`)
layout_ggplotly(gg, y=-0.05)
```
```{r ols-logit, fig.height=5, fig.width=10, fig.cap="MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary logistic regression. MSE and its components correspond to the predicted probabilities of the logistic regression."}
gg <- ggplotly(p_list$Logit)
layout_ggplotly(gg, y=-0.05)
```
## Real data example
Until now we have merely looked at approximations of $\hat\beta_n$ for logistic regression with imbalanced data. But ultimately the goal in classification problems is to accurately predict $\mathbf{y}$ from test data. To investigate this point we will next turn to real data.
```{r, eval=F}
dt <- fread("data/latestdata.csv")
dt <- dt[,.(ID, age, sex, latitude, longitude, outcome)]
for(col in names(dt)) {
set(dt, i=which(dt[[col]]==""), j=col, value=NA)
}
dt <- na.omit(dt)
dt[,patient_id:=.I][,ID:=NULL]
# Preprocess outcome variable:
death_states <- unique(dt$outcome)[grepl("die",unique(dt$outcome), ignore.case = T) |
grepl("dea",unique(dt$outcome), ignore.case = T) |
grepl("decea",unique(dt$outcome), ignore.case = T)]
dt[,y:=ifelse(outcome %in% death_states,1,0)][,outcome:=NULL]
# Preprocess features:
dt[,age:=mean(as.numeric(unlist(strsplit(age,"-")))),by=patient_id]