---
title: "README"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# optimalSubsampling
This project investigates whether and how systematic subsampling can be applied to imbalanced learning.
All details can be found in the [Jupyter notebook](notebook.ipynb), which is a good choice if you want a condensed, interactive version and like working with Jupyter notebooks. For a better reading experience, I recommend the more detailed [HTML report](model_selection.html). A quick overview is provided below.
## Overview
The case for subsampling arises when $n \gg p$, i.e. for very large values of $n$. In such cases we may be interested in estimating model coefficients $\hat\beta_m$ instead of $\hat\beta_n$, where $p \le m \ll n$ and $m$ is freely chosen by us. In practice we may want to do this to avoid the high computational cost of fitting on all $n$ observations. The basic algorithm for estimating $\hat\beta_m$ is simple (a sketch in R follows the steps):
1. Subsample $m$ observations with replacement from the data, with sampling probabilities $\{\pi_i\}$.
2. Compute the least-squares estimator $\hat\beta_m$ from the subsample.
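As a minimal sketch of these two steps in R (the function `subsample_lm` and the simulated `X` and `y` are illustrative names, not part of the project code):

```{r subsample-sketch, eval=FALSE}
subsample_lm <- function(X, y, m, probs) {
  # Step 1: draw m row indices with replacement, with probabilities pi_i
  idx <- sample(nrow(X), size = m, replace = TRUE, prob = probs)
  # Step 2: the least-squares fit on the subsample gives beta_hat_m
  lm.fit(X[idx, , drop = FALSE], y[idx])$coefficients
}

# Example with uniform probabilities pi_i = 1/n
n <- 1e5; p <- 10
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- as.vector(X %*% rnorm(p) + rnorm(n))
beta_m <- subsample_lm(X, y, m = 1000, probs = rep(1 / n, n))
```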
Here we look at a few of the subsampling methods investigated and proposed in Zhu et al., 2015, which differ primarily in their choice of subsampling probabilities $\{\pi_i\}$. The baseline results from Zhu et al., 2015 are replicated here and are consistent with the authors' findings: systematic subsampling can greatly improve model performance.
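One choice studied in this line of work is leverage-based subsampling, where $\pi_i$ is proportional to the leverage score $h_{ii} = [X(X^\top X)^{-1}X^\top]_{ii}$. A sketch, reusing the hypothetical `X`, `y`, and `subsample_lm` from above:

```{r leverage-sketch, eval=FALSE}
# Leverage scores via the thin QR decomposition X = QR:
# the hat matrix is H = Q Q^T, so h_ii = sum_j Q[i, j]^2
h <- rowSums(qr.Q(qr(X))^2)
probs_lev <- h / sum(h)  # normalize (sum(h) = p) so the pi_i sum to 1
beta_lev <- subsample_lm(X, y, m = 1000, probs = probs_lev)
```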
![](www/mse.png)