# Chris's wishlist
This is @ck37's draft wishlist, generally under revision. Feel free to add/edit.
## Wrappers
- New wrappers:
  - h2o wrappers via Erin
  - LightGBM
  - Rborist
  - Adaptive Lasso
  - mxnet
  - keras
  - Naive Bayes via klaR and/or e1071 (a draft is in SuperLearnerExtra); see the sketch after this list
  - svmpath (fast SVM)
  - mlr wrapper, a la the caret wrapper (if possible)
  - Reinforcement learning trees
  - MlBayesOpt versions of xgboost, ranger, and svm (assuming the package works ok)
  - sparsediscrim (https://github.com/ramhiser/sparsediscrim)
- SL.nnet: expose more key arguments for hyperparameter search
- SL.knn: clean up the documentation; stop() for the gaussian family
- SL.rpartPrune: integrate with SL.rpart to reduce code duplication
- SL.caret
- Incorporate some of David Benkeser's changes from his SL lecture
- Best single-variable benchmark, as described on the Win-Vector blog
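A minimal sketch of what an e1071-based naive Bayes wrapper could look like, following the standard SuperLearner wrapper convention of returning `list(pred, fit)` plus a matching predict method. The name `SL.naiveBayes` and the `laplace` argument pass-through are assumptions, not settled API:

```r
# Sketch of a naive Bayes wrapper via e1071 (binomial outcomes only).
SL.naiveBayes <- function(Y, X, newX, family, obsWeights, laplace = 0, ...) {
  if (family$family == "gaussian") {
    stop("SL.naiveBayes only supports binomial outcomes.")
  }
  # Note: e1071::naiveBayes does not support observation weights,
  # so obsWeights is ignored here.
  fit <- e1071::naiveBayes(x = X, y = as.factor(Y), laplace = laplace)
  # type = "raw" returns class probabilities; column 2 is P(Y = 1).
  pred <- predict(fit, newdata = newX, type = "raw")[, 2]
  fit_obj <- list(object = fit)
  class(fit_obj) <- "SL.naiveBayes"
  list(pred = pred, fit = fit_obj)
}

# Corresponding predict method, used by predict.SuperLearner().
predict.SL.naiveBayes <- function(object, newdata, ...) {
  predict(object$object, newdata = newdata, type = "raw")[, 2]
}
```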
## Reporting
- SuperLearner: SE of the risk estimate for each learner in the results table, just like CV.SL
  - See ck37r::sl_stderr and the sketch after this list
- Weight table for CV.SuperLearner (see ck37r::cvsl_weights)
- Better default output for print.CV.SuperLearner
- Overfitting diagnostic(s):
  - Estimated training-set error for each learner
  - Ratio and absolute difference versus the test-set error?
  - Meta-learner support for setting overfitting learners to zero weight (e.g. when using TMLE rather than CV-TMLE)?
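A rough sketch of how per-learner SEs might be computed from the cross-validated predictions. It assumes squared-error loss, that `sl$Z` holds the cross-validated predictions on the outcome scale, and that `Y` is passed in separately (the fit object may not store it); `sl_learner_se` is a hypothetical helper, not an existing function:

```r
# Sketch: SE of the CV risk estimate for each learner, squared-error loss.
# sl$Z holds the cross-validated predictions (one column per learner).
sl_learner_se <- function(sl, Y) {
  apply(sl$Z, 2, function(preds) {
    loss <- (Y - preds)^2
    # SE of the mean loss, i.e. of the CV risk estimate.
    sd(loss) / sqrt(length(loss))
  })
}

# Example: cbind(risk = sl$cvRisk, se = sl_learner_se(sl, Y))
```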
## Visualization
- SuperLearner: a plot method just like CV.SuperLearner's, but without the SL and Discrete SL rows; see the sketch after this list
  - Working code is in ck37r::plot.SuperLearner
  - Could plot the empirical risk for SL and Discrete SL, but warn about bias
- Partial dependence plots and/or individual conditional expectation plots
- The CV.SuperLearner plot should insert the number of folds rather than printing "V-fold"
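A sketch of such a plot in ggplot2, reusing the hypothetical `sl_learner_se()` helper from the Reporting section for rough 95% intervals (`plot_sl` is illustrative, not the proposed method name):

```r
library(ggplot2)

# Sketch: dot-and-whisker plot of CV risk per learner, no SL/Discrete SL rows.
plot_sl <- function(sl, Y) {
  se <- sl_learner_se(sl, Y)  # hypothetical helper from the Reporting sketch
  df <- data.frame(learner = names(sl$cvRisk),
                   risk = sl$cvRisk,
                   lower = sl$cvRisk - 1.96 * se,
                   upper = sl$cvRisk + 1.96 * se)
  ggplot(df, aes(x = risk, y = reorder(learner, -risk))) +
    geom_point() +
    geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2) +
    labs(x = "CV risk (95% CI)", y = NULL) +
    theme_minimal()
}
```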
## Methodology
- Option to return the training split-specific fits, to support revere SL
- Option to not re-fit to the full data, to support revere SL
- The SL meta-learner should work even if some learners return NAs
- Multinomial classification: integrate the origami wrappers
- Support for repeated cross-validation
  - And the ability to add repeats after the initial fit
- sl_fit_libraries and sl_analyze_libraries for fitting and analyzing sets of libraries
- Iteratively add learners to the ensemble, especially for cumulative analysis and iterative hyperparameter optimization
  - Or remove learners (e.g. overfitting ones)
- RF-style permutation-based variable importance analysis for SL and CV.SL
- CV.SuperLearner with multiple meta-learners fit for comparison
  - Also for SuperLearner
- create.Learner(): option to pass in a custom grid, e.g. when only a restricted set of hyperparameter combinations is wanted
- Random hyperparameter search, possibly via create.Learner(); see the sketch after this list
- More hyperparameter search: rHyperband, Bayesian optimization
- AutoML, like h2o, TPOT, etc.
- Apply multiple screeners in sequence
- Data pre-processing step prior to feature selection
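One possible shape for random search, sketched outside of create.Learner(): sample rows from a full tuning grid and generate wrapper closures. The `make_ranger` factory and learner names are illustrative, and this assumes the existing `SL.ranger` wrapper with its `num.trees` and `min.node.size` arguments:

```r
# Sketch: random hyperparameter search over SL.ranger settings.
grid <- expand.grid(num.trees = c(250, 500, 1000, 2000),
                    min.node.size = c(1, 5, 10, 20))
set.seed(27)
sampled <- grid[sample(nrow(grid), 5), ]

# Closure factory so each learner captures its own hyperparameters.
make_ranger <- function(num.trees, min.node.size) {
  force(num.trees); force(min.node.size)
  function(...) SL.ranger(..., num.trees = num.trees,
                          min.node.size = min.node.size)
}

learner_names <- paste0("SL.ranger_", seq_len(nrow(sampled)))
for (i in seq_len(nrow(sampled))) {
  assign(learner_names[i],
         make_ranger(sampled$num.trees[i], sampled$min.node.size[i]),
         envir = globalenv())
}
# learner_names can then be passed as (part of) SL.library.
```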
## Architecture
- Unify parallelization under SuperLearner(): deprecate mcSuperLearner and snowSuperLearner; see the sketch after this list
- Support nested (external) cross-validation directly within SuperLearner(): deprecate CV.SuperLearner
- Review the function arguments of SuperLearner and CV.SuperLearner
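One hypothetical shape for the unified call. None of the `parallel`, `workers`, or `outer_cv_folds` arguments exist today; this is only a sketch of the idea:

```r
# Hypothetical future call: one function, with parallelization and
# external CV as arguments rather than separate functions.
sl <- SuperLearner(Y = Y, X = X, family = binomial(),
                   SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
                   # Assumed new arguments, replacing mcSuperLearner /
                   # snowSuperLearner and CV.SuperLearner:
                   parallel = "multicore", workers = 4,
                   outer_cv_folds = 10)
```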
## Usability
- Progress indicator! (progress package; see the sketch after this list)
- CVFolds and SuperLearner: return an error if the number of folds in cvControl is not defined
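A sketch of how the progress package could be wired into the CV loop; the `for` loop below stands in for SuperLearner's internal fold loop:

```r
library(progress)

V <- 10  # number of CV folds
pb <- progress_bar$new(
  format = "  fold :current/:total [:bar] :percent eta: :eta",
  total = V)

for (v in seq_len(V)) {
  # ... fit all learners on the training split for fold v ...
  pb$tick()
}
```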
## Quality / Organization
- test_wrapper() to run a battery of standard tests on each wrapper; see the sketch after this list
- 90%+ test coverage
- The meta-learner functions in method.R should be in separate files
- Function to list all methods, like listWrappers() (or add them to listWrappers())
- The meta-learner functions need documentation and examples
- Review ?SuperLearner and ?CV.SuperLearner
- Display the CV iteration at the beginning of each fold
- Report the CV time elapsed at the end of each fold
- Move all wrappers into R/wrappers
- Move all screeners into R/screeners
- Pass linting
- Convert the Rd files to roxygen
- Cheatsheet
- Output fold-specific learner performance and timing during fitting when verbose = TRUE
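A minimal sketch of what test_wrapper() might check, using simulated data. The function name and the specific invariants are illustrative; binomial-only wrappers would need a flag to skip the gaussian case:

```r
# Sketch: run a wrapper on simulated data and check basic invariants.
test_wrapper <- function(wrapper, n = 100, p = 5) {
  set.seed(1)
  X <- data.frame(matrix(rnorm(n * p), nrow = n))
  Y_gaus <- rnorm(n)
  Y_bin <- rbinom(n, 1, 0.5)
  w <- rep(1, n)

  for (setup in list(list(Y = Y_gaus, family = gaussian()),
                     list(Y = Y_bin, family = binomial()))) {
    out <- wrapper(Y = setup$Y, X = X, newX = X,
                   family = setup$family, obsWeights = w)
    stopifnot(is.list(out),
              all(c("pred", "fit") %in% names(out)),
              length(out$pred) == n,
              !anyNA(out$pred))
  }
  invisible(TRUE)
}

# Example: test_wrapper(SL.glm)
```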
## Screeners
- Check that all screening functions exist before fitting any screeners, which allows faster debugging; see the sketch after this list
- Screening by correlation could be improved for binary outcomes (polychoric)
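A sketch of the up-front existence check; `check_screeners` is a hypothetical helper that would run before any fitting begins:

```r
# Sketch: fail fast if any screening function is missing.
check_screeners <- function(screener_names) {
  missing <- screener_names[!vapply(screener_names, exists,
                                    logical(1), mode = "function")]
  if (length(missing) > 0) {
    stop("Screening function(s) not found: ",
         paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Example: check_screeners(c("screen.corP", "screen.randomForest"))
```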
## Performance
- Improve the parallelization approach: SL currently parallelizes over each fold, rather than over each fold x learner combination
- Time each algorithm; report the time and the percentage of total computation time (see the sketch after this list)
- Fix learners that unnecessarily save training data in their model objects, increasing RAM usage (via Jeremy)
- Time the full function execution automatically
- Parallel prediction
- Rcpp implementation if it would help with anything
  - screen.corP and screen.corRank on large, wide datasets may benefit
- Support OpenML: https://arxiv.org/abs/1701.01293
- Diversity metrics, e.g. Yule's Q-statistic
  - See Meynet, J., & Thiran, J. P. (2010). Information theoretic combination of pattern classifiers. Pattern Recognition, 43(10), 3412-3421.
  - Unclear how to implement, or whether this is actually useful
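A sketch of per-learner timing via a decorating closure; `make_timed` is hypothetical, and a real implementation would more likely live inside SuperLearner itself so it could also report percentages of total time:

```r
# Sketch: wrap a learner so each call reports its elapsed time.
make_timed <- function(wrapper_name) {
  wrapper <- get(wrapper_name)
  function(...) {
    t0 <- proc.time()
    out <- wrapper(...)
    elapsed <- (proc.time() - t0)[["elapsed"]]
    message(sprintf("%s: %.2fs", wrapper_name, elapsed))
    out
  }
}

# Example:
# SL.glmnet_timed <- make_timed("SL.glmnet")
```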