# Chris's wishlist
This is @ck37's draft wishlist, generally under revision. Feel free to add/edit.
## Wrappers
- New wrappers:
  - h2o wrappers via Erin
  - LightGBM
  - Rborist
  - Adaptive Lasso
  - mxnet
  - keras
  - Naive Bayes via klaR and/or e1071 (a draft is in SuperLearnerExtra); see the sketch after this list
  - svmpath (fast SVM)
  - mlr wrapper, a la the caret wrapper (if possible)
  - Reinforcement learning trees
  - MlBayesOpt versions of xgboost, ranger, and svm (assuming the package works ok)
  - sparsediscrim (https://github.com/ramhiser/sparsediscrim)
- SL.nnet: expose more key arguments for hyperparameter search
- SL.knn: clean up the documentation; stop() for the gaussian family
- SL.rpartPrune: integrate with SL.rpart to reduce code duplication
- SL.caret
- Incorporate some of David Benkeser's changes from his SL lecture
- Best single-variable benchmark, as described on the Win-Vector blog
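A minimal sketch of what an e1071-based naive Bayes wrapper could look like, following the standard SuperLearner wrapper convention of returning `list(pred, fit)` plus a matching predict method. The name `SL.naiveBayes` and the `laplace` argument pass-through are assumptions, not settled API:

```r
# Sketch of a naive Bayes wrapper via e1071 (binomial outcomes only).
SL.naiveBayes <- function(Y, X, newX, family, obsWeights, laplace = 0, ...) {
  if (family$family == "gaussian") {
    stop("SL.naiveBayes only supports binomial outcomes.")
  }
  # Note: e1071::naiveBayes does not support observation weights,
  # so obsWeights is ignored here.
  fit <- e1071::naiveBayes(x = X, y = as.factor(Y), laplace = laplace)
  # type = "raw" returns class probabilities; column 2 is P(Y = 1).
  pred <- predict(fit, newdata = newX, type = "raw")[, 2]
  fit_obj <- list(object = fit)
  class(fit_obj) <- "SL.naiveBayes"
  list(pred = pred, fit = fit_obj)
}

# Corresponding predict method, used by predict.SuperLearner().
predict.SL.naiveBayes <- function(object, newdata, ...) {
  predict(object$object, newdata = newdata, type = "raw")[, 2]
}
```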
## Reporting
- SuperLearner: SE of the risk estimate for each learner in the results table, just like CV.SL
  - See ck37r::sl_stderr and the sketch after this list
- Weight table for CV.SuperLearner (see ck37r::cvsl_weights)
- Better default output for print.CV.SuperLearner
- Overfitting diagnostic(s):
  - Estimated training-set error for each learner
  - Ratio and absolute difference versus the test-set error?
  - Meta-learner support for setting overfitting learners to zero weight (e.g. when using TMLE rather than CV-TMLE)?
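A rough sketch of how per-learner SEs might be computed from the cross-validated predictions. It assumes squared-error loss, that `sl$Z` holds the cross-validated predictions on the outcome scale, and that `Y` is passed in separately (the fit object may not store it); `sl_learner_se` is a hypothetical helper, not an existing function:

```r
# Sketch: SE of the CV risk estimate for each learner, squared-error loss.
# sl$Z holds the cross-validated predictions (one column per learner).
sl_learner_se <- function(sl, Y) {
  apply(sl$Z, 2, function(preds) {
    loss <- (Y - preds)^2
    # SE of the mean loss, i.e. of the CV risk estimate.
    sd(loss) / sqrt(length(loss))
  })
}

# Example: cbind(risk = sl$cvRisk, se = sl_learner_se(sl, Y))
```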
## Visualization
- SuperLearner: a plot method just like CV.SuperLearner's, but without the SL and Discrete SL rows; see the sketch after this list
  - Working code is in ck37r::plot.SuperLearner
  - Could plot the empirical risk for SL and Discrete SL, but warn about bias
- Partial dependence plots and/or individual conditional expectation plots
- The CV.SuperLearner plot should insert the number of folds rather than printing "V-fold"
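A sketch of such a plot in ggplot2, reusing the hypothetical `sl_learner_se()` helper from the Reporting section for rough 95% intervals (`plot_sl` is illustrative, not the proposed method name):

```r
library(ggplot2)

# Sketch: dot-and-whisker plot of CV risk per learner, no SL/Discrete SL rows.
plot_sl <- function(sl, Y) {
  se <- sl_learner_se(sl, Y)  # hypothetical helper from the Reporting sketch
  df <- data.frame(learner = names(sl$cvRisk),
                   risk = sl$cvRisk,
                   lower = sl$cvRisk - 1.96 * se,
                   upper = sl$cvRisk + 1.96 * se)
  ggplot(df, aes(x = risk, y = reorder(learner, -risk))) +
    geom_point() +
    geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2) +
    labs(x = "CV risk (95% CI)", y = NULL) +
    theme_minimal()
}
```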
## Methodology
- Option to return the training split-specific fits, to support revere SL
- Option to not re-fit to the full data, to support revere SL
- The SL meta-learner should work even if some learners return NAs
- Multinomial classification: integrate the origami wrappers
- Support for repeated cross-validation
  - And the ability to add repeats after the initial fit
- sl_fit_libraries and sl_analyze_libraries for fitting and analyzing sets of libraries
- Iteratively add learners to the ensemble, especially for cumulative analysis and iterative hyperparameter optimization
  - Or remove learners (e.g. overfitting ones)
- RF-style permutation-based variable importance analysis for SL and CV.SL
- CV.SuperLearner with multiple meta-learners fit for comparison
  - Also for SuperLearner
- create.Learner(): option to pass in a custom grid, e.g. when only a restricted set of hyperparameter combinations is wanted
- Random hyperparameter search, possibly via create.Learner(); see the sketch after this list
- More hyperparameter search: rHyperband, Bayesian optimization
- AutoML, like h2o, TPOT, etc.
- Apply multiple screeners in sequence
- Data pre-processing step prior to feature selection
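One possible shape for random search, sketched outside of create.Learner(): sample rows from a full tuning grid and generate wrapper closures. The `make_ranger` factory and learner names are illustrative, and this assumes the existing `SL.ranger` wrapper with its `num.trees` and `min.node.size` arguments:

```r
# Sketch: random hyperparameter search over SL.ranger settings.
grid <- expand.grid(num.trees = c(250, 500, 1000, 2000),
                    min.node.size = c(1, 5, 10, 20))
set.seed(27)
sampled <- grid[sample(nrow(grid), 5), ]

# Closure factory so each learner captures its own hyperparameters.
make_ranger <- function(num.trees, min.node.size) {
  force(num.trees); force(min.node.size)
  function(...) SL.ranger(..., num.trees = num.trees,
                          min.node.size = min.node.size)
}

learner_names <- paste0("SL.ranger_", seq_len(nrow(sampled)))
for (i in seq_len(nrow(sampled))) {
  assign(learner_names[i],
         make_ranger(sampled$num.trees[i], sampled$min.node.size[i]),
         envir = globalenv())
}
# learner_names can then be passed as (part of) SL.library.
```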
## Architecture
- Unify parallelization under SuperLearner(): deprecate mcSuperLearner and snowSuperLearner; see the sketch after this list
- Support nested (external) cross-validation directly within SuperLearner(): deprecate CV.SuperLearner
- Review the function arguments of SuperLearner and CV.SuperLearner
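One hypothetical shape for the unified call. None of the `parallel`, `workers`, or `outer_cv_folds` arguments exist today; this is only a sketch of the idea:

```r
# Hypothetical future call: one function, with parallelization and
# external CV as arguments rather than separate functions.
sl <- SuperLearner(Y = Y, X = X, family = binomial(),
                   SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
                   # Assumed new arguments, replacing mcSuperLearner /
                   # snowSuperLearner and CV.SuperLearner:
                   parallel = "multicore", workers = 4,
                   outer_cv_folds = 10)
```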
## Usability
- Progress indicator! (progress package; see the sketch after this list)
- CVFolds and SuperLearner: return an error if the number of folds in cvControl is not defined
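A sketch of how the progress package could be wired into the CV loop; the `for` loop below stands in for SuperLearner's internal fold loop:

```r
library(progress)

V <- 10  # number of CV folds
pb <- progress_bar$new(
  format = "  fold :current/:total [:bar] :percent eta: :eta",
  total = V)

for (v in seq_len(V)) {
  # ... fit all learners on the training split for fold v ...
  pb$tick()
}
```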
## Quality / Organization
- test_wrapper() to run a battery of standard tests on each wrapper; see the sketch after this list
- 90%+ test coverage
- The meta-learner functions in method.R should be in separate files
- Function to list all methods, like listWrappers() (or add them to listWrappers())
- The meta-learner functions need documentation and examples
- Review ?SuperLearner and ?CV.SuperLearner
- Display the CV iteration at the beginning of each fold
- Report the CV time elapsed at the end of each fold
- Move all wrappers into R/wrappers
- Move all screeners into R/screeners
- Pass linting
- Convert the Rd files to roxygen
- Cheatsheet
- Output fold-specific learner performance and timing during fitting when verbose = TRUE
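A minimal sketch of what test_wrapper() might check, using simulated data. The function name and the specific invariants are illustrative; binomial-only wrappers would need a flag to skip the gaussian case:

```r
# Sketch: run a wrapper on simulated data and check basic invariants.
test_wrapper <- function(wrapper, n = 100, p = 5) {
  set.seed(1)
  X <- data.frame(matrix(rnorm(n * p), nrow = n))
  Y_gaus <- rnorm(n)
  Y_bin <- rbinom(n, 1, 0.5)
  w <- rep(1, n)

  for (setup in list(list(Y = Y_gaus, family = gaussian()),
                     list(Y = Y_bin, family = binomial()))) {
    out <- wrapper(Y = setup$Y, X = X, newX = X,
                   family = setup$family, obsWeights = w)
    stopifnot(is.list(out),
              all(c("pred", "fit") %in% names(out)),
              length(out$pred) == n,
              !anyNA(out$pred))
  }
  invisible(TRUE)
}

# Example: test_wrapper(SL.glm)
```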
## Screeners
- Check that all screening functions exist before fitting any screeners, which allows faster debugging; see the sketch after this list
- Screening by correlation could be improved for binary outcomes (polychoric)
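A sketch of the up-front existence check; `check_screeners` is a hypothetical helper that would run before any fitting begins:

```r
# Sketch: fail fast if any screening function is missing.
check_screeners <- function(screener_names) {
  missing <- screener_names[!vapply(screener_names, exists,
                                    logical(1), mode = "function")]
  if (length(missing) > 0) {
    stop("Screening function(s) not found: ",
         paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Example: check_screeners(c("screen.corP", "screen.randomForest"))
```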
## Performance
- Improve the parallelization approach: SL currently parallelizes over each fold, rather than over each fold x learner combination
- Time each algorithm; report the time and the percentage of total computation time (see the sketch after this list)
- Fix learners that unnecessarily save training data in their model objects, increasing RAM usage (via Jeremy)
- Time the full function execution automatically
- Parallel prediction
- Rcpp implementation if it would help with anything
  - screen.corP and screen.corRank on large, wide datasets may benefit
- Support OpenML: https://arxiv.org/abs/1701.01293
- Diversity metrics, e.g. Yule's Q-statistic
  - See Meynet, J., & Thiran, J. P. (2010). Information theoretic combination of pattern classifiers. Pattern Recognition, 43(10), 3412-3421.
  - Unclear how to implement, or whether this is actually useful
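A sketch of per-learner timing via a decorating closure; `make_timed` is hypothetical, and a real implementation would more likely live inside SuperLearner itself so it could also report percentages of total time:

```r
# Sketch: wrap a learner so each call reports its elapsed time.
make_timed <- function(wrapper_name) {
  wrapper <- get(wrapper_name)
  function(...) {
    t0 <- proc.time()
    out <- wrapper(...)
    elapsed <- (proc.time() - t0)[["elapsed"]]
    message(sprintf("%s: %.2fs", wrapper_name, elapsed))
    out
  }
}

# Example:
# SL.glmnet_timed <- make_timed("SL.glmnet")
```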