Idea: some features may be irrelevant or redundant, and can thus be removed with minimal information loss
Techniques
- Filter methods (model-free selection, deterministic in nature); see the filter-method sketch after this list
- Low variance: ignore features whose values are (nearly) constant across all samples
- Missing values: ignore features with a significant portion of values missing
- Collinear features: remove features that are highly (anti-)correlated with other features; i.e. aims to minimize redundancy
- Fast algorithm (less than quadratic time complexity): Fast Correlation-Based Filter (FCBF)
- Univariate feature selection: remove features that have the lowest predictive value for the target (using statistical tests); considers each feature individually, so redundant features may show similarly high "importance"
- Model-based methods (stochastic in nature; depend on training initialization and the kind of model used); see the sketches after this list
- Recursive feature elimination
- Given an estimator that assigns weights to features (like the coefficients of a linear model), recursively remove the least important feature until the desired number of features is reached
- L1-based feature selection (more details)
- A linear model whose coefficients are penalized with the L1-norm is fit, which results in sparse solutions (only a few features get used)
- Tree-based feature selection (more details):
- Train a tree-based model like random forest or gradient boosted trees
- For a given decision tree, it is straightforward to compute an "importance score" for each feature, based on how well it split the training data (decrease in impurity/entropy).
- Limitation: the scores are derived from the training set, so if the model overfits, "important" features may not be useful for generalization
- Permutation importance (more details):
- Train a model (any model is fine), and compute the performance metric on a dataset (preferably the test set)
- Permute the values of a feature in the dataset, and compute the metric again.
- Repeat the above step several times for each feature.
- The feature importance is the average reduction in performance metric due to permutation.
- Limitation: may consider a pair of correlated features to each have low importance even though removing both features may have a huge negative impact.
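A minimal sketch of the filter methods above using scikit-learn; the synthetic dataset, variance threshold and value of k are illustrative assumptions, not prescriptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Toy dataset: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Low-variance filter: drop features that are (nearly) constant.
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Univariate selection: keep the 10 features with the best ANOVA F-score
# against the target; each feature is scored individually.
X_uni = SelectKBest(score_func=f_classif, k=10).fit_transform(X_var, y)

print(X.shape, X_var.shape, X_uni.shape)
```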
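A minimal sketch of the linear model-based techniques (recursive feature elimination and L1-based selection), again with scikit-learn; the estimators, penalty strength and target number of features are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Recursive feature elimination: repeatedly drop the feature with the
# smallest coefficient magnitude until 8 features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8)
X_rfe = rfe.fit_transform(X, y)

# L1-based selection: an L1-penalized linear model yields sparse coefficients;
# features whose coefficients are (near) zero are dropped.
l1_model = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

print(X_rfe.shape, X_l1.shape)
```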
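A minimal sketch contrasting tree-based (impurity) importances, computed on the training set, with permutation importance computed on a held-out set; the random forest and the number of permutation repeats are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances (derived from the training data, so they may be
# misleading if the model overfits).
print(model.feature_importances_)

# Permutation importance: shuffle each feature several times on the test set
# and record the average drop in the score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```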
Related:
Idea: transform arbitrary data (tabular, text or images) into a set of numerical features that are (much) easier to use in learning algorithms
Techniques
- One-hot encoding, to convert a categorical feature into numerical features (see the encoding sketch after this list)
- Principal component analysis (PCA), a linear transformation of the vector space where dimensions get ordered by variance; thus the first few dimensions capture most of the information (see the dimensionality-reduction sketch after this list)
- t-distributed Stochastic Neighbor Embedding (t-SNE), a non-linear projection to a lower-dimensional space where the distributions of pairwise similarities in the original and embedded spaces are made to match (using gradient descent)
- Text feature extraction, to convert text tokens into vectors. Examples: bag-of-words representation, word2vec, etc.
- Autoencoder, to convert complex data (such as images or text) into a set of vectors
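A minimal sketch of one-hot encoding and a bag-of-words text representation with scikit-learn; the toy category and text data are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# One-hot encoding: one binary column per category.
colors = [["red"], ["green"], ["blue"], ["green"]]
print(OneHotEncoder().fit_transform(colors).toarray())

# Bag of words: one count column per vocabulary token.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(docs).toarray())
print(vectorizer.get_feature_names_out())
```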
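A minimal dimensionality-reduction sketch running PCA and t-SNE on the same dataset (scikit-learn's digits set is used purely as an example), both projecting to 2 dimensions for visualization.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# PCA: orthogonal linear projection; components are ordered by explained variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# t-SNE: non-linear embedding that tries to preserve local neighborhood
# structure (pairwise similarities) via gradient descent.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_pca.shape, X_tsne.shape)
```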
Related:
A few broad (and somewhat overlapping) types of visualizations are listed below (a short plotting sketch follows the list):
- Categorical distribution visualization
- Compare distributions of a continuous feature (Y-axis) against values of a categorical feature (X-axis)
- Does mean/median/variance/etc. differ between subgroups?
- Use categorical plots like violin-plot or box-plot
- Categorical composition visualization
- Show the categorical makeup of data against a numerical or another categorical feature.
- How much does each subgroup contribute to the total? How does it change over time?
- Use pie chart, stacked bar chart, stacked area plot etc.
- Univariate distribution visualization
- Show the distribution of a single (continuous) feature
- Is the feature skewed, multi-modal or heavy-tailed?
- Use histogram, KDE plot or ECDF plot
- Correlation visualization
- Illustrate the correlation between features pair-wise
- Use bivariate plots, pair-wise correlation heatmap, pair-wise scatter plots or scatter matrix
- High-dimensional structure visualization
- Is there easily discernible structure (e.g. clusters) in the dataset?
- Use 2D or 3D scatter plot of PCA, Andrews Curves, parallel coordinates, RadViz
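A short plotting sketch covering some of the plot types above, assuming seaborn and matplotlib are installed; seaborn's bundled "tips" dataset is used only as an example.

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

# Categorical distribution: continuous feature vs. categorical feature.
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()

# Univariate distribution of a single continuous feature.
sns.histplot(data=tips, x="total_bill", kde=True)
plt.show()

# Pair-wise correlation heatmap over the numerical columns.
sns.heatmap(tips.select_dtypes("number").corr(), annot=True)
plt.show()

# Pair-wise scatter plots (scatter matrix).
sns.pairplot(tips, hue="day")
plt.show()
```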
Related:
- Estimate the probability that the model's prediction is indeed correct.
- Straightforward in the case of classification problems, where each class gets a score associated with it.
- A softmax on the logits can map the values into a sensible range, but are they representative of the underlying probability of model correctness? Most often not.
- We may assess the correctness of our confidence estimates using a calibration plot (a.k.a. reliability diagram).
- We may improve the correctness of our confidence estimates using various calibration techniques such as Platt scaling and isotonic regression (a short sketch follows).
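A minimal calibration sketch with scikit-learn: calibration_curve gives the points of a reliability diagram, and CalibratedClassifierCV applies post-hoc calibration (isotonic regression here; method="sigmoid" corresponds to Platt scaling). The dataset and the choice of a naive Bayes base model are illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated model: predicted probabilities are often over/under-confident.
raw = GaussianNB().fit(X_train, y_train)
prob_raw = raw.predict_proba(X_test)[:, 1]

# Reliability diagram data: mean predicted probability vs. observed frequency
# per bin; a well-calibrated model lies close to the diagonal.
frac_pos, mean_pred = calibration_curve(y_test, prob_raw, n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))

# Post-hoc calibration with isotonic regression (use method="sigmoid" for Platt scaling).
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)
prob_cal = calibrated.predict_proba(X_test)[:, 1]
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, prob_cal, n_bins=10)
print(list(zip(mean_pred_cal.round(2), frac_pos_cal.round(2))))
```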
Further reading:
- C. Guo et al. - On Calibration of Modern Neural Networks
- F. Küppers et al. - Confidence Calibration for Object Detection and Segmentation
- G. Pereyra et al. - Regularizing Neural Networks by Penalizing Confident Output Distributions
- An overview article on confidence calibration
These are tools used for:
- Model registry: a centralized model store, with versioning capabilities
- Data/feature registry: a centralized store for training/validation data (and perhaps feature engineering)
- Experiment tracking and logging: record parameters, metrics, models and results
- Workflow orchestration: to facilitate continuous integration and deployment
A good set of tools should facilitate:
- Diagnostics: result analysis, model comparison and debugging during training and validation
- Versioning and reproducibility: efficiently version and package source code, (frozen) dependencies, hyperparameters and training data associated with each trained model
- Easy collaboration between research, engineering and data teams
- Deployment and management of models using various scalable serving infrastructure
- Logging and monitoring resource usage, various properties of input/output, and errors, to detect bugs, data drift etc.
Example tools: MLflow, Kubeflow, Apache Airflow (workflow orchestration tool), AWS SageMaker, etc. (a minimal MLflow tracking sketch follows)
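A minimal experiment-tracking sketch with MLflow, logging parameters, a metric and the trained model for one run; the experiment name, model and hyperparameters are made-up examples.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name
with mlflow.start_run():
    n_estimators = 200
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)

    # Record parameters, metrics and the trained model so runs can be
    # compared and reproduced later from the tracking UI.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")
```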
Related: