Idea: some features may be irrelevant or redundant, and can thus be removed with minimal information loss
Techniques
- Filter methods (model-free selection, deterministic in nature); see the filter-method sketch after this list
- Low variance: ignore features whose values are (nearly) constant across all samples
- Missing values: ignore features with a significant portion of values missing
- Collinear features: remove features that are highly (anti-)correlated with other features; i.e. aims to minimize redundancy
- Fast algorithm (less than quadratic time complexity): Fast Correlation-Based Filter (FCBF)
- Univariate feature selection: remove features that have the lowest predictive value for the target (using statistical tests); considers each feature individually, so redundant features may show similarly high "importance"
- Model-based methods (stochastic in nature; depend on training initialization and the kind of model used); see the sketches after this list
- Recursive feature elimination
- Given an estimator that assigns weights to features (like the coefficients of a linear model), recursively remove the least important feature until the desired number of features is reached
- L1-based feature selection (more details)
- A linear model whose coefficients are penalized with the L1-norm is fit, which results in sparse solutions (only a few features get used)
- Tree-based feature selection (more details):
- Train a tree-based model like random forest or gradient boosted trees
- For a given decision tree, it is straightforward to compute an "importance score" for each feature, based on how well it split the training data (decrease in impurity/entropy).
- Limitation: the scores are derived from the training set, so if the model overfits, "important" features may not be useful for generalization
- Permutation importance (more details):
- Train a model (any model is fine), and compute the performance metric on a dataset (preferably the test set)
- Permute the values of a feature in the dataset, and compute the metric again.
- Repeat the above step several times for each feature.
- The feature importance is the average reduction in performance metric due to permutation.
- Limitation: may consider a pair of correlated features to each have low importance even though removing both features may have a huge negative impact.
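A minimal sketch of the filter methods above using scikit-learn; the synthetic dataset, variance threshold and value of k are illustrative assumptions, not prescriptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Toy dataset: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Low-variance filter: drop features that are (nearly) constant.
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Univariate selection: keep the 10 features with the best ANOVA F-score
# against the target; each feature is scored individually.
X_uni = SelectKBest(score_func=f_classif, k=10).fit_transform(X_var, y)

print(X.shape, X_var.shape, X_uni.shape)
```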
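A minimal sketch of the linear model-based techniques (recursive feature elimination and L1-based selection), again with scikit-learn; the estimators, penalty strength and target number of features are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Recursive feature elimination: repeatedly drop the feature with the
# smallest coefficient magnitude until 8 features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8)
X_rfe = rfe.fit_transform(X, y)

# L1-based selection: an L1-penalized linear model yields sparse coefficients;
# features whose coefficients are (near) zero are dropped.
l1_model = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

print(X_rfe.shape, X_l1.shape)
```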
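A minimal sketch contrasting tree-based (impurity) importances, computed on the training set, with permutation importance computed on a held-out set; the random forest and the number of permutation repeats are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances (derived from the training data, so they may be
# misleading if the model overfits).
print(model.feature_importances_)

# Permutation importance: shuffle each feature several times on the test set
# and record the average drop in the score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```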
Related:
Idea: transform arbitrary data (tabular, text or images) into a set of numerical features that are (much) easier to use in learning algorithms
Techniques
- One-hot encoding, to convert a categorical feature into numerical features (see the encoding sketch after this list)
- Principal component analysis (PCA), a linear transformation of the vector space where dimensions get ordered by variance; thus the first few dimensions capture most of the information (see the dimensionality-reduction sketch after this list)
- t-distributed Stochastic Neighbor Embedding (t-SNE), a non-linear projection to a lower-dimensional space where the distributions of pairwise similarities in the original and embedded spaces are made to match (using gradient descent)
- Text feature extraction, to convert text tokens into vectors. Examples: bag-of-words representation, word2vec, etc.
- Autoencoder, to convert complex data (such as images or text) into a set of vectors
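A minimal sketch of one-hot encoding and a bag-of-words text representation with scikit-learn; the toy category and text data are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# One-hot encoding: one binary column per category.
colors = [["red"], ["green"], ["blue"], ["green"]]
print(OneHotEncoder().fit_transform(colors).toarray())

# Bag of words: one count column per vocabulary token.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(docs).toarray())
print(vectorizer.get_feature_names_out())
```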
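A minimal dimensionality-reduction sketch running PCA and t-SNE on the same dataset (scikit-learn's digits set is used purely as an example), both projecting to 2 dimensions for visualization.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# PCA: orthogonal linear projection; components are ordered by explained variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# t-SNE: non-linear embedding that tries to preserve local neighborhood
# structure (pairwise similarities) via gradient descent.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_pca.shape, X_tsne.shape)
```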
Related:
A few broad (and somewhat overlapping) types of visualizations are listed below (a short plotting sketch follows the list):
- Categorical distribution visualization
- Compare distributions of a continuous feature (Y-axis) against values of a categorical feature (X-axis)
- Does mean/median/variance/etc. differ between subgroups?
- Use categorical plots like violin-plot or box-plot
- Categorical composition visualization
- Show the categorical makeup of data against a numerical or another categorical feature.
- How much does each subgroup contribute to the total? How does it change over time?
- Use pie chart, stacked bar chart, stacked area plot etc.
- Univariate distribution visualization
- Show the distribution of a single (continuous) feature
- Is the feature skewed, multi-modal or heavy-tailed?
- Use histogram, KDE plot or ECDF plot
- Correlation visualization
- Illustrate the correlation between features pair-wise
- Use bivariate plots, pair-wise correlation heatmap, pair-wise scatter plots or scatter matrix
- High-dimensional structure visualization
- Is there easily discernible structure (e.g. clusters) in the dataset?
- Use 2D or 3D scatter plot of PCA, Andrews Curves, parallel coordinates, RadViz
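A short plotting sketch covering some of the plot types above, assuming seaborn and matplotlib are installed; seaborn's bundled "tips" dataset is used only as an example.

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

# Categorical distribution: continuous feature vs. categorical feature.
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()

# Univariate distribution of a single continuous feature.
sns.histplot(data=tips, x="total_bill", kde=True)
plt.show()

# Pair-wise correlation heatmap over the numerical columns.
sns.heatmap(tips.select_dtypes("number").corr(), annot=True)
plt.show()

# Pair-wise scatter plots (scatter matrix).
sns.pairplot(tips, hue="day")
plt.show()
```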
Related:
- Estimate the probability that the model's prediction is indeed correct.
- Straightforward in the case of classification problems, where each class gets a score associated with it.
- A softmax on the logits can map the values into a sensible range, but are they representative of the underlying probability of model correctness? Most often not.
- We may assess the correctness of our confidence estimates using a calibration plot (a.k.a. reliability diagram).
- We may improve the correctness of our confidence estimates using various calibration techniques such as Platt scaling and isotonic regression (a short sketch follows).
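A minimal calibration sketch with scikit-learn: calibration_curve gives the points of a reliability diagram, and CalibratedClassifierCV applies post-hoc calibration (isotonic regression here; method="sigmoid" corresponds to Platt scaling). The dataset and the choice of a naive Bayes base model are illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated model: predicted probabilities are often over/under-confident.
raw = GaussianNB().fit(X_train, y_train)
prob_raw = raw.predict_proba(X_test)[:, 1]

# Reliability diagram data: mean predicted probability vs. observed frequency
# per bin; a well-calibrated model lies close to the diagonal.
frac_pos, mean_pred = calibration_curve(y_test, prob_raw, n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))

# Post-hoc calibration with isotonic regression (use method="sigmoid" for Platt scaling).
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)
prob_cal = calibrated.predict_proba(X_test)[:, 1]
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, prob_cal, n_bins=10)
print(list(zip(mean_pred_cal.round(2), frac_pos_cal.round(2))))
```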
Further reading:
- C. Guo et al. - On Calibration of Modern Neural Networks
- F. Küppers et al. - Confidence Calibration for Object Detection and Segmentation
- G. Pereyra et al. - Regularizing Neural Networks by Penalizing Confident Output Distributions
- An overview article on confidence calibration
These are tools used for:
- Model registry: a centralized model store, with versioning capabilities
- Data/feature registry: a centralized store for training/validation data (and perhaps feature engineering)
- Experiment tracking and logging: record parameters, metrics, models and results
- Workflow orchestration: to facilitate continuous integration and deployment
A good set of tools should facilitate:
- Diagnostics: result analysis, model comparison and debugging during training and validation
- Versioning and reproducibility: efficiently version and package source code, (frozen) dependencies, hyperparameters and training data associated with each trained model
- Easy collaboration between research, engineering and data teams
- Deployment and management of models using various scalable serving infrastructure
- Logging and monitoring resource usage, various properties of input/output, and errors, to detect bugs, data drift etc.
Example tools: MLflow, Kubeflow, Apache Airflow (workflow orchestration tool), AWS SageMaker, etc. (a minimal MLflow tracking sketch follows)
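A minimal experiment-tracking sketch with MLflow, logging parameters, a metric and the trained model for one run; the experiment name, model and hyperparameters are made-up examples.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name
with mlflow.start_run():
    n_estimators = 200
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)

    # Record parameters, metrics and the trained model so runs can be
    # compared and reproduced later from the tracking UI.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")
```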
Related: