May 2023

Feature Selection

Idea: some features may be irrelevant or redundant, and can thus be removed with minimal information loss

Techniques

  • Filter methods (model-free selection, deterministic in nature; see the first sketch after this list)
    • Low variance: ignore features that have the same (or nearly the same) value in all samples
    • Missing values: ignore features with a significant portion of values missing
    • Collinear features: remove features that are highly (anti-)correlated with other features; i.e., this minimizes redundancy
    • Univariate feature selection: remove features that have the lowest predictive value for the target (using statistical tests); each feature is considered individually, so redundant features may show similarly high "importance"
  • Model-based methods (stochastic in nature, and dependent on the training initialization and the kind of model used; see the second sketch after this list)
    • Recursive feature elimination
      • Given an estimator that assigns weights to features (like the coefficients of a linear model), recursively remove the least important feature until the desired number of features is obtained
    • L1-based feature selection (more details)
      • A linear model whose coefficients are penalized with the L1-norm is fit, which results in sparse solutions (only a few features get used)
    • Tree-based feature selection (more details):
      • Train a tree-based model like random forest or gradient boosted trees
      • For a given decision tree, it is straightforward to compute an "importance score" for each feature, based on how well it splits the training data (decrease in impurity/entropy).
      • Limitation: derived from the train set, so if the model overfits, "important" features may not be useful for generalization
    • Permutation importance (more details; see the third sketch after this list):
      • Train a model (any model is fine), and compute the performance metric on a dataset (preferably the test set).
      • Permute the values of a feature in the dataset, and compute the metric again.
      • Repeat the above step several times for each feature.
      • The feature importance is the average reduction in the performance metric due to permutation.
      • Limitation: may assign low importance to each of a pair of correlated features, even though removing both would have a large negative impact.
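
A minimal sketch of the filter methods above, using scikit-learn; the dataset and k=10 are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Low variance: drop features that are constant across all samples.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Univariate selection: keep the 10 features with the highest ANOVA F-score.
X_best = SelectKBest(score_func=f_classif, k=10).fit_transform(X_var, y)
print(X.shape, "->", X_best.shape)  # (569, 30) -> (569, 10)
```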
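
A corresponding sketch of the model-based methods; all hyperparameters below (such as C=0.1 and threshold="median") are illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursive feature elimination: drop the weakest feature each round
# until only the requested number of features remains.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# L1-based: an L1-penalized linear model zeroes out weak coefficients,
# and SelectFromModel keeps only the features with nonzero weights.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

# Tree-based: keep features with above-median impurity-based importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
X_tree = SelectFromModel(forest, threshold="median").fit_transform(X, y)
```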
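
And a sketch of permutation importance using scikit-learn's implementation; the model choice and n_repeats=10 are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and average the drop in the metric.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
print(result.importances_mean)  # per-feature average score reduction
```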

Related:

Feature Extraction

Idea: transform arbitrary data (tabular, text or images) into a set of numerical features that are (much) easier to use in learning algorithms

Techniques

  • One-hot encoding, to convert a categorical feature into numerical features (see the sketch after this list)
  • Principal component analysis (PCA), a linear transformation of the vector space where the dimensions get ordered by variance; thus the first few dimensions capture most of the information
  • t-distributed Stochastic Neighbor Embedding (t-SNE), a non-linear projection to a lower-dimensional space where the distributions of pairwise similarities in the two spaces are made similar (using gradient descent)
  • Text feature extraction, to convert text tokens to vectors. Examples: bag-of-words representation, word2vec, etc.
  • Autoencoder, to convert complex data (such as images or text) into a set of vectors
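
A minimal sketch of two of these techniques, one-hot encoding and PCA, on made-up data (the sparse_output argument assumes scikit-learn >= 1.2):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# One-hot encoding: one binary column per category.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)

# PCA: project 20-dimensional data onto its top-2 variance directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X_2d = PCA(n_components=2).fit_transform(X)
```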

Related:

Data Visualization

A few broad (and somewhat overlapping) types of visualizations are:

Related:

Confidence Estimation

  • Estimate the probability that the model's prediction is indeed correct.
  • Straightforward in the case of classification problems, where each class gets a score associated with it.
    • A softmax on the logits can map the values into a sensible range, but are they representative of the underlying probability of model correctness? Most often not.
  • We may assess the correctness of our confidence estimates using a calibration plot (a.k.a. reliability diagram).
  • We may improve our confidence estimates using various calibration techniques such as Platt scaling and isotonic regression (see the sketch below).
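
A sketch of both steps using scikit-learn; the classifier and bin count are illustrative, cv="prefit" assumes scikit-learn < 1.6, and in practice calibration should be fit on a split separate from the test set:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:, 1]

# Reliability diagram data: observed positive fraction vs. mean
# predicted probability, per bin; plot one against the other.
frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=10)

# Recalibrate with isotonic regression (use method="sigmoid" for
# Platt scaling); ideally fit on a held-out calibration split.
calibrated = CalibratedClassifierCV(clf, method="isotonic", cv="prefit")
calibrated.fit(X_test, y_test)
```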

Further reading:

MLOps Tools

These are tools used for:

  • Model registry: a centralized model store, with versioning capabilities
  • Data/feature registry: a centralized store for training/validation data (and perhaps feature engineering)
  • Experiment tracking and logging: record parameters, metrics, models and results
  • Workflow orchestration: to facilitate continuous integration and deployment

A good set of tools should facilitate:

  • Diagnostics: result analysis, model comparison and debugging during training and validation
  • Versioning and reproducibility: efficiently version and package source code, (frozen) dependencies, hyperparameters and training data associated with each trained model
  • Easy collaboration between research, engineering and data teams
  • Deployment and management of models using various scalable serving infrastructure
  • Logging and monitoring of resource usage, various properties of inputs/outputs, and errors, to detect bugs, data drift, etc.

Example tools: MLflow, Kubeflow, Apache Airflow (a workflow orchestration tool), AWS SageMaker, etc.
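
As an example, a minimal experiment-tracking sketch with MLflow; the run, parameter, and metric names are made up, and log_artifact assumes a model.pkl file exists on disk:

```python
import mlflow

# Each run records the hyperparameters, metrics, and files
# associated with one training attempt.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)   # a hyperparameter
    mlflow.log_metric("val_accuracy", 0.93)   # a result for this run
    mlflow.log_artifact("model.pkl")          # any file, e.g. the model
```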

Related: