
Probability calibration for unbalanced data classification

Dataset: Link

Two probability calibration methods, Platt scaling and Beta calibration, are used to calibrate the scores from three models. The three models tried for binary classification on this unbalanced dataset are a Random Forest, a neural network with three dense layers, and a Logistic Regression.
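The two calibration maps can be sketched in a few lines. This is not the notebook's code: the synthetic scores, the function names, and the plain gradient-descent fit are all assumptions made for illustration, and the sketch skips the target smoothing of the original Platt method and the sign constraints of the original Beta calibration. Platt scaling fits `sigmoid(a*s + b)` to the raw score `s`, while Beta calibration fits `sigmoid(a*ln(s) - b*ln(1-s) + c)`.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(probs, labels, eps=1e-12):
    # Mean negative log-likelihood of binary labels.
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def platt_features(s):
    # Platt scaling: logit = a*s + b
    return [s, 1.0]


def beta_features(s, eps=1e-6):
    # Beta calibration: logit = a*ln(s) - b*ln(1-s) + c
    s = min(max(s, eps), 1.0 - eps)
    return [math.log(s), -math.log(1.0 - s), 1.0]


def predict(weights, rows):
    return [sigmoid(sum(w * x for w, x in zip(weights, row))) for row in rows]


def fit_logistic(rows, labels, init, steps=200):
    # Gradient descent on the log-loss; the step is halved until the loss
    # does not increase, so the loss never gets worse than at `init`.
    weights = list(init)
    loss = log_loss(predict(weights, rows), labels)
    lr = 1.0
    for _ in range(steps):
        probs = predict(weights, rows)
        grad = [
            sum((p - y) * row[j] for row, p, y in zip(rows, probs, labels)) / len(rows)
            for j in range(len(weights))
        ]
        while lr > 1e-8:
            trial = [w - lr * g for w, g in zip(weights, grad)]
            trial_loss = log_loss(predict(trial, rows), labels)
            if trial_loss <= loss:
                weights, loss = trial, trial_loss
                break
            lr *= 0.5
    return weights


# Synthetic, unbalanced toy data (an assumption, not the repository's dataset):
# the "model" reports sqrt(p), which over-estimates the true probability p.
rng = random.Random(0)
true_p = [sigmoid(rng.gauss(-2.0, 1.5)) for _ in range(500)]
labels = [1 if rng.random() < p else 0 for p in true_p]
scores = [math.sqrt(p) for p in true_p]

platt_rows = [platt_features(s) for s in scores]
beta_rows = [beta_features(s) for s in scores]
platt_w = fit_logistic(platt_rows, labels, init=[1.0, 0.0])
beta_w = fit_logistic(beta_rows, labels, init=[1.0, 1.0, 0.0])  # identity map

platt_cal = predict(platt_w, platt_rows)
beta_cal = predict(beta_w, beta_rows)

print("raw   log-loss:", log_loss(scores, labels))
print("Platt log-loss:", log_loss(platt_cal, labels))
print("Beta  log-loss:", log_loss(beta_cal, labels))
```

In practice the calibrator should be fitted on held-out data rather than on the data used to train the classifier; the sketch fits and evaluates on the same toy sample only to keep it short.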

Bayesian Neural Network

A simple BNN for multiclass classification is demonstrated. Dataset: Data

The notebook outlines some of the theory behind BNNs at a high level, with a reference given to Link. A BNN is a useful tool when the amount of available training data is small and we want to avoid overfitting. It can also estimate the uncertainty in its predictions, which makes BNNs popular in critical applications.

Feature Selection: manually or using trees

A simple example of manual feature selection through data exploration of numerical and categorical variables on a small dataset. If the dataset or the number of features is too large, the manual approach becomes difficult. Tree-based methods such as random forest can then be used to score the features by their importance. The notebook shows a simple example on this small dataset.
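As a rough illustration of tree-based scoring, the sketch below ranks features by the impurity-based importances of a scikit-learn random forest. The breast cancer dataset and the hyper-parameters are assumptions chosen so the example runs on its own; they are not necessarily what the notebook uses.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset bundled with scikit-learn (assumption, not the notebook's data).
data = load_breast_cancer()

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ holds one impurity-based score per feature.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")
```

Impurity-based importances tend to favor high-cardinality features, so it can be worth cross-checking the ranking with permutation importance before dropping anything.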

XGBoost

A simple example of using XGBoost on two standard datasets from sklearn. XGBoost is a more recent, advanced tree-based algorithm that can be used for regression, classification, and feature importance ranking/selection. It often performs better than random forest.

Two issues with XGBoost:

Unlike random forest, it may not be easy to parallelize on very large distributed data.
It has many hyper-parameters to tune.

AutoGluon

AutoGluon is an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models on an unprocessed tabular dataset such as a CSV file (see: Paper, AWS blog). It can use many different models and automatically stack them into an ensemble model. It could be a solution to many problems, or at least a first step, if the models make sense for the data and the problem. Good feature engineering before using the model could be critical.

Dataset: Bike Sharing Demand

This is a quick test of AutoGluon on SageMaker; not much time was spent on feature selection and cleaning.

LSTM Model for Non-Linear Regression

A simple LSTM-based model for non-linear regression is tested on two datasets. The model is implemented with PyTorch.

First, the model is tested on the AirPassenger dataset.
Second, the model is tested on predicting non-linear motion in the presence of noise.

This type of model seems like it could be a good alternative to traditional estimators such as Kalman filters.

I found this illustration of LSTM very cool. Note that the explanation can be confusing when implementing an LSTM with torch, as the terms differ from those in Keras.
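To make the PyTorch naming concrete, here is a minimal sketch of an LSTM regressor trained on a noisy sine wave. The class name `LSTMRegressor`, the toy series, the window length, and the hyper-parameters are illustrative assumptions, not the notebook's model. In PyTorch the layer width is `hidden_size` (Keras calls it `units`), and inputs are shaped `(batch, sequence, feature)` only when `batch_first=True` is set.

```python
import torch
from torch import nn


class LSTMRegressor(nn.Module):
    """Hypothetical one-step-ahead regressor: LSTM followed by a linear head."""

    def __init__(self, input_size=1, hidden_size=32, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # output: (batch, seq_len, hidden_size), one vector per time step
        # h_n, c_n: (num_layers, batch, hidden_size), final hidden and cell states
        output, (h_n, c_n) = self.lstm(x)
        return self.head(output[:, -1, :])  # predict from the last time step


torch.manual_seed(0)

# Toy data (assumption): a sine wave with additive Gaussian noise.
t = torch.linspace(0, 20, 400)
series = torch.sin(t) + 0.05 * torch.randn_like(t)

# Sliding windows of 20 past values, each predicting the next value.
window = 20
inputs = torch.stack(
    [series[i:i + window] for i in range(len(series) - window)]
).unsqueeze(-1)                          # (380, 20, 1)
targets = series[window:].unsqueeze(-1)  # (380, 1)

model = LSTMRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The sketch trains on the full series in one batch for brevity; a real experiment would hold out the tail of the series for evaluation.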

arvinemadi/Supervised_Learning
