Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans. Therefore, for this project we employed different techniques to train and evaluate models with unbalanced classes.
Dataset:
- LoanStats_2019Q1.csv

Software and IDE:
- Python
- Jupyter Notebook

Libraries:
- NumPy
- Scikit-learn
- Imbalanced-learn

Techniques:
- Ensemble
- Resampling
The imbalanced-learn and scikit-learn libraries were used to build and evaluate models with resampling. Using the credit card credit dataset from LendingClub, a peer-to-peer lending services company, we oversampled the data with the RandomOverSampler and SMOTE algorithms and undersampled it with the ClusterCentroids algorithm. We then applied a combinatorial approach of over- and undersampling using the SMOTEENN algorithm. In the next phase of the project, we compared two machine learning models that reduce bias, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk. Lastly, we evaluated the performance of these models and made a written recommendation on whether they should be used to predict credit risk.
Using our knowledge of the imbalanced-learn and scikit-learn libraries, we evaluated three machine learning models by using resampling to determine which is better at predicting credit risk. First, we used the oversampling RandomOverSampler and SMOTE algorithms, and then we used the undersampling ClusterCentroids algorithm. With each of these algorithms, we resampled the dataset, viewed the count of the target classes, trained a logistic regression classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
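Below is a minimal sketch of this resample-train-evaluate workflow, using RandomOverSampler as the example. The variable names `X_train`, `X_test`, `y_train`, and `y_test` are assumptions standing in for a train/test split of the cleaned LoanStats data; they are not taken verbatim from the project notebook.

```python
from collections import Counter

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler
from imblearn.metrics import classification_report_imbalanced

# Oversample the minority (high-risk) class so both classes are equally represented.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # view the count of the target classes

# Train a logistic regression classifier on the resampled data.
model = LogisticRegression(solver="lbfgs", random_state=1)
model.fit(X_resampled, y_resampled)

# Evaluate on the untouched test set.
y_pred = model.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```

The SMOTE and ClusterCentroids steps follow the same pattern, swapping in `imblearn.over_sampling.SMOTE` and `imblearn.under_sampling.ClusterCentroids` as the sampler. The results for each resampling algorithm were as follows.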
Naive Random Oversampling:
- Balanced Accuracy Score: 0.60
- Precision: 0.99
- Recall: 0.65
- F1: 0.78

SMOTE Oversampling:
- Balanced Accuracy Score: 0.69
- Precision: 0.99
- Recall: 0.66
- F1: 0.79

Cluster Centroids Undersampling:
- Balanced Accuracy Score: 0.50
- Precision: 0.99
- Recall: 0.48
- F1: 0.65
The next phase of the project was to implement a combinatorial approach of over- and undersampling with the SMOTEENN algorithm, to determine whether the combinatorial approach is better at predicting credit risk than the resampling algorithms from the previous step. Using the SMOTEENN algorithm, we resampled the dataset, viewed the count of the target classes, trained a logistic regression classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
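A minimal sketch of the SMOTEENN step, under the same assumptions about `X_train` and `y_train` as above; the downstream logistic regression training and evaluation calls are identical to the previous sketch.

```python
from collections import Counter
from imblearn.combine import SMOTEENN

# SMOTEENN first oversamples the minority class with SMOTE, then cleans the
# result with Edited Nearest Neighbours, dropping ambiguous boundary samples.
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # view the count of the target classes
```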
Lastly, we used the imblearn.ensemble library. We trained and compared two different ensemble classifiers, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk and evaluated each model. Using both algorithms, we resampled the dataset, viewed the count of the target classes, trained the ensemble classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
Balanced Random Forest Classifier:
- Balanced Accuracy Score: 0.79
- Precision: 0.99
- Recall: 0.89
- F1: 0.94

Easy Ensemble AdaBoost Classifier:
- Balanced Accuracy Score: 0.94
- Precision: 1.00
- Recall: 0.94
- F1: 0.97
Because credit risk analysis sits within the financial sector and plays a key role in the industry, sensitivity is more valuable than precision here when analyzing risk and setting rates for individuals: banks want to correctly mark high-risk individuals as high-risk and low-risk individuals as low-risk based on the selected factors. All six algorithms in this project scored very low on precision for high-risk individuals, the highest being the Easy Ensemble AdaBoost Classifier at 8%. This means that of all the customers marked as high-risk, only 8% actually were, a margin of error that could be detrimental to both the firms using these models and their customers.
In addition, precision alone does not give us enough information to compare the six algorithms, so the better comparison is sensitivity. The model with the highest sensitivity was the Easy Ensemble AdaBoost Classifier, with 94% recall for high-risk individuals and 94% for low-risk individuals; in other words, 94% of the actual high-risk individuals were marked as high-risk. The next best models by recall were the Balanced Random Forest Classifier (89%) and the SMOTEENN resampling model (66%).
Lastly, we analyzed the balanced accuracy score to make the final decision on which machine learning model to use. The balanced accuracy score reflects how correct the model is across both classes, averaging the recall for high-risk and low-risk loans so the result is not inflated by the large number of low-risk loans. The model with the highest balanced accuracy score was again the Easy Ensemble AdaBoost Classifier. This model is recommended for its high metrics across the board.