Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans. Therefore, for this project we employed different techniques to train and evaluate models with unbalanced classes.
Dataset:
- LoanStats_2019Q1.csv

Software and IDE:
- Python
- Jupyter Notebook

Libraries:
- NumPy
- Scikit-learn
- Imbalanced-learn

Techniques:
- Ensemble
- Resampling
The imbalanced-learn and scikit-learn libraries were used to build and evaluate models with resampling. Using the credit card credit dataset from LendingClub, a peer-to-peer lending services company, we oversampled the data with the RandomOverSampler and SMOTE algorithms and undersampled it with the ClusterCentroids algorithm. We then applied a combinatorial approach of over- and undersampling using the SMOTEENN algorithm. In the next phase of the project, we compared two machine learning models that reduce bias, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk. Lastly, we evaluated the performance of these models and made a written recommendation on whether they should be used to predict credit risk.
Using our knowledge of the imbalanced-learn and scikit-learn libraries, we evaluated three machine learning models by using resampling to determine which is better at predicting credit risk. First, we used the oversampling RandomOverSampler and SMOTE algorithms, and then we used the undersampling ClusterCentroids algorithm. With each of these algorithms, we resampled the dataset, viewed the count of the target classes, trained a logistic regression classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
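Below is a minimal sketch of this resample-train-evaluate workflow, using RandomOverSampler as the example. The variable names `X_train`, `X_test`, `y_train`, and `y_test` are assumptions standing in for a train/test split of the cleaned LoanStats data; they are not taken verbatim from the project notebook.

```python
from collections import Counter

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler
from imblearn.metrics import classification_report_imbalanced

# Oversample the minority (high-risk) class so both classes are equally represented.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # view the count of the target classes

# Train a logistic regression classifier on the resampled data.
model = LogisticRegression(solver="lbfgs", random_state=1)
model.fit(X_resampled, y_resampled)

# Evaluate on the untouched test set.
y_pred = model.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```

The SMOTE and ClusterCentroids steps follow the same pattern, swapping in `imblearn.over_sampling.SMOTE` and `imblearn.under_sampling.ClusterCentroids` as the sampler. The results for each resampling algorithm were as follows.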
Naive Random Oversampling:
- Balanced Accuracy Score: 0.60
- Precision: 0.99
- Recall: 0.65
- F1: 0.78

SMOTE Oversampling:
- Balanced Accuracy Score: 0.69
- Precision: 0.99
- Recall: 0.66
- F1: 0.79

Cluster Centroids Undersampling:
- Balanced Accuracy Score: 0.50
- Precision: 0.99
- Recall: 0.48
- F1: 0.65
The next phase of the project was to implement a combinatorial approach of over- and undersampling with the SMOTEENN algorithm, to determine whether the combinatorial approach is better at predicting credit risk than the resampling algorithms from the previous step. Using the SMOTEENN algorithm, we resampled the dataset, viewed the count of the target classes, trained a logistic regression classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
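A minimal sketch of the SMOTEENN step, under the same assumptions about `X_train` and `y_train` as above; the downstream logistic regression training and evaluation calls are identical to the previous sketch.

```python
from collections import Counter
from imblearn.combine import SMOTEENN

# SMOTEENN first oversamples the minority class with SMOTE, then cleans the
# result with Edited Nearest Neighbours, dropping ambiguous boundary samples.
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # view the count of the target classes
```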
Lastly, we used the imblearn.ensemble library. We trained and compared two different ensemble classifiers, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk and evaluated each model. Using both algorithms, we resampled the dataset, viewed the count of the target classes, trained the ensemble classifier, calculated the balanced accuracy score, generated a confusion matrix, and generated a classification report.
Balanced Random Forest Classifier:
- Balanced Accuracy Score: 0.79
- Precision: 0.99
- Recall: 0.89
- F1: 0.94

Easy Ensemble AdaBoost Classifier:
- Balanced Accuracy Score: 0.94
- Precision: 1.00
- Recall: 0.94
- F1: 0.97
Because credit risk analysis sits within the financial sector and plays a key role in the industry, sensitivity is more valuable than precision here when analyzing risk and setting rates for individuals: banks want to correctly mark high-risk individuals as high-risk and low-risk individuals as low-risk based on the selected factors. All six algorithms in this project scored very low on precision for high-risk individuals, the highest being the Easy Ensemble AdaBoost Classifier at 8%. This means that of all the customers marked as high-risk, only 8% actually were, a margin of error that could be detrimental to both the firms using these models and their customers.
In addition, precision alone does not give us enough information to compare the six algorithms, so the better comparison is sensitivity. The model with the highest sensitivity was the Easy Ensemble AdaBoost Classifier, with 94% recall for high-risk individuals and 94% for low-risk individuals; in other words, 94% of the actual high-risk individuals were marked as high-risk. The next best models by recall were the Balanced Random Forest Classifier (89%) and the SMOTEENN resampling model (66%).
Lastly, we analyzed the balanced accuracy score to make the final decision on which machine learning model to use. The balanced accuracy score reflects how correct the model is across both classes, averaging the recall for high-risk and low-risk loans so the result is not inflated by the large number of low-risk loans. The model with the highest balanced accuracy score was again the Easy Ensemble AdaBoost Classifier. This model is recommended for its high metrics across the board.