Uses Python and the Scikit-learn machine learning library to build and evaluate several machine learning models that predict credit risk.

g626s/Credit_Risk_Analysis

Credit_Risk_Analysis

Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans. Therefore, for this project we employed different techniques to train and evaluate models with unbalanced classes.

Resources

Dataset:
  • LoanStats_2019Q1.csv
Software and IDE:
  • Python
  • Jupyter Notebook
  • Libraries:
    • NumPy
    • Scikit-learn
    • Imbalanced-learn
  • Techniques:
    • Ensemble
    • Resampling

Overview of the loan prediction risk analysis

The imbalanced-learn and scikit-learn libraries were used to build and evaluate models with resampling. Using the credit card credit dataset from LendingClub, a peer-to-peer lending services company, we oversampled the data with the RandomOverSampler and SMOTE algorithms and undersampled it with the ClusterCentroids algorithm. We then applied a combined over- and undersampling approach with the SMOTEENN algorithm. In the next phase of the project, we compared two ensemble machine learning models that reduce bias, BalancedRandomForestClassifier and EasyEnsembleClassifier, for predicting credit risk. Lastly, we evaluated the performance of these models and made a written recommendation on whether they should be used to predict credit risk.

Results

Using the imbalanced-learn and scikit-learn libraries, we evaluated three resampling algorithms to determine which is better at predicting credit risk. First, we used the oversampling RandomOverSampler and SMOTE algorithms, and then we used the undersampling ClusterCentroids algorithm. For each algorithm, we resampled the dataset, viewed the count of the target classes, trained a logistic regression classifier, calculated the balanced accuracy score, and generated a confusion matrix and a classification report.

  • Naive Random Oversampling:

    • Balanced Accuracy Score: 0.60
    • Precision: 0.99
    • Recall: 0.65
    • F1: 0.78
  • SMOTE Oversampling:

    • Balanced Accuracy Score: 0.69
    • Precision: 0.99
    • Recall: 0.66
    • F1: 0.79
  • Cluster Centroid Undersampling:

    • Balanced Accuracy Score: 0.50
    • Precision: 0.99
    • Recall: 0.48
    • F1: 0.65

The next phase of the project was to apply a combined over- and undersampling approach with the SMOTEENN algorithm, to determine whether it predicts credit risk better than the resampling algorithms from the previous step. Using the SMOTEENN algorithm, we resampled the dataset, viewed the count of the target classes, trained a logistic regression classifier, calculated the balanced accuracy score, and generated a confusion matrix and a classification report.

  • SMOTEENN Combination Sampling:

    • Balanced Accuracy Score: 0.61
    • Precision: 0.99
    • Recall: 0.59
    • F1: 0.73

Lastly, we used the imblearn.ensemble module. We trained and compared two ensemble classifiers, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk, and evaluated each model. For both algorithms, we trained the ensemble classifier, calculated the balanced accuracy score, and generated a confusion matrix and a classification report.

  • Balanced Random Forest Classifier:

    • Balanced Accuracy Score: 0.79
    • Precision: 0.99
    • Recall: 0.89
    • F1: 0.94
  • Easy Ensemble AdaBoost Classifier:

    • Balanced Accuracy Score: 0.94
    • Precision: 1.00
    • Recall: 0.94
    • F1: 0.97

Summary

Credit risk is a key part of the financial sector, and in this case sensitivity (recall) is more valuable than precision for assessing risk and rates for individuals. Banks want to flag high-risk individuals as high-risk, and low-risk individuals as low-risk, based on selected factors. All six algorithms in this project scored very low on precision for high-risk individuals; the highest was the Easy Ensemble AdaBoost Classifier at 8%. In other words, of all the customers marked as high-risk, only 8% were actually high-risk. That could be detrimental to firms using these models, since it introduces inaccuracies and margins of error for both the firms and their customers.
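
To see how a model can reach high recall yet only single-digit precision on the high-risk class, consider the arithmetic below. The class sizes are hypothetical, chosen only to illustrate the effect of class imbalance; they are not taken from the project's test set.

```python
# Hypothetical test-set sizes (assumption, for illustration only).
n_high_risk = 100
n_low_risk = 17_000

# Suppose the model recalls 94% of each class.
recall_high = 0.94
recall_low = 0.94

tp = round(n_high_risk * recall_high)  # high-risk loans correctly flagged
fn = n_high_risk - tp                  # high-risk loans missed
tn = round(n_low_risk * recall_low)    # low-risk loans correctly cleared
fp = n_low_risk - tn                   # low-risk loans wrongly flagged

# Precision for the high-risk class: share of flagged loans that are truly high-risk.
precision_high = tp / (tp + fp)

print(f"flagged as high-risk: {tp + fp}, truly high-risk among them: {tp}")
print(f"high-risk precision: {precision_high:.3f}")
```

Because low-risk loans vastly outnumber high-risk ones in this example, even a 6% false-positive rate produces far more false alarms than true detections, which pulls precision down to roughly 8%.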

In addition, precision alone does not give us enough information to compare the six algorithms, so the comparison is better made on sensitivity. The model with the highest sensitivity was the Easy Ensemble AdaBoost Classifier, with a recall of 94% for high-risk and 94% for low-risk individuals. In other words, 94% of the high-risk individuals were marked as high-risk. The next-highest recall scores belonged to the Balanced Random Forest Classifier (89%) and SMOTE oversampling (66%).

Lastly, we analyzed the balanced accuracy score to make the final decision on which machine learning model to use. The balanced accuracy score is the average of the recall obtained on each class, so it reflects how well the model classifies both high-risk and low-risk loans rather than only the majority class. The model with the highest balanced accuracy score was again the Easy Ensemble AdaBoost Classifier. This model is recommended for its high scores across all metrics.
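
The difference between plain accuracy and balanced accuracy can be illustrated with a small sketch. The labels below are hypothetical (100 high-risk and 17,000 low-risk loans), and the "model" simply predicts low-risk for everyone, which is an assumption made to show the extreme case.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 1 = high-risk, 0 = low-risk (assumption, for illustration only).
y_true = np.array([1] * 100 + [0] * 17_000)

# A trivial model that labels every loan as low-risk.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)

# Recall per class, in label order: index 0 is low-risk, index 1 is high-risk.
per_class_recall = recall_score(y_true, y_pred, average=None)

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}")
print(f"recall per class: {per_class_recall}")
print(f"balanced accuracy: {balanced:.2f}")
```

Plain accuracy rewards this trivial model because almost every loan is low-risk, while balanced accuracy exposes that it never detects a high-risk loan. This is why balanced accuracy is the more informative score for an imbalanced problem like credit risk.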
