Credit card fraud detection based on the Kaggle dataset. Clustering, Logistic Regression, Random Forest, and XGBoost were applied and tested, along with sampling techniques for balancing the data.
- Features V1 to V28 are principal components obtained with PCA, so they have already been transformed and are treated as scaled. Only Time and Amount still need to be scaled.
- The F1-score (the harmonic mean of precision and recall) is a suitable metric for imbalanced data when the positive class, here Fraud, matters most. Unlike accuracy, it is not inflated by the large number of correctly classified negatives.
- The dataset is highly imbalanced, so a model can easily overfit to the Non-Fraud (majority) class and predict it almost every time. The main techniques used to counter this were random under-sampling of the majority class and SMOTE for over-sampling the minority class.
- Fraud transactions can be natural outliers compared to Non-Fraud transactions. Be careful with anomaly detection, and especially with outlier removal, which can discard the very cases the model needs to learn.
- Split the data into train and test sets before applying any sampling technique, and apply sampling only to the train data, so the test set keeps the original class distribution.
- Finally, be cautious when combining sampling with cross-validation: resampling should happen inside each training fold, not before the folds are created. If applied incorrectly, it causes data leakage.
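
The notes on splitting, sampling, and the F1-score can be illustrated with a minimal sketch. It uses only the Python standard library and a toy label list (1,000 transactions, 20 of them Fraud) rather than the real dataset. The helper names `stratified_split`, `random_under_sample`, and `f1_score` are hypothetical and written for this illustration; they are not the project's actual code, and SMOTE is not shown.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def random_under_sample(indices, labels, seed=0):
    """Down-sample every class within `indices` to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    kept = []
    for cls in sorted(by_class):
        kept += rng.sample(by_class[cls], n_min)
    return sorted(kept)


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels (assumption): 1 = Fraud, 0 = Non-Fraud.
labels = [1] * 20 + [0] * 980

# 1) Split first, so the test set keeps the original imbalance.
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

# 2) Resample the training indices only; the test indices are never touched.
balanced_train_idx = random_under_sample(train_idx, labels, seed=42)

train_counts = Counter(labels[i] for i in balanced_train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("balanced train:", dict(train_counts))
print("untouched test:", dict(test_counts))

# 3) A baseline that always predicts Non-Fraud looks good on accuracy
#    but scores zero on F1, which is why F1 is the metric of interest.
y_test = [labels[i] for i in test_idx]
y_pred = [0] * len(y_test)
accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)
baseline_f1 = f1_score(y_test, y_pred)
print("accuracy:", accuracy, "F1:", baseline_f1)
```

In a cross-validation setting the same idea applies per fold: step 2 would be repeated on the training part of each fold, leaving each validation part at its original class ratio.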