This project aims to build a spam detection system using machine learning. It classifies messages into two categories: "Spam" or "Ham" (non-spam). The project uses a dataset of labeled messages, preprocesses the text data, trains a logistic regression model, and evaluates its performance.
The project requires the following Python packages:
- `pandas` for data manipulation
- `matplotlib` for plotting
- `seaborn` for visualization
- `scikit-learn` for machine learning
You can install the required packages using pip:
pip install pandas matplotlib seaborn scikit-learn
The dataset used in this project is `spam.csv`, which contains labeled messages. The dataset should be placed in the same directory as the script. The dataset should have the following columns:
- `label`: Indicates whether the message is spam (`spam`) or not (`ham`).
- `message`: The text of the message.
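A minimal sketch of how a file in this layout could be read and checked, assuming pandas and the two column names above. The helper name `load_messages`, the `target` column, and the inline sample rows are hypothetical and not taken from the project script:

```python
import io

import pandas as pd


def load_messages(source):
    """Read a CSV with 'label' and 'message' columns and add a binary target.

    `source` may be a file path such as "spam.csv" or any file-like object.
    """
    df = pd.read_csv(source)
    missing = {"label", "message"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    # Map text labels to integers: 0 for ham, 1 for spam.
    df["target"] = df["label"].map({"ham": 0, "spam": 1})
    return df


# Made-up rows standing in for spam.csv, so the sketch runs without the file.
sample_csv = io.StringIO(
    "label,message\n"
    "ham,See you at lunch\n"
    "spam,Win a free prize now\n"
)
df = load_messages(sample_csv)
print(df["label"].value_counts())
```

Passing `"spam.csv"` instead of the in-memory sample would read the real file, provided it already uses these column names.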
- Data Loading and Preprocessing
  - Loads the dataset from `spam.csv`.
  - Drops unnecessary columns and renames remaining columns.
  - Maps labels to binary values (0 for 'ham', 1 for 'spam').
- Visualization
  - Displays a pie chart showing the distribution of spam and ham messages.
- Data Splitting
  - Splits the dataset into training and testing sets.
- Feature Extraction
  - Initializes and applies TF-IDF Vectorizer to convert text data into numerical format.
- Model Training
  - Trains a Logistic Regression model using the training data.
- Evaluation
  - Evaluates the model's performance using accuracy, classification report, confusion matrix, and ROC curve.
- Testing
  - Tests the spam detection function with a few sample messages.
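The splitting, feature extraction, and training steps above can be sketched roughly as follows. This is an illustration under assumptions rather than the project's actual script: the eight messages are made up so the code runs without `spam.csv`, and the split ratio, `random_state`, and `max_iter` values are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny made-up corpus for illustration only; the real script reads spam.csv.
messages = [
    "Win a free prize now",
    "Claim your cash bonus today",
    "Free entry in a weekly competition",
    "Urgent! Call this number to claim your reward",
    "Are we still on for lunch?",
    "I will be home by six",
    "Can you send me the notes from class?",
    "Thanks, see you tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 0 = ham, 1 = spam

# Hold out a quarter of the data, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Fitting the vectorizer on the training split alone keeps vocabulary and IDF statistics from the test messages out of the model.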
To run the script, execute the following command in your terminal:
python spam_detection.py
The script will print the accuracy of the model and a classification report. It will also display plots for the confusion matrix and ROC curve. Additionally, it will print predictions for a set of test messages.
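The printed metrics can be produced with scikit-learn's `sklearn.metrics` functions. The sketch below uses hypothetical labels and predicted probabilities in place of real model output, and it omits the plots, so it shows only how the numbers might be computed:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Hypothetical true labels and predicted spam probabilities (0 = ham, 1 = spam).
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.8, 0.35, 0.2, 0.9, 0.6, 0.7]

# Threshold the probabilities at 0.5 to get hard predictions.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
auc = roc_auc_score(y_true, y_prob)    # AUC uses probabilities, not hard labels

print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
print("Confusion Matrix:")
print(cm)
print(f"AUC: {auc:.2f}")
```

With a fitted classifier, `y_prob` would typically come from `model.predict_proba(X_test)[:, 1]`.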
The following messages are used for testing the spam detection function:
- "Congratulations! You've won a free ticket to the Bahamas. Call now!"
- "Reminder: Your appointment is scheduled for tomorrow at 10 AM."
- "URGENT! Your account has been compromised. Please contact support immediately."
- "Hey, are we still meeting for lunch tomorrow?"
- "You have received a bonus of $500. Click here to claim your prize."
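One way such a detection function could look is sketched below. The name `detect_spam`, the use of `make_pipeline`, and the small made-up training set are assumptions for illustration; the project's own function may be structured differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration; the real model is trained on spam.csv.
train_texts = [
    "Win a free prize now",
    "Claim your cash bonus today",
    "Urgent! Call now to claim your reward",
    "Are we still on for lunch?",
    "I will be home by six",
    "Thanks, see you tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 0 = ham, 1 = spam

# Chain vectorizer and classifier so raw text can be passed straight to predict().
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)


def detect_spam(message, fitted_model=model):
    """Return "Spam" or "Ham" for a single message string."""
    prediction = fitted_model.predict([message])[0]
    return "Spam" if prediction == 1 else "Ham"


test_messages = [
    "Congratulations! You've won a free ticket to the Bahamas. Call now!",
    "Hey, are we still meeting for lunch tomorrow?",
]
for i, text in enumerate(test_messages, start=1):
    print(f"Message {i}: {text}")
    print(f"Prediction: {detect_spam(text)}")
```

With only six training messages the predictions are not meaningful; the sketch shows the shape of the function, not its expected results.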
Accuracy: 0.98
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| ham | 0.98 | 0.98 | 0.98 | 145 |
| spam | 0.98 | 0.98 | 0.98 | 55 |
| accuracy | | | 0.98 | 200 |
| macro avg | 0.98 | 0.98 | 0.98 | 200 |
| weighted avg | 0.98 | 0.98 | 0.98 | 200 |
Confusion Matrix:
[[143 2]
[ 2 53]]
ROC Curve:
AUC: 0.99
Message 1: Congratulations! You've won a free ticket to the Bahamas. Call now!
Prediction: Spam
Message 2: Reminder: Your appointment is scheduled for tomorrow at 10 AM.
Prediction: Ham
Message 3: URGENT! Your account has been compromised. Please contact support immediately.
Prediction: Spam
Message 4: Hey, are we still meeting for lunch tomorrow?
Prediction: Ham
Message 5: You have received a bonus of $500. Click here to claim your prize.
Prediction: Spam
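As a quick check, the reported accuracy can be recomputed from the confusion matrix in the example output. The sketch assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, in the order ham, spam:

```python
# Confusion matrix from the example output: rows are true labels (ham, spam),
# columns are predicted labels (ham, spam), assuming scikit-learn's layout.
confusion = [[143, 2],
             [2, 53]]

correct = confusion[0][0] + confusion[1][1]  # correctly classified ham + spam
total = sum(sum(row) for row in confusion)   # all test messages
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.2f}")  # prints "196 / 200 = 0.98"
```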
- Ensure that the `spam.csv` dataset file is present in the same directory as the script.
- The accuracy and performance metrics may vary depending on the dataset and model configuration.