A machine learning project predicting customer purchase behavior using multiple classification algorithms.
- Total Rows: 10,000
- Total Columns: 9
- customer_id: Unique customer identifier
- age: Customer's age
- gender: Customer's gender
- annual_income: Customer's annual income
- last_visited_days_ago: Days since last website visit
- session_duration: Time spent on website
- pages_visited: Number of pages browsed
- device: Device used for browsing
- purchase: Target variable (1 = purchase, 0 = no purchase)
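The sample below shows customer details, session activity, and purchase behavior for nine customers. Rows 1, 3, 5, and 7 list only eight values, so one field is missing in each and the remaining values appear shifted left: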
customer_id | age | gender | annual_income | last_visited_days_ago | session_duration | pages_visited | device | purchase |
---|---|---|---|---|---|---|---|---|
1 | 56 | male | 7 | 17 | 15 | desktop | 0 | |
2 | 69 | female | 47617 | 4 | 35 | 19 | mobile | 0 |
3 | 46 | male | 94258 | 30 | 15 | mobile | 0 | |
4 | 32 | female | 70075 | 19 | 4 | 12 | mobile | 0 |
5 | 60 | male | 146998 | 16 | 51 | mobile | 0 | |
6 | 25 | male | 42631 | 8 | 31 | 16 | desktop | 1 |
7 | 38 | female | 143120 | 31 | 6 | desktop | 1 | |
8 | 56 | male | 117158 | 24 | 9 | 20 | tablet | 0 |
9 | 36 | female | 158955 | 12 | 31 | 15 | desktop | 0 |
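As a quick sanity check on the schema described above, the following sketch loads the five complete rows of the sample table and inspects shape, missing values, and class balance. The inline CSV text is a stand-in: the source does not name the dataset file, so no path is assumed here.

```python
import io

import pandas as pd

# The five complete rows from the sample table, written as CSV text.
# In the project, this would be read from the dataset file instead
# (the file name is not given in the source).
csv_text = """customer_id,age,gender,annual_income,last_visited_days_ago,session_duration,pages_visited,device,purchase
2,69,female,47617,4,35,19,mobile,0
4,32,female,70075,19,4,12,mobile,0
6,25,male,42631,8,31,16,desktop,1
8,56,male,117158,24,9,20,tablet,0
9,36,female,158955,12,31,15,desktop,0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                                       # rows and columns
print(df.isna().sum())                                # missing values per column
print(df["purchase"].value_counts(normalize=True))    # class balance
```

On the full 10,000-row dataset, the same three checks would show how many values need imputing and how imbalanced the target is.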
- Logistic Regression: 48% accuracy
- Random Forest: 70% accuracy
- Gradient Boosting: 70% accuracy
- Selected Model: Random Forest
- Best Accuracy: 70%
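A comparison like the one above can be sketched as follows. This is a minimal illustration, not the project's training script: the data is synthetic (`make_classification`), and the split ratio, `random_state`, and default hyperparameters are assumptions, so the printed accuracies will not match the figures listed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the project's preprocessed features and labels.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model on the training split and record held-out accuracy.
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))

# Pick the model with the highest test accuracy.
best_name = max(scores, key=scores.get)
for name, acc in scores.items():
    print(f"{name}: {acc:.0%} accuracy")
print(f"Selected Model: {best_name}")
```

When two models tie on accuracy, as Random Forest and Gradient Boosting do here, `max` returns the first one in dictionary order; a secondary metric such as ROC AUC or training cost can break the tie more deliberately.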
- Python 3.8+
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- Clone the repository
git clone https://github.com/PRANAYBHUMAGOUNI/ML-Based-Customer-Purchase-Prediction-Model/
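- Create and activate a virtual environment (the activation command below is for Windows; on macOS/Linux run `source venv/bin/activate` instead)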
python -m venv venv
venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
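import pandas as pd

# Features for one new customer: the training columns without customer_id and purchase
new_customer = pd.DataFrame({
    'age': [69],
    'gender': ['female'],
    'annual_income': [47617],
    'last_visited_days_ago': [4],
    'session_duration': [35],
    'pages_visited': [19],
    'device': ['mobile']
})

# Prediction: `model` is the trained classifier from the training step
# (assumed to include the same preprocessing that was applied to the training data)
prediction = model.predict(new_customer)
# Class probabilities; with classes 0 and 1, column 1 is the probability of purchase
probability = model.predict_proba(new_customer)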
- Prediction: Will not purchase
- Purchase Probability: 21.48%
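To show how the two lines of output relate to `predict` and `predict_proba`, here is a small sketch. The helper `describe_prediction` is hypothetical (it is not part of the project), and the input arrays are hand-written to match the example output rather than produced by a trained model; the 0.7852 entry is simply 1 minus the reported purchase probability.

```python
import numpy as np


def describe_prediction(prediction, probability):
    """Format the outputs of predict / predict_proba for a single customer."""
    # With classes [0, 1], column 1 of predict_proba is the probability of class 1.
    purchase_probability = float(probability[0][1])
    label = "Will purchase" if int(prediction[0]) == 1 else "Will not purchase"
    return label, f"{purchase_probability:.2%}"


# Hand-written arrays shaped like scikit-learn's outputs for one row.
label, pct = describe_prediction(np.array([0]), np.array([[0.7852, 0.2148]]))
print(f"Prediction: {label}")
print(f"Purchase Probability: {pct}")
```

scikit-learn orders the `predict_proba` columns by `model.classes_`, so checking that attribute confirms which column holds the purchase probability.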
- days_since_visit_ratio: Ratio of days since last visit to session duration
- pages_per_minute: Pages visited per minute
- income_age_ratio: Annual income divided by age
- engagement_score: Combination of pages visited and session duration
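The four engineered features can be sketched with pandas as below. Two points are assumptions rather than facts from the project: `session_duration` is taken to be in minutes (implied by the name `pages_per_minute`), and `engagement_score` is computed as a product because the description only says "combination". The function name `add_engineered_features` is also hypothetical.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four engineered columns added."""
    out = df.copy()
    # Mask zero durations (they become NaN) to avoid division by zero.
    duration = out["session_duration"].where(out["session_duration"] != 0)
    out["days_since_visit_ratio"] = out["last_visited_days_ago"] / duration
    out["pages_per_minute"] = out["pages_visited"] / duration
    out["income_age_ratio"] = out["annual_income"] / out["age"]
    # Assumption: "combination" is implemented as a simple product here.
    out["engagement_score"] = out["pages_visited"] * out["session_duration"]
    return out


# Customer 2 from the sample table.
sample = pd.DataFrame({
    "age": [69],
    "annual_income": [47617],
    "last_visited_days_ago": [4],
    "session_duration": [35],
    "pages_visited": [19],
})
features = add_engineered_features(sample)
print(features.iloc[0].round(3))
```

Rows where the ratio features come out as NaN (zero or missing duration) would then be handled by the imputation step in preprocessing.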
- Handled missing values
- Encoded categorical variables
- Created engineered features
- Scaled numerical features
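One common way to express the imputing, encoding, and scaling steps in scikit-learn is a `ColumnTransformer`, sketched below. The imputation strategies (median, most frequent) and the choice of one-hot encoding are assumptions; the source does not say which were used. The demo frame uses rows 2, 4, 6, and 8 of the sample table, with one income deliberately blanked to exercise the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "annual_income", "last_visited_days_ago",
                "session_duration", "pages_visited"]
categorical_cols = ["gender", "device"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the mode, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

# Rows 2, 4, 6, 8 of the sample table; the second income is blanked on purpose.
demo = pd.DataFrame({
    "age": [69, 32, 25, 56],
    "gender": ["female", "female", "male", "male"],
    "annual_income": [47617, np.nan, 42631, 117158],
    "last_visited_days_ago": [4, 19, 8, 24],
    "session_duration": [35, 4, 31, 9],
    "pages_visited": [19, 12, 16, 20],
    "device": ["mobile", "mobile", "desktop", "tablet"],
})
X = preprocessor.fit_transform(demo)
print(X.shape)
```

Wrapping the preprocessor and a classifier in a single `Pipeline` keeps the imputation and scaling statistics fitted on training data only, which avoids leakage during cross-validation.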
- Used GridSearchCV for Random Forest and Gradient Boosting
- Explored parameters:
  - Number of estimators
  - Max depth
  - Min samples split
  - Learning rate (Gradient Boosting only)
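A grid search over these parameters might look like the sketch below. The grid values, the 3-fold cross-validation, and the synthetic data are all illustrative assumptions, not the project's settings. Note that `learning_rate` appears only in the Gradient Boosting grid, since `RandomForestClassifier` has no such parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with the project's preprocessed training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical grids; the values are not taken from the project.
searches = {
    "Random Forest": GridSearchCV(
        RandomForestClassifier(random_state=42),
        {"n_estimators": [50, 100],
         "max_depth": [None, 5],
         "min_samples_split": [2, 10]},
        cv=3,
        scoring="accuracy",
    ),
    "Gradient Boosting": GridSearchCV(
        GradientBoostingClassifier(random_state=42),
        {"n_estimators": [50, 100],
         "max_depth": [2, 3],
         "learning_rate": [0.05, 0.1]},
        cv=3,
        scoring="accuracy",
    ),
}

# Fit each search and report the best parameter set with its mean CV accuracy.
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, f"{search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is already refit on the full training data with the best parameters and can be used directly for prediction.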
- Collect more diverse data
- Experiment with advanced ensemble methods
- Incorporate more complex feature engineering
- Try deep learning approaches