This repository contains the code and components for our Data Science Final Project, which focuses on building a full data pipeline for dataset exploration, preprocessing, model training, and hyperparameter tuning. Our project compares the performance of various machine learning models, including Neural Networks (NN), Support Vector Machines (SVM), Random Forests (RF), and K-Nearest Neighbors (KNN), and also includes an ensemble method for enhanced performance.
To run the full data pipeline:
- Open and run the `main.ipynb` notebook.
- This notebook installs the necessary dependencies via `pip` and then runs each sub-notebook in turn.
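The orchestration described above could look roughly like the sketch below. It is not the contents of `main.ipynb`: the execution order is assumed from the order of the component list in this README, and running notebooks through the `jupyter nbconvert --execute` command is an assumption about how the sub-notebooks are invoked. The helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Assumed execution order, taken from the order of the component list.
SUBNOTEBOOKS = [
    "data_exploration.ipynb",
    "data_preprocessing.ipynb",
    "model_training.ipynb",
    "grid_search.ipynb",
    "ensemble_method.ipynb",
]


def build_commands(notebooks=SUBNOTEBOOKS, requirements="requirements.txt"):
    """Return the commands (as argument lists) for one full pipeline run."""
    # Step 1: install dependencies with pip, using the current interpreter.
    commands = [[sys.executable, "-m", "pip", "install", "-r", requirements]]
    # Step 2: execute each sub-notebook in place, one after another.
    for notebook in notebooks:
        commands.append(
            ["jupyter", "nbconvert", "--to", "notebook",
             "--execute", "--inplace", notebook]
        )
    return commands


def run_pipeline(dry_run=True):
    """Print every command; run it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
```

Calling `run_pipeline()` only prints the commands, which makes it easy to check the order before running anything; `run_pipeline(dry_run=False)` executes them.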
The project is organized into the following components:
- `main.ipynb`: Orchestrates the end-to-end data pipeline.
- `data_exploration.ipynb`: Generates charts and visualizations to provide insights into the dataset distribution.
- `data_preprocessing.ipynb`: Handles data cleaning, transformation, and preparation for model training.
- `model_training.ipynb`: Defines and trains machine learning models, including Neural Networks, SVM, Random Forest, and KNN.
- `grid_search.ipynb`: Performs hyperparameter tuning to optimize model performance using Grid Search.
- `ensemble_method.ipynb`: Combines multiple models through an ensemble method to improve prediction accuracy.
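This README does not say which ensemble method the project uses. As one common possibility, the sketch below shows hard (majority) voting over the predictions of several models in plain Python. The function name `majority_vote` and all prediction values are made up for illustration; Scikit-learn offers `VotingClassifier` for the same idea on real estimators.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote."""
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("every model must predict the same number of samples")
    combined = []
    # zip(*...) groups the votes of all models for each sample.
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) picks the label with the highest count;
        # on a tie, the label that was seen first wins.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from four models on four samples.
nn_pred = [1, 0, 1, 1]
svm_pred = [1, 0, 0, 1]
rf_pred = [0, 0, 1, 1]
knn_pred = [1, 1, 1, 0]

ensemble_pred = majority_vote([nn_pred, svm_pred, rf_pred, knn_pred])
print(ensemble_pred)
```

With an even number of models, ties are possible, so a real setup would need an explicit tie-breaking rule or weighted votes.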
As a key member of the project team, I contributed significantly in the following areas:
- Model Evaluation: Evaluated and compared the performance of different machine learning models (Neural Networks, SVM, Random Forest, KNN), using metrics such as accuracy, precision, and recall.
- Dataset Selection: Conducted research to select a suitable dataset for the problem at hand, ensuring it met the project's objectives and provided meaningful insights.
- Code Contributions: Drafted and implemented critical sections of the model evaluation and hyperparameter tuning code to improve the performance of our models.
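For reference, the three metrics named above can be computed from true and predicted labels as in the sketch below. It covers the binary case only, the function name `binary_metrics` is hypothetical, and the labels are made up for illustration rather than taken from the project. Scikit-learn provides `accuracy_score`, `precision_score`, and `recall_score` for the same purpose.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        # Share of all predictions that are correct.
        "accuracy": correct / len(pairs),
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of true positives that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels, for illustration only (not project results).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 6 correct predictions out of 8, so all three metrics come out to 0.75.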
- Languages: Python
- Tools: Jupyter Notebooks, Pandas, Scikit-learn, Matplotlib, Seaborn
- Models: Neural Networks (NN), Support Vector Machines (SVM), Random Forest (RF), K-Nearest Neighbors (KNN)
- Machine Learning Techniques: Data Preprocessing, Model Training, Hyperparameter Tuning, Ensemble Methods
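To make the hyperparameter tuning technique concrete, the sketch below shows the idea behind grid search in plain Python: score every combination of candidate values and keep the best one. It is not the code in `grid_search.ipynb`. The scoring function is a hypothetical stand-in for "train a model and return its validation score", and the parameter names and values are illustrative. Scikit-learn provides `GridSearchCV` for doing this with real estimators and cross-validation.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product(...) enumerates the full Cartesian grid of candidate values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(n_neighbors, weights):
    """Hypothetical stand-in for training a model and scoring it."""
    base = 0.9 - 0.05 * abs(n_neighbors - 5)
    bonus = 0.02 if weights == "distance" else 0.0
    return base + bonus


# Illustrative grid: 4 x 2 = 8 combinations are evaluated.
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

best_params, best_score = grid_search(param_grid, toy_validation_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, which is why grids are usually kept small or replaced by randomized search when many parameters are tuned.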
- Clone the repository to your local machine:
git clone https://github.com/your-username/project-repo.git
- Install the dependencies:
pip install -r requirements.txt
- Run the Jupyter notebook: