A Reddit Flair Detector web application to detect flairs of India subreddit posts using Machine Learning algorithms. The application can be found live at Reddit Flair Detector.
The directory contains three Jupyter (IPython) notebooks and one web directory:
- Data-Collection Notebook contains all the code used to fetch data from Reddit through the PRAW API and store it in a MongoDB database using pymongo. The data was then fetched back from the database and pre-processed into a CSV file for the data analysis and for building the machine learning model.
- Exploratory Data Analysis Notebook contains all the code used to analyse and visualize the data.
- Flair-Detector Notebook contains the code used to train various machine learning models and check their accuracy on different features.
- Website Directory contains the Flask implementation of the app, along with the requirements and Procfile for Heroku deployment. Details of each file can be found in the readme.md file of that directory.
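As a rough illustration of the pre-processing step, the sketch below turns one fetched post into a cleaned record. It is a minimal sketch, not the notebook's actual code: `clean_text`, `to_record`, the two regular expressions and the tiny `STOPWORDS` set are hypothetical names introduced here, and the hard-coded stopword set stands in for the `nltk` stopword corpus so that the example has no external dependencies. The attribute names (`title`, `selftext`, `url`, `link_flair_text`, `comments`, `body`) mirror those exposed by PRAW, but the example runs on a plain stand-in object rather than a live Reddit connection.

```python
import re
from types import SimpleNamespace

# Assumption: a small hand-picked stopword set keeps this sketch dependency-free;
# the notebook itself relies on nltk for stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "for", "on"}

# Assumption: punctuation that separates words is replaced by a space,
# and any other character outside [0-9a-z #+_] is dropped as a "bad symbol".
SEPARATORS = re.compile(r"[/(){}\[\]|@,;]")
BAD_SYMBOLS = re.compile(r"[^0-9a-z #+_]")


def clean_text(text):
    """Lowercase the text, strip bad symbols and remove stopwords."""
    text = text.lower()
    text = SEPARATORS.sub(" ", text)
    text = BAD_SYMBOLS.sub("", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def to_record(post, max_comments=10):
    """Build one dataset row from a post-like object (hypothetical helper)."""
    # Only the first `max_comments` top-level comments are kept.
    top_comments = [c.body for c in list(post.comments)[:max_comments]]
    title = clean_text(post.title)
    comments = clean_text(" ".join(top_comments))
    return {
        "flair": post.link_flair_text,
        "title": title,
        "body": clean_text(post.selftext),
        "comments": comments,
        "url": post.url,
        # Combined feature: title + comments + url joined into one string.
        "combined": " ".join([title, comments, post.url]),
    }


# Stand-in for a fetched submission, used only to exercise the helpers.
example_post = SimpleNamespace(
    link_flair_text="Politics",
    title="Is the Budget (2020) good for India?",
    selftext="",
    url="https://example.com/a",
    comments=[SimpleNamespace(body="Great news!")],
)
print(to_record(example_post))
```

Records built this way can be written to a CSV file with the standard `csv.DictWriter`, one row per post.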
The entire code has been developed in the Python programming language, using its powerful text processing and machine learning modules. The application has been developed with the Flask web framework and hosted on a Heroku web server.
- Open the `Terminal`.
- Clone the repository by entering `git clone https://github.com/divyanshuaggarwal/Reddit-Flair-Detector.git` and navigate into the `website` directory by entering `cd website` in the terminal.
- Ensure that `Python3` and `pip` are installed on the system.
- Create a `virtualenv` by executing the following command: `virtualenv venv`.
- Activate the `venv` virtual environment by executing the following command: `source venv/bin/activate`.
- Enter the cloned repository directory and execute `pip install -r requirements.txt`.
- Now, execute the following command: `flask run`. The app will be served on `localhost` at port `5000`.
- Open `http://localhost:5000` in a web browser and use the application.
The dependencies are listed in requirements.txt.
After going through the available literature on text processing and on machine learning algorithms suitable for text classification, I based my approach on [2], which describes models such as Naive-Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. Along with these, I tried other models such as Random Forest and Multi-Layer Perceptron for the task. The test accuracies obtained in the various scenarios can be found in the next section.
The approach taken for the task is as follows:
- Collect 100 posts from the India subreddit for each of the 12 flairs using the `praw` module [1] and [2].
- The data includes title, comments, body, url, author, score, id, time-created and number of comments.
- For comments, only the top 10 top-level comments are included in the dataset; sub-comments are excluded.
- The title, comments and body are cleaned by removing bad symbols and stopwords using `nltk`.
- Five types of features are considered for the given task: a) Title b) Comments c) Urls d) Body e) Title, Comments and Urls combined as one feature.
- The dataset is split into 80% train and 20% test data using `train_test_split` of `scikit-learn`.
- The dataset is then converted into `Vector` and `TF-IDF` form.
- Then, the following ML algorithms (using `scikit-learn` libraries) are applied on the dataset:
a) Naive-Bayes
b) Linear Support Vector Machine
c) Logistic Regression
d) Random Forest
e) MLP
f) XGBoost
- Training and testing on the dataset showed that XGBoost achieved the best test accuracy, 94.4%, when trained on the combined Title + Comments + URL feature.
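The split, vectorize and classify steps above can be sketched with a `scikit-learn` pipeline. This is a minimal sketch under stated assumptions rather than the notebook's code: `train_and_score` is a hypothetical helper, Logistic Regression is picked only as one representative of the listed models, the stratified split and `max_iter=1000` are choices made here, and the toy texts are made-up placeholders, not the collected Reddit data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def train_and_score(texts, flairs, seed=42):
    """Split 80/20, fit a count -> TF-IDF -> classifier pipeline, return (model, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, flairs, test_size=0.2, random_state=seed, stratify=flairs
    )
    model = Pipeline([
        ("vect", CountVectorizer()),     # raw text -> token count vectors
        ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights
        ("clf", LogisticRegression(max_iter=1000)),  # assumption: one of the listed models
    ])
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


# Toy placeholder data (not the real dataset): two flairs, ten posts each.
texts = (
    [f"election vote parliament minister post{i}" for i in range(10)]
    + [f"cricket match wicket score post{i}" for i in range(10)]
)
flairs = ["Politics"] * 10 + ["Sports"] * 10

model, accuracy = train_and_score(texts, flairs)
print(f"test accuracy: {accuracy:.3f}")
print(model.predict(["cricket wicket"]))
```

Swapping the `clf` step for another estimator, or passing a different text column (title, comments, body, url or the combined feature) as `texts`, gives the other model and feature combinations.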
Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.8571428571 |
Linear SVM | 0.8995535714 |
Logistic Regression | 0.8973214285 |
Random Forest | 0.904017 |
MLP | 0.8772321428 |
XGBoost | 0.8080357142 |
Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.2678571428 |
Linear SVM | 0.3906250000 |
Logistic Regression | 0.415178 |
Random Forest | 0.4084821428 |
MLP | 0.4107142857 |
XGBoost | 0.4218750000 |
Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.6919642857 |
Linear SVM | 0.810267857100 |
Logistic Regression | 0.82142857142 |
Random Forest | 0.8191964285 |
MLP | 0.8281250000 |
XGBoost | 0.4799107142 |
Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.7544642857 |
Linear SVM | 0.8660714285 |
Logistic Regression | 0.8683035714 |
Random Forest | 0.8928571428 |
MLP | 0.8459821428 |
XGBoost | 0.8147321428 |
Machine Learning Algorithm | Test Accuracy |
---|---|
Naive Bayes | 0.8125000000 |
Linear SVM | 0.9285714285 |
Logistic Regression | 0.933035714285714 |
Random Forest | 0.94196428571 |
MLP | 0.8660714285 |
XGBoost | 0.944196 |
The tests show that the combined feature (Title + Comments + URL) gives the best accuracy, while Body gives the worst. Title and Comments as individual features are close runners-up, followed by URL. This makes sense: machine learning models pick up on specific words to identify the flair, so more content means more information. The strong performance of Title on its own may be because a title contains the keywords one would expect in the body, while comments can reveal a pattern in the topic under discussion.
- http://machineloveus.com/mining-reddit-data-or-links-to-33-python-cheat-sheets/
- http://www.storybench.org/how-to-scrape-reddit-with-python/
- https://towardsdatascience.com/scraping-reddit-data-1c0af3040768
- https://api.mongodb.com/python/current/tutorial.html
- https://medium.com/themlblog/splitting-csv-into-train-and-test-data-1407a063dd74
- https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568
- https://medium.com/@robert.salgado/multiclass-text-classification-from-start-to-finish-f616a8642538
- https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
- https://medium.com/techkylabs/getting-started-with-python-flask-framework-part-1-a4931ce0ea13 (entire series)
- https://towardsdatascience.com/designing-a-machine-learning-model-and-deploying-it-using-flask-on-heroku-9558ce6bde7b
- https://www.freecodecamp.org/news/how-to-build-a-web-application-using-flask-and-deploy-it-to-the-cloud-3551c985e492/
- https://hackernoon.com/deploy-a-machine-learning-model-using-flask-da580f84e60c
- https://blog.cambridgespark.com/deploying-a-machine-learning-model-to-the-web-725688b851c7