Repository with the source code used on our final project of the Machine Learning Applications course where we implemented an extensive exploratory, semantic, and predictive analysis with Python of a textual dataset compiled of several thousand reddit posts from a ser of multiple selected subreddits.
Universidad Carlos III de Madrid. May, 2021
- Simón E. Sánchez Viloria
- Enrique Botía Barbera
- Andres Ruíz Calvo
- David Méndez Encinas
This project contains the following notebooks that explain the exploratory and scientific analysis made on this project:
-
reddit_data_acquisition.ipynb
: Notebook used to scrap and export several thousand posts from Reddit in order to build the dataset used in this project. We relied on Python's Reddit API WrapperPRAW
to build the dataset. -
task_1_reddit_topic_modeling_and_graph.ipynb
: Where we used exploratory and NLP techniques to to process the textual information obtained from the dataset acquired, vectorize it and do topic modeling with it. -
task_2_subreddit_classification.ipynb
: Notebook used to train, evaluate, and compare the models used to predict the subreddit where a post comes from given only its semantic representation. Performance with both LDA and TFIDF representations are compared among each other.
├── LICENSE
│
├── README.md <- The top-level README for developers using this project.
│
├── notebooks <- Jupyter notebooks that can be used to reproduce the analysis made in this project.
│
├── references <- Project statement and explanatory materials.
│
├── reports <- Contains a report summarizing and documenting our analysis and conclusions.
│ └── figures <- Generated graphics and figures used in the analysis.
│
├── requirements.txt <- The requirements file for reproducing the analysis environment.
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
│
└── src <- Source code for use in this project.
└── __init__.py <- Makes src a Python module
Project based on the cookiecutter data science project template. #cookiecutterdatascience