
This GitHub project focuses on sentiment analysis for Twitter data using RoBERTa and traditional algorithms. We preprocess the datasets with text cleaning and tokenization. RoBERTa's contextualized word embeddings and fine-tuned architecture excel in sentiment classification, outperforming traditional algorithms, especially with larger datasets.


DLNLP_23_SN22081179

This is the ELEC 0141 Deep Learning for Natural Language Processing 22/23 project, which performs sentiment analysis on tweets. The datasets used are a tweet sentiment dataset from Kaggle and Sentiment140.

1. Prerequisites

To begin with, clone this project or download it to your computer or server. The structure of the folders is as follows:

DLNLP_assignment_23

The folder Datasets is not included; you can either download the required datasets according to the file tree or download the whole folder here. The checkpoints are also not included and can be downloaded here as well.

The environment

My advice is to create a new conda environment from the environment.yml file in this repo. You can do it simply by:

conda env create -f environment.yml
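
Once the environment is created, activate it before running anything. The placeholder below is not a real name from this repo; use whatever name is defined in the name: field of environment.yml:

conda activate <env_name>   # replace <env_name> with the name defined in environment.yml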

2. How to check the results of this project

If your server or computer is GPU-ready, you can proceed to check the RoBERTa results. Alternatively, you can refer to nohup.out, which is the output captured from the server.

(For this project I used turin.ee.ucl.ac.uk and checked that it works on that server. I strongly suggest testing the project on a server, as RoBERTa is a large model.) You can check whether a GPU is available with:

import torch
print(torch.cuda.is_available())

If the output is True, congratulations: you can start training by executing the corresponding file. You can download the checkpoints and run eval_roberta_S.py to test the results on the Kaggle dataset, or run eval_roberta.py to check the results on Sentiment140. Alternatively, you can fine-tune the model from scratch and generate your own checkpoints by running train_roberta.py. The RoBERTa code is written separately so that you can train and test it on a GPU server.
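
A typical sequence, assuming you run the scripts from the project root with the conda environment active (the exact working directory and checkpoint locations depend on how you set up the repo):

python train_roberta.py    # fine-tune RoBERTa from scratch and generate your own checkpoints
python eval_roberta_S.py   # evaluate on the Kaggle dataset using downloaded or generated checkpoints
python eval_roberta.py     # evaluate on Sentiment140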

To check the results of the Logistic Regression and Random Forest models:

you can simply run main.py
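
For example, from the project root (assuming the conda environment above is active):

python main.py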

Testing the Kaggle dataset:

You need to comment out the function call that obtains the Sentiment140 dataset. The code should look like this:

def main():
    """
    Main function
    """
    train_df, test_df = kaggle_data()
    # train_df, test_df = sentiment140_data()
    logistic_regression(train_df, test_df)
    random_forest(train_df, test_df)

Testing the Sentiment140 dataset:

You need to comment out the function call that obtains the Kaggle dataset. The code should look like this:

def main():
    """
    Main function
    """
    # train_df, test_df = kaggle_data()
    train_df, test_df = sentiment140_data()
    logistic_regression(train_df, test_df)
    random_forest(train_df, test_df)
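
For reference, here is a minimal sketch of what a function like logistic_regression could look like internally, just to illustrate the traditional pipeline (TF-IDF features plus a scikit-learn classifier). The column names text and sentiment and the vectorizer settings are assumptions for illustration, not taken from the repo:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def logistic_regression(train_df, test_df):
    # Turn the cleaned tweet text into TF-IDF features (settings are illustrative)
    vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
    X_train = vectorizer.fit_transform(train_df["text"])
    X_test = vectorizer.transform(test_df["text"])

    # Fit a plain logistic regression classifier on the sentiment labels
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_df["sentiment"])

    # Report accuracy on the held-out test split
    preds = clf.predict(X_test)
    print("Logistic Regression accuracy:", accuracy_score(test_df["sentiment"], preds))

random_forest follows the same pattern, with sklearn.ensemble.RandomForestClassifier in place of LogisticRegression.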
