laiqujan/sgd-based-issue-classification

Stochastic Gradient Descent (SGD)-based Issue Report Classifier

This repository contains the code and documentation for our participation in the NLBSE'23 Tool Competition.

Data Set

We used issue report data from real open-source projects, made available by Kallis et al., 2023 (https://doi.org/10.1007/978-3-031-21388-5_34) for the NLBSE'23 Tool Competition.

Training Data: 1,275,881 issue reports

Testing Data: 142,320 issue reports

Steps to run the code

Step-1: Get data

Training Data: https://tickettagger.blob.core.windows.net/datasets/nlbse23-issue-classification-train.csv.tar.gz

Testing Data: https://tickettagger.blob.core.windows.net/datasets/nlbse23-issue-classification-test.csv.tar.gz
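If you prefer to fetch the data from Python rather than a browser, the following is a minimal sketch using only the standard library. The helper names (`download`, `extract`, `main`) and the `data` output folder are assumptions made for illustration and are not part of this repository.

```python
import tarfile
import urllib.request
from pathlib import Path

# Dataset location and archive names as listed in Step-1.
BASE_URL = "https://tickettagger.blob.core.windows.net/datasets"
ARCHIVES = [
    "nlbse23-issue-classification-train.csv.tar.gz",
    "nlbse23-issue-classification-test.csv.tar.gz",
]


def download(url, dest):
    """Download url to dest unless the file already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract(archive, out_dir):
    """Unpack a .tar.gz archive into out_dir and return the extracted paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(out_dir)
    return [out_dir / name for name in names]


def main(data_dir="data"):
    """Fetch and unpack both archives (hypothetical 'data' folder)."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name in ARCHIVES:
        archive = download(f"{BASE_URL}/{name}", data_dir / name)
        print(extract(archive, data_dir))
```

Calling `main()` downloads both archives, which are large, so it is left for you to run explicitly.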

Step-2: Install

Install scikit-learn (imported as sklearn) and gensim. On Windows, install them using the following commands: pip install scikit-learn and pip install gensim.

Step-3: Download

git clone https://github.com/laiqujan/sgd-based-issue-classification.git

cd sgd-based-issue-classification

Step-4: Run

Open sgd-based-issue-classification.ipynb in Jupyter, execute all cells in the notebook, and check the results.

Classifier

We implemented an SGDClassifier with the following parameters: SGDClassifier(loss='hinge', penalty='l2', alpha=0.000001, random_state=42, max_iter=20, tol=0.001). Additional hyperparameters can be tried; see the scikit-learn SGDClassifier documentation for the full list.
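To show how these parameters fit together with TF-IDF features, here is a minimal sketch on a handful of made-up issue texts. The example texts, the label names (`bug`, `feature`, `question`), and the use of `make_pipeline` are assumptions for illustration; the notebook may wire things up differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only (not taken from the NLBSE'23 dataset).
train_texts = [
    "app crashes with null pointer exception on startup",
    "error thrown when saving the file",
    "please add support for dark mode",
    "would be nice to have an export to csv option",
    "how do I configure the proxy settings",
    "what is the recommended way to install this",
]
train_labels = ["bug", "bug", "feature", "feature", "question", "question"]

# Same classifier parameters as listed above; a linear SVM trained with SGD.
model = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(
        loss="hinge",
        penalty="l2",
        alpha=0.000001,
        random_state=42,
        max_iter=20,
        tol=0.001,
    ),
)
model.fit(train_texts, train_labels)

# Predict labels for unseen text.
predictions = model.predict(
    ["crash when opening settings", "how to change the language"]
)
print(predictions)
```

With `loss='hinge'` the model is a linear SVM, and the very small `alpha` means weak L2 regularization, which tends to suit large, sparse TF-IDF matrices.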

Pre-processing

We followed standard preprocessing steps: data cleaning followed by vectorization. Data cleaning is done mainly with Gensim; see the preprocess(text) function in the notebook. We then applied TfidfVectorizer to the cleaned text.
