This repository contains our code and documentation for our participation in the NLBSE'23 Tool Competition.
We used issue report data from real open-source projects, made available by Kallis et al. (2023) (https://doi.org/10.1007/978-3-031-21388-5_34) for the NLBSE'23 Tool Competition.
Training Data: 1,275,881 issue reports
Testing Data: 142,320 issue reports
Step-1: Get data
Training Data: https://tickettagger.blob.core.windows.net/datasets/nlbse23-issue-classification-train.csv.tar.gz
Testing Data: https://tickettagger.blob.core.windows.net/datasets/nlbse23-issue-classification-test.csv.tar.gz
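The download and a quick sanity check of an archive can be scripted. The following is a minimal sketch using only the Python standard library; the helper names download and read_first_rows are hypothetical (they are not part of this repository), and the sketch assumes each archive contains a single UTF-8 encoded CSV file.

```python
import csv
import io
import tarfile
import urllib.request

TRAIN_URL = "https://tickettagger.blob.core.windows.net/datasets/nlbse23-issue-classification-train.csv.tar.gz"
TEST_URL = "https://tickettagger.blob.core.windows.net/datasets/nlbse23-issue-classification-test.csv.tar.gz"


def download(url, dest):
    """Download url to the local path dest and return dest (hypothetical helper)."""
    urllib.request.urlretrieve(url, dest)
    return dest


def read_first_rows(archive_path, limit=5):
    """Return the first `limit` rows (header included) of the first CSV in a .tar.gz.

    Assumption: the archive holds one UTF-8 CSV file.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        # Pick the first regular file whose name ends in .csv
        member = next(
            m for m in tar.getmembers() if m.isfile() and m.name.endswith(".csv")
        )
        with tar.extractfile(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            # zip stops after `limit` rows without reading the rest of the file
            return [row for _, row in zip(range(limit), reader)]
```

For example, download(TRAIN_URL, "train.csv.tar.gz") followed by read_first_rows("train.csv.tar.gz") would show the header and the first few issue reports. The training archive is large, so the download may take a while.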
Step-2: Install
Install scikit-learn and gensim. On Windows, use the following commands:
pip install scikit-learn
pip install gensim
Step-3: Download
git clone https://github.com/laiqujan/sgd-based-issue-classification.git
cd sgd-based-issue-classification
Step-4: Run
Open sgd-based-issue-classification.ipynb in Jupyter Notebook, execute all cells, and check the results.
We implemented an SGDClassifier with the following parameters:
SGDClassifier(loss='hinge', penalty='l2', alpha=0.000001, random_state=42, max_iter=20, tol=0.001)
Additional hyperparameters can be tried; see the scikit-learn SGDClassifier documentation for the full list.
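To illustrate how this configuration fits into a training run, here is a minimal sketch that chains TfidfVectorizer and the SGDClassifier settings listed above in a scikit-learn pipeline. The texts and labels are toy data invented for illustration, not samples from the competition dataset, and the pipeline layout is an assumption rather than a copy of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration only (not from the NLBSE'23 dataset)
train_texts = [
    "application crashes with null pointer error on startup",
    "error thrown when saving file, stack trace attached",
    "please add dark mode support to the settings page",
    "add an option to export reports as csv",
]
train_labels = ["bug", "bug", "feature", "feature"]

# Same classifier parameters as listed above; the pipeline itself is a sketch
model = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(
        loss="hinge",
        penalty="l2",
        alpha=0.000001,
        random_state=42,
        max_iter=20,
        tol=0.001,
    ),
)

model.fit(train_texts, train_labels)
predictions = model.predict(["crash with error when opening the app"])
print(predictions)
```

With loss='hinge' the classifier behaves like a linear SVM trained by stochastic gradient descent, which tends to scale well to sparse TF-IDF features on datasets of this size.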
We followed standard preprocessing steps such as data cleaning and vectorization. We performed data cleaning mainly using Gensim; see the preprocess(text) function. Then we applied TfidfVectorizer.