We leverage pretrained models from KoELECTRA and adapt them for training on the KorQuAD 2.1 dataset. Specifically:
- We added data preprocessing
- We modified the transformer to fit the KorQuAD 2.1 dataset
- We implemented a sliding window over long contexts to improve accuracy (see the sketch after this list)
- We created our own Q&A datasets from business reports and used them for training
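The sliding window mentioned above can be implemented with the Hugging Face tokenizer's overflow feature. A minimal sketch, assuming the monologg/koelectra-base-v3-discriminator tokenizer (substitute the model ID from the links below if it differs):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

question = "사업보고서의 제출 기한은 언제인가?"  # "When is the business report due?"
long_context = "..."  # a context longer than the model's 512-token limit

# truncation="only_second" keeps the question intact and splits only the
# context into overlapping windows; stride controls the token overlap.
encodings = tokenizer(
    question,
    long_context,
    truncation="only_second",
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

# Each entry in encodings["input_ids"] is one window over the long context;
# at inference time, the highest-scoring answer span across windows wins.
print(len(encodings["input_ids"]))
```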
If you want to see the backend and frontend of AIC, see AIC-BE / AIC-FE.
- The KoELECTRA fine-tuning was performed by referring to this link
- The transformer can be used directly through this Hugging Face link
- You can download the KorQuAD 2.1 dataset from this link
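For example, the pretrained weights can be pulled straight from the hub. A minimal sketch; the model ID below is our assumption, so substitute the one from the link above:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Load the pretrained KoELECTRA encoder with a span-prediction head for QA.
# The model ID is an assumption; use the one from the Hugging Face link.
model_id = "monologg/koelectra-base-v3-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
```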
To remove unnecessary HTML tags from the data files, run:
python tag_remover.py --task korquad --config_file koelectra-base-v3.json
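Internally, tag removal amounts to something like the following hypothetical snippet (BeautifulSoup-based; the function name is ours, not the script's):

```python
from bs4 import BeautifulSoup

def strip_tags(html: str) -> str:
    """Hypothetical helper: drop non-content tags and return visible text only."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove scripts and styles entirely
    return soup.get_text(separator=" ", strip=True)
```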
You can simply clone the KoELECTRA repo to your own machine, then overwrite our files into the KoELECTRA/finetune
directory.
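For example (assuming the upstream repo is monologg/KoELECTRA on GitHub and our files sit in a local finetune/ directory):

git clone https://github.com/monologg/KoELECTRA.git
cp -r finetune/* KoELECTRA/finetune/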
To train this model, run:
python run_squad.py --task korquad --config_file koelectra-base-v3.json
To validate this model, run:
python run_squad.py --task korquad --config_file koelectra-base-v3_test.json
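Once training finishes, the checkpoint can be used for inference. A minimal sketch, assuming run_squad.py saved a transformers-compatible checkpoint under ./output (the path is our assumption; point it at your actual output directory):

```python
from transformers import pipeline

# Build a question-answering pipeline from the fine-tuned checkpoint.
# "./output" is an assumed save path, not a fixed convention of run_squad.py.
qa = pipeline("question-answering", model="./output", tokenizer="./output")

result = qa(
    question="회사의 주요 사업은 무엇인가?",  # "What is the company's main business?"
    context="당사는 반도체 장비의 제조 및 판매를 주요 사업으로 한다.",
)
print(result["answer"], result["score"])
```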
To make a custom dataset in the KorQuAD 2.1 format from target files, run:
python make_custom_dataset.py --data_dir {directory containing html files} --name 정빈
Use --name to distinguish contributors when more than one person is building the dataset (it keeps example IDs unique).
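The emitted records follow the KorQuAD 2.1 (SQuAD-style) layout, roughly like the sketch below; the field names are our reading of the format, so check the official KorQuAD 2.1 schema for exact details:

```python
# Rough sketch of one record in the generated dataset (field names assumed
# from the SQuAD-style layout KorQuAD 2.1 uses; verify against the schema).
example = {
    "version": "KorQuAD_2.1_custom",  # any version string
    "data": [{
        "title": "사업보고서",  # "business report"
        "context": "<html> ... raw page ... </html>",
        "qas": [{
            "id": "정빈_0001",  # the --name prefix keeps ids unique per contributor
            "question": "제출 회사의 명칭은 무엇인가?",
            "answer": {"text": "...", "answer_start": 0},
        }],
    }],
}
```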