hybrid-text-summarization/src/create_database_train_valid
This code was implemented to automatically download patents from the USPTO database.
The code is divided into five steps:
- Extraction of the codes of the subgroups of a given class (searchSubgroups.py)
- Extraction of patent links from the USPTO page (LinksExtract.py)
- Download of the content of the links (LinksDownload.py)
  At the end of this step, the files are organized in two folders, "summary/" and "title/". Each folder has one subfolder per document class, which holds the patent abstract (in "abstract/") or the document title (in "title/") in .txt format.
- Blending of the database (blend_database.py)
- Removal of files repeated between documents in subgroups 43, 47, 52 and 56, removal of files with duplicate content, and pre-processing of the documents (organize_base.py)
Command to list duplicate files on Linux:
find . -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 20 > ../duplicate_files.txt
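The same check can be sketched in Python for platforms where `md5sum` and `uniq` are not available. This is a minimal illustration and not the code used in organize_base.py; the function name `find_duplicate_files` is hypothetical. Unlike the shell command, which compares only the first 20 characters of each hash, this sketch compares the full MD5 digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Group files under `root` whose contents share the same MD5 digest.

    Hypothetical helper for illustration; it reads each file fully into
    memory, which is fine for small .txt files such as abstracts and titles.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the digests that are shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicate_files(".")` from the database folder returns one list of paths per group of identical files, which plays the same role as the groups written to duplicate_files.txt.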
The folder hybrid-text-summarization/src/create_database_train_valid/IDs/ contains the IDs of all groups from which documents were collected.
hybrid-text-summarization/notebooks/hybrid_text_summarization.ipynb
The performance of different state-of-the-art algorithms on the text summarization task was evaluated:
- SumBasic:
  - Paper: https://doi.org/10.1016/j.ipm.2007.01.023
  - Python Library: https://pypi.org/project/sumy/
- TextRank:
  - Paper: https://aclanthology.org/W04-3252.pdf
  - Python Library: https://pypi.org/project/sumy/
- BERT Extractive Summarizer: https://arxiv.org/abs/1906.04165
  - Paper: https://arxiv.org/abs/1908.08345
  - GitHub: https://github.com/nlpyang/PreSumm
  - Our adaptation: https://github.com/CinthiaS/hybrid-text-summarization/blob/main/notebooks/BertSumm.ipynb
- PreSumm:
  - Paper: https://arxiv.org/abs/1908.08345
  - GitHub: https://github.com/nlpyang/PreSumm
  - Our adaptation: https://github.com/CinthiaS/hybrid-text-summarization/blob/main/notebooks/presumm.ipynb
- BigBird-Pegasus:
  - Paper: https://proceedings.neurips.cc//paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf
  - GitHub: https://github.com/google-research/bigbird
  - Our adaptation: https://github.com/CinthiaS/hybrid-text-summarization/blob/main/notebooks/bigbird-pegasus-summarization.ipynb
- Seq2Seq + LSTM:
  - Paper: https://link.springer.com/article/10.1007/s11192-020-03732-x
To validate the results obtained, the ROUGE metrics and the NUBIA metric were used.
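To give an idea of what ROUGE measures, the sketch below computes ROUGE-N precision, recall and F1 from clipped n-gram overlap between a generated summary and a reference. It is an illustration only, assuming lowercased whitespace tokenization with no stemming or stopword handling, so its scores may differ from those of the ROUGE implementation used in the notebook. The function names `ngram_counts` and `rouge_n` are hypothetical. NUBIA relies on pretrained neural models and is not sketched here.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall and F1 for two strings.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, which is simpler than what ROUGE packages usually do.
    """
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)` finds 5 shared unigrams out of 6 on each side, so precision and recall are both 5/6.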