Text Summarization

This repository contains a SingularityNET service for summarizing articles.

The current approach is that of Get To The Point: Summarization with Pointer-Generator Networks, as implemented by OpenNMT.

There are a number of other methods that may be of interest, so we may implement a separate service for each while keeping the interface the same as far as possible.

Setup

These steps were run on Ubuntu 18.04; if you use a different distro or OS, the specifics may differ.

OpenNMT-py is a submodule, so you should clone the nlp-services repo with --recurse-submodules:

git clone --recurse-submodules -j8 git@github.com:singnet/nlp-services.git

Now install the Python dependencies for both OpenNMT and this project. NumPy is explicitly reinstalled with -I (--ignore-installed), as I ran into an error with the version from opennmt-py's requirements.

cd nlp-services/text-summarization
mkvirtualenv --python=/usr/bin/python3.6 text-summarization
pip install -r opennmt-py/requirements.txt
pip install numpy -I
pip install -r requirements.txt
./buildproto.sh

Lastly, you need to download the trained transformer model for summarization (details) and the Stanford CoreNLP Java library. While an external Java library is clunky, CoreNLP was the tokenizer used when training the model, so we also use it to tokenize new user input. This keeps differences between tokenization algorithms from affecting results.

python ../fetch_models.py

The above will download the archives and extract them to the nlp-services/text-summarization/models directory.
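
To see why the tokenizer used on new input should match the one used in training, consider the toy sketch below. It is not CoreNLP and not part of this service: `ptb_like_tokenize`, `whitespace_tokenize`, and the tiny `VOCAB` are hypothetical stand-ins, chosen only to show how a different tokenizer can turn known words into unknown tokens.

```python
import re

# Hypothetical toy vocabulary, as if built from text tokenized PTB-style
# (punctuation split off from words). Not the real model vocabulary.
VOCAB = {"she", "says", "the", "story", "was", "wrong", ",", ".", "<unk>"}


def ptb_like_tokenize(text):
    """Rough stand-in for a PTB-style tokenizer: split punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def whitespace_tokenize(text):
    """Naive alternative: split on whitespace only."""
    return text.lower().split()


def to_model_input(tokens):
    """Map out-of-vocabulary tokens to <unk>, as a fixed-vocabulary model would."""
    return [tok if tok in VOCAB else "<unk>" for tok in tokens]


sentence = "She says the story was wrong."
matched = to_model_input(ptb_like_tokenize(sentence))
mismatched = to_model_input(whitespace_tokenize(sentence))

print(matched)     # every token is found in the vocabulary
print(mismatched)  # "wrong." is not in the vocabulary, so it becomes <unk>
```

With the mismatched tokenizer the model never sees the word "wrong" at all, which is the kind of silent degradation that reusing the training tokenizer avoids.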

Running the server and making calls

Start the server with:

python -m services.summary_server

In another terminal, make a request to summarize an article with:

$ python client.py --source-text example_article.txt
 Senior National Collins is standing by her tweet of a fake news story. she says she had got her "sourcing" wrong, insisting it had some details wrong.
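
For scripted use, the same call can be made from Python by shelling out to client.py. This is a minimal sketch: `build_client_command` and `summarize_file` are hypothetical helpers, not part of this repository. They assume the server is already running, that the script is run from the text-summarization directory, and that the client prints the summary to standard output as in the example above.

```python
import subprocess
import sys


def build_client_command(article_path):
    """Build the same command line as shown above for a given article file."""
    return [sys.executable, "client.py", "--source-text", article_path]


def summarize_file(article_path):
    """Run client.py and return whatever it prints (assumed to be the summary)."""
    result = subprocess.run(
        build_client_command(article_path),
        stdout=subprocess.PIPE,    # capture the printed summary
        universal_newlines=True,   # decode output as text (works on Python 3.6)
        check=True,                # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

Calling `summarize_file("example_article.txt")` should then return the same text the command above prints.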

OpenNMT Notes

In its current state, OpenNMT is biased towards command-line usage. These commands, to be run in the opennmt-py directory, were useful for initially experimenting with the summarization models. (OpenNMT's translate.py script is used for calling all of its models, whether they do language translation or not.)

You will need to first download models from the opennmt-py model page.

# Gigaword model
python translate.py -gpu 0 \
    -batch_size 20 \
    -beam_size 5 \
    -model ../models/gigaword_nocopy_acc_51.33_ppl_12.74_e20.pt \
    -src ~/data/cnndm/test.txt.src \
    -output cnndm.out \
    -min_length 35 \
    -verbose \
    -stepwise_penalty \
    -coverage_penalty summary \
    -beta 5 \
    -length_penalty wu \
    -alpha 0.9 \
    -block_ngram_repeat 3 \
    -ignore_when_blocking "." "</t>" "<t>"

# Transformer summarization model
python translate.py -gpu 0 \
    -batch_size 10 \
    -beam_size 5 \
    -model ../models/sum_transformer_model_acc_57.25_ppl_9.22_e16.pt \
    -src ~/data/cnndm/test.txt.src \
    -output cnndm.out \
    -min_length 35 \
    -verbose \
    -stepwise_penalty \
    -coverage_penalty summary \
    -beta 5 \
    -length_penalty wu \
    -alpha 0.9 \
    -block_ngram_repeat 3 \
    -ignore_when_blocking "." "</t>" "<t>" \
    -replace_unk
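
The two invocations above share most of their decoding options. If you run these experiments often, a small script can keep the shared flags in one place. The sketch below only builds the argument list. `DECODING_FLAGS` and `build_translate_command` are hypothetical names, and every flag is copied from the commands above rather than from OpenNMT's documentation.

```python
import os
import shlex

# Decoding flags shared by both commands above, copied as-is.
DECODING_FLAGS = [
    "-beam_size", "5",
    "-min_length", "35",
    "-verbose",
    "-stepwise_penalty",
    "-coverage_penalty", "summary",
    "-beta", "5",
    "-length_penalty", "wu",
    "-alpha", "0.9",
    "-block_ngram_repeat", "3",
    "-ignore_when_blocking", ".", "</t>", "<t>",
]


def build_translate_command(model, src, output, batch_size, replace_unk=False):
    """Return the translate.py argument list for one model."""
    cmd = [
        "python", "translate.py", "-gpu", "0",
        "-batch_size", str(batch_size),
        "-model", model,
        # Expand "~" here because no shell will do it for a list of arguments.
        "-src", os.path.expanduser(src),
        "-output", output,
    ]
    cmd += DECODING_FLAGS
    if replace_unk:
        cmd.append("-replace_unk")
    return cmd


cmd = build_translate_command(
    model="../models/sum_transformer_model_acc_57.25_ppl_9.22_e16.pt",
    src="~/data/cnndm/test.txt.src",
    output="cnndm.out",
    batch_size=10,
    replace_unk=True,
)
# Print a copy-pasteable version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` from the opennmt-py directory.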

Data

The original CNN/Daily Mail data is available at http://cs.nyu.edu/~kcho/DMQA/

These repos have preprocessing code, including tokenization using CoreNLP:

Tokenization is done by Stanford's Java library, CoreNLP.

License

MIT