Keyword Extraction Added

@minasmz minasmz released this 17 Jan 13:14
· 4 commits to master since this release
893d354

In this release I have added keyword extraction, which extracts the most important and frequent n-grams in a text.
I use two approaches for this. Both performed well on the variety of corpora I tried, but since I do not have access to a gold-standard set, I have no results to publish.
You can give the code a large text file as input; it clusters the text by topic, then summarizes each cluster and returns its keywords.
For keyword extraction, the default approach (with_embeded=False) uses a tf-idf-style scoring, BM25, taken from the bm25.py module used by Gensim's text summarization. I added that code to my script with some alterations. It returns the most important words of each input text. The second approach (with_embeded=True) uses a word2vec model to build a graph of words in which each edge is weighted by the similarity between the two words it connects, then applies the TextRank algorithm. The TextRank implementation comes from the Gensim library, and I added it to my script with some changes. It returns the most important words while taking the meaning of the other words in the input text into account.
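To illustrate the second approach, here is a minimal sketch of ranking words on a similarity graph. It is not the code shipped in this release: the function names `cosine` and `rank_words`, the toy two-dimensional vectors standing in for word2vec embeddings, and the plain power-iteration PageRank used in place of Gensim's TextRank code are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_words(vectors, damping=0.85, iterations=50):
    """Rank words with a weighted PageRank over a similarity graph.

    vectors: dict mapping word -> list of floats (hypothetical toy
    embeddings; a real run would take them from a word2vec model).
    """
    words = list(vectors)
    # Edge weight = cosine similarity, clipped at 0 to keep weights non-negative.
    weights = {
        w: {o: max(0.0, cosine(vectors[w], vectors[o])) for o in words if o != w}
        for w in words
    }
    out_sum = {w: sum(weights[w].values()) for w in words}
    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        new_scores = {}
        for w in words:
            # Each neighbour passes on its score in proportion to the edge weight.
            incoming = sum(
                scores[o] * weights[o][w] / out_sum[o]
                for o in words
                if o != w and out_sum[o] > 0
            )
            new_scores[w] = (1 - damping) / len(words) + damping * incoming
        scores = new_scores
    return sorted(scores, key=scores.get, reverse=True)


toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.95, 0.05],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}
ranked = rank_words(toy_vectors)
```

In this toy input, "car" is barely similar to the other three words, so it receives almost no score from its neighbours and ends up last in `ranked`.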
After obtaining the most important and frequent words, the code looks them up in the unigrams through n-grams of the text (n defaults to 10). If an n-gram's frequency is more than half the frequency of the word and the n-gram occurs more than 2 times (for large inputs this threshold should be increased), the important word is replaced by the larger n-gram. Finally, the code returns the most important n-grams that contain the important words.
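The n-gram step can be sketched as follows. This is an illustration under assumptions, not the release's implementation: the function name `expand_keywords`, its parameters, and the choice to keep the longest qualifying n-gram for each word are hypothetical.

```python
from collections import Counter


def expand_keywords(tokens, keywords, max_n=10, min_count=2):
    """Replace each keyword with the longest frequent n-gram containing it.

    An n-gram qualifies when it occurs more than `min_count` times and
    more than half as often as the keyword itself.
    """
    word_freq = Counter(tokens)
    # Count every n-gram for n = 2 .. max_n.
    ngram_freq = Counter(
        tuple(tokens[i:i + n])
        for n in range(2, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    result = []
    for word in keywords:
        best = (word,)  # fall back to the unigram
        for gram, count in ngram_freq.items():
            if (
                word in gram
                and count > min_count
                and count > word_freq[word] / 2
                and len(gram) > len(best)
            ):
                best = gram
        phrase = " ".join(best)
        if phrase not in result:
            result.append(phrase)
    return result


text = (
    "machine learning helps . machine learning scales . "
    "machine learning works . graphs help"
)
phrases = expand_keywords(text.split(), ["learning", "graphs"])
print(phrases)  # ['machine learning', 'graphs']
```

Here "learning" occurs three times and always follows "machine", so it is expanded to "machine learning", while "graphs" occurs once and stays a unigram.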