
Topic Clustering Using Doc2Vec

@minasmz minasmz released this 19 Dec 14:15
· 4 commits to master since this release
893d354

Running this code on an arbitrary input text file (input.txt) clusters the document's paragraphs by topic and then returns a summary of each cluster.
To do this, first train a doc2vec model on the paragraphs of a training set and place it in the project under the name my_model_parags_from_wikiAggregate.doc2vec. The code then obtains a vector for each input paragraph and calculates the cosine similarity between each pair of consecutive paragraphs; if the similarity exceeds a calculated threshold, the two paragraphs are assigned to the same cluster. Each cluster is then summarized separately, so the summaries do not miss the important topics of the input text.
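
The consecutive-paragraph clustering step can be illustrated with a minimal sketch. It is not the project's actual code: the function names cosine_similarity and cluster_consecutive are hypothetical, the paragraph vectors are assumed to have already been inferred from the doc2vec model, and the threshold is passed in as a plain parameter because the notes above do not say how it is calculated.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_consecutive(vectors, threshold):
    """Group paragraph indices into clusters of consecutive, similar paragraphs.

    A paragraph joins the current cluster when its similarity to the
    previous paragraph exceeds `threshold`; otherwise it starts a new cluster.
    `threshold` is an assumed input, not the project's calculated value.
    """
    if not vectors:
        return []
    clusters = [[0]]
    for i in range(1, len(vectors)):
        if cosine_similarity(vectors[i - 1], vectors[i]) > threshold:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


# Toy 2-D vectors standing in for doc2vec paragraph vectors.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_clusters = cluster_consecutive(toy_vectors, threshold=0.5)
print(toy_clusters)
```

With real data, each cluster's paragraph indices would be mapped back to the paragraph text and passed to the summarizer one cluster at a time.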