Topic Clustering Using Doc2Vec
Running this code on an arbitrary input text file (input.txt) clusters the document's paragraphs by topic and returns a summary of each cluster.
To do this, first train a doc2vec model on the paragraphs of a training set and place it in the project under the name my_model_parags_from_wikiAggregate.doc2vec. The code then obtains a vector for each input paragraph and calculates the cosine similarity between each pair of consecutive paragraphs; if the similarity is higher than a calculated threshold, the two paragraphs are assigned to the same cluster. Each cluster is then summarized separately, so the summarization does not miss the important topics of the input text.
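
The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The function names (infer_vectors, cosine_similarity, cluster_consecutive) are hypothetical, the whitespace tokenization is an assumption, and because the text does not say how the threshold is calculated, the sketch assumes the mean of the consecutive similarities as a stand-in. Loading the model relies on gensim's Doc2Vec.load and infer_vector.

```python
import math


def infer_vectors(paragraphs, model_path="my_model_parags_from_wikiAggregate.doc2vec"):
    """Infer one vector per paragraph with a trained gensim Doc2Vec model.

    Assumption: paragraphs are tokenized by lowercasing and splitting on
    whitespace; the project may preprocess text differently.
    """
    from gensim.models.doc2vec import Doc2Vec  # imported lazily; requires gensim

    model = Doc2Vec.load(model_path)
    return [model.infer_vector(p.lower().split()) for p in paragraphs]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_consecutive(vectors, threshold=None):
    """Group paragraph indices into clusters of consecutive paragraphs.

    Paragraph i joins the cluster of paragraph i - 1 when their cosine
    similarity is above the threshold; otherwise it starts a new cluster.
    Assumption: when no threshold is given, the mean of all consecutive
    similarities is used.
    """
    if not vectors:
        return []
    sims = [
        cosine_similarity(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    if threshold is None:
        threshold = sum(sims) / len(sims) if sims else 0.0

    clusters = [[0]]
    for i, sim in enumerate(sims, start=1):
        if sim > threshold:
            clusters[-1].append(i)  # same topic as the previous paragraph
        else:
            clusters.append([i])    # topic changed: open a new cluster
    return clusters
```

With paragraphs read from input.txt, cluster_consecutive(infer_vectors(paragraphs)) would return lists of paragraph indices, and each list could then be passed to the summarizer on its own.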