Nonparametric Density estimation of Identifiable mixture models from Grouped Observations
general: numpy, sklearn, scipy, pandas, tensorly, seaborn, matplotlib
topic modeling experiment: polyglot, glob, nltk, STTM (GitHub repo given in Main paper)
I have included STTM for convenience, but it is NOT my code. It is code accompanying a preprint with bibtex entry: @article{qiang2018STTP, title = {Short Text Topic Modeling Techniques, Applications, and Performance: A Survey }, author = {Qiang, Jipeng and Qian Zhenyu and Li, Yun and Yuan, Yunhao and Wu, Xindong}, journal = {arXiv preprint arXiv:1904.07695}, year = {2019} }
Scatter plots, density estimates, MAGIC figure: run the appropriate jupyter notebook
In sample / out of sample results on synthetic data over multiple runs at command line run the following
python runSyntheticExps.py --dataset --n --M --nRuns --outOfSample
Topic Modeling Example: This one is a bit more involved
If you just want to see the coherence results you can run (from STTM-master)
-you need to download the wikipedia dump 'enwiki-20200220-pages-articles1.xml-p1p30303.bz2' from https://dumps.wikimedia.org/enwiki/20200220/ and store this in STTM-master
-then you need to run (from STTM-master) python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
-from here you can train the models or evaluate topic coherence as follows:
java -jar jar/STTM.jar -model CoherenceEval -label wiki.en.text -dir results/ -topWords topWords
where results/ is a preexisting folder for each method containing .topWords files that contain the top 20 words for each topic found by a method on a given run
** Note ** The experimental results for NDIGO are representative of (but not exactly) what was reported in the submission. The results provided here are actually a bit better than what was shown in the paper.
If you want to run everything from scratch:
for LFDMM and GPUDMM run the following from the command line in STTM-master: java -jar jar/STTM.jar –model -corpus troll-corpus.txt -vectors glove10d.txt -ntopics 10 -name
for NDIGO: -run the jupiter notebook Twitter Example NDIGO.ipynb 5 times with 5 different random seeds. I have tried to make everything as easy as possible by including the data after preprocessing. All cells you DON'T need to run are commented out.
You can then run the evaluation code as described above