Skip to content

Nonparametric Density estimation of Identifiable mixture models from Grouped Observations

Notifications You must be signed in to change notification settings

alex-ritchie/NDIGO

Repository files navigation

NDIGO

Nonparametric Density estimation of Identifiable mixture models from Grouped Observations

Dependencies

general: numpy, sklearn, scipy, pandas, tensorly, seaborn, matplotlib

topic modeling experiment: polyglot, glob, nltk, STTM (GitHub repo given in Main paper)

I have included STTM for convenience, but it is NOT my code. It is code accompanying a preprint with bibtex entry: @article{qiang2018STTP, title = {Short Text Topic Modeling Techniques, Applications, and Performance: A Survey }, author = {Qiang, Jipeng and Qian Zhenyu and Li, Yun and Yuan, Yunhao and Wu, Xindong}, journal = {arXiv preprint arXiv:1904.07695}, year = {2019} }

Instructions for Reproduction of Results

Scatter plots, density estimates, MAGIC figure: run the appropriate jupyter notebook

In sample / out of sample results on synthetic data over multiple runs at command line run the following

python runSyntheticExps.py --dataset --n --M --nRuns --outOfSample

Topic Modeling Example: This one is a bit more involved

If you just want to see the coherence results you can run (from STTM-master)

-you need to download the wikipedia dump 'enwiki-20200220-pages-articles1.xml-p1p30303.bz2' from https://dumps.wikimedia.org/enwiki/20200220/ and store this in STTM-master

-then you need to run (from STTM-master) python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text

-from here you can train the models or evaluate topic coherence as follows:

java -jar jar/STTM.jar -model CoherenceEval -label wiki.en.text -dir results/ -topWords topWords

where results/ is a preexisting folder for each method containing .topWords files that contain the top 20 words for each topic found by a method on a given run

** Note ** The experimental results for NDIGO are representative of (but not exactly) what was reported in the submission. The results provided here are actually a bit better than what was shown in the paper.

If you want to run everything from scratch:

for LFDMM and GPUDMM run the following from the command line in STTM-master: java -jar jar/STTM.jar –model -corpus troll-corpus.txt -vectors glove10d.txt -ntopics 10 -name

for NDIGO: -run the jupiter notebook Twitter Example NDIGO.ipynb 5 times with 5 different random seeds. I have tried to make everything as easy as possible by including the data after preprocessing. All cells you DON'T need to run are commented out.

You can then run the evaluation code as described above

About

Nonparametric Density estimation of Identifiable mixture models from Grouped Observations

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published