LSA dimensionality #28
Hi M. Řehůřek, thank you for opening an issue for automated dimensionality setting. Pierre-Yves Lafleur ([email protected]) and I ([email protected]) are currently trying to implement MDL with Gensim and test it. In a few days, we should be able to provide an answer, in our context (small corpora), to your question "Is MDL robust enough, across several corpora?"
Great, getting rid of an extra parameter in LSA would be really cool! Also note that @dedan added an easy way to select an LSA submodel (train on K topics, but use only L <= K topics for transformations) in commit 7711cbd. You simply set the …
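The train-once, use-fewer-topics idea above can be sketched with plain numpy (this is an illustration of the principle, not gensim's actual submodel API; the fold-in scaling by singular values is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 20))             # toy document-term matrix: 50 docs, 20 terms

K = 10                               # number of topics trained
U, s, Vt = np.linalg.svd(A, full_matrices=False)
topic_basis = Vt[:K]                 # K topic vectors in term space

def transform(doc_vec, L):
    """Project a term vector onto only the first L of the K trained topics."""
    return topic_basis[:L] @ doc_vec

doc = A[0]
full = transform(doc, K)             # use all K topics
sub = transform(doc, 5)              # submodel: only the first 5
# The submodel projection is just a prefix of the full projection,
# so no re-training is needed when L changes:
assert np.allclose(sub, full[:5])
```

Because the topic vectors are ordered by singular value, truncating to L dimensions only drops the trailing coordinates; nothing else changes.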
@cperreault Would you still like to add MDL to gensim?
Yes, I would be interested! I would have to dive into Gensim again. I wrote an automatic number-of-topics "chooser", based on MDL, in my Master's thesis (http://www.theses.ulaval.ca/2013/29936/ - in French!). It is currently a very custom implementation: analyzing a corpus with an increasing k and storing the results in a MySQL DB. If there is interest, I could contribute time and effort to (try to) implement it in a Gensim-native way.
Oh yes, we're still interested. Thanks! Depending on what the "analysis" entails, we could even make this the default behaviour. There's no need to re-train models with LSA just to tweak …
@cperreault Hi, can we use the automatic number-of-topics "chooser" now?
@ljdawn @piskvorky Hi! Sorry for my late response. I have not reworked it and I don't expect to have time until a few months from now. However, if you think it could be useful, in the short term I can explain the simple principle behind "my" automatic number-of-topics (k) chooser. (It is explained in my Master's thesis, in French; see the link above.) A given corpus is analyzed (LSA) with an increasing k, starting from k=1. For each k, the distribution of (dis)similarities among documents is computed and binned at each 0.1 between 0 and 1. The lowest k that allows most documents to be considered dissimilar (between 0 and 0.1) is chosen. This is not rocket science, I know, but it gave me interesting results across the thousands of corpora analyzed, and at least a way to automate the analysis and a basis for comparing corpora. What do you think?
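The procedure described above can be sketched as follows (a rough approximation of the thesis method, not the original code; the 0.1 dissimilarity cutoff and the "most documents" threshold are the only parameters):

```python
import numpy as np

def choose_k(A, k_max=20, cutoff=0.1, majority=0.5):
    """For each k, project documents onto the top-k LSA dimensions and
    measure the fraction of document pairs whose cosine similarity falls
    below `cutoff`. Return the lowest k for which that fraction exceeds
    `majority`; fall back to k_max if none does."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)                 # unique document pairs
    for k in range(1, min(k_max, len(s)) + 1):
        X = U[:, :k] * s[:k]                     # doc coordinates in k-dim LSA space
        norms = np.linalg.norm(X, axis=1)
        norms[norms == 0] = 1.0                  # guard against zero vectors
        sims = (X @ X.T) / np.outer(norms, norms)
        frac_dissimilar = np.mean(sims[iu] < cutoff)
        if frac_dissimilar > majority:
            return k
    return k_max

rng = np.random.default_rng(1)
A = rng.random((30, 40))                         # toy document-term matrix
k = choose_k(A)
```

Since the SVD is computed once and only sliced per k, the loop over increasing k is cheap; the cost is dominated by the pairwise similarity matrix.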
FYI @dsquareindia this is related to your recent work on selecting the number of topics through coherence.
Yes, coherence can surely be used as an automatic "chooser" for LSA as well: we could choose the number of topics with the best coherence from among, say, up to 100 topics. I'll test it out on LSA and get back here. Currently I've only tested it on LDA, but I think it'll work even better with LSA.
It would be nice to compare coherence to the approach of @cperreault.
LSA topics have no interpretation, so I don't think "coherence" (as in, semantically interpretable topics) makes much sense.
Try the automated dimensionality setting for Latent Semantic Analysis, via MDL:
http://www.springerlink.com/content/500651582r310t05/
This means: reproduce Fig. 1 from that article. See what it does on the Lee corpus. Does the curve make sense? Is MDL robust enough across several corpora?
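The exact criterion from the linked article isn't reproduced here, but as an illustration of MDL-based dimensionality selection, here is the classic Wax and Kailath (1985) MDL estimator for the number of significant components, computed from the eigenvalues of a sample covariance matrix. It is in the same spirit, though not necessarily the formula the paper uses:

```python
import numpy as np

def mdl_order(eigvals, n_samples):
    """Wax-Kailath MDL criterion: for each candidate order k, compare the
    geometric and arithmetic means of the trailing eigenvalues (a sphericity
    test on the 'noise' subspace) plus a log(n)-scaled parameter penalty.
    Returns the k (0 <= k < p) minimising MDL(k)."""
    lam = np.sort(eigvals)[::-1]                # descending eigenvalues
    p, n = len(lam), n_samples
    scores = []
    for k in range(p):
        tail = lam[k:]
        m = p - k
        geo = np.exp(np.mean(np.log(tail)))     # geometric mean of tail
        ari = np.mean(tail)                     # arithmetic mean of tail
        fit = -n * m * np.log(geo / ari)        # 0 iff the tail is flat
        penalty = 0.5 * k * (2 * p - k) * np.log(n)
        scores.append(fit + penalty)
    return int(np.argmin(scores))

# Toy check: 2 strong components buried in unit-variance noise.
rng = np.random.default_rng(0)
n, p = 500, 6
signal = rng.normal(size=(n, 2)) @ (10.0 * rng.normal(size=(2, p)))
X = signal + rng.normal(size=(n, p))
eigvals = np.linalg.eigvalsh(np.cov(X.T))
k_hat = mdl_order(eigvals, n)
```

Plotting `scores` against k gives the kind of curve Fig. 1 in the article shows: a steep drop while real structure remains, then a slow penalty-driven rise, with the minimum at the chosen dimensionality.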