LSA dimensionality #28
Hi M. Řehůřek, thank you for opening an issue for automated dimensionality setting. Pierre-Yves Lafleur ([email protected]) and I ([email protected]) are currently trying to implement MDL with Gensim and test it. In a few days, we should be able to provide an answer, in our context (small corpora), to your question "Is MDL robust enough, across several corpora?"
Great, getting rid of an extra parameter in LSA would be really cool! Also note that @dedan added an easy way to select an LSA submodel (train on K topics, but use only L <= K topics for transformations) in commit 7711cbd. You simply set the …
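The train-once, use-fewer-topics idea above can be sketched with plain numpy (this is an illustration of the principle, not gensim's actual submodel API; the fold-in scaling by singular values is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 20))             # toy document-term matrix: 50 docs, 20 terms

K = 10                               # number of topics trained
U, s, Vt = np.linalg.svd(A, full_matrices=False)
topic_basis = Vt[:K]                 # K topic vectors in term space

def transform(doc_vec, L):
    """Project a term vector onto only the first L of the K trained topics."""
    return topic_basis[:L] @ doc_vec

doc = A[0]
full = transform(doc, K)             # use all K topics
sub = transform(doc, 5)              # submodel: only the first 5
# The submodel projection is just a prefix of the full projection,
# so no re-training is needed when L changes:
assert np.allclose(sub, full[:5])
```

Because the topic vectors are ordered by singular value, truncating to L dimensions only drops the trailing coordinates; nothing else changes.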
@cperreault Would you still like to add MDL to gensim?
Yes, I would be interested! I would have to dive into Gensim again. I wrote an automatic number-of-topics "chooser", based on MDL, in my Master's thesis (http://www.theses.ulaval.ca/2013/29936/ - in French!). It is currently a very custom implementation: analyzing a corpus with an increasing k and storing the results in a MySQL DB. If there is interest, I could contribute time and effort to (try to) implement it in a Gensim-native way.
Oh yes, we're still interested. Thanks! Depending on what the "analysis" entails, we could even make this the default behaviour. There's no need to re-train models with LSA just to tweak …
@cperreault Hi, can we use the automatic number-of-topics "chooser" now?
@ljdawn @piskvorky Hi! Sorry for my late response. I have not reworked it and I don't expect to have time until a few months from now. However, if you think it could be useful, in the short term I can explain the simple principle behind "my" automatic number-of-topics (k) chooser. (It is explained in my Master's thesis, in French; see the link above.) A given corpus is analyzed (LSA) with an increasing k, starting from k=1. For each k, the distribution of (dis)similarities among documents is computed and binned at each 0.1 between 0 and 1. The lowest k that allows most documents to be considered dissimilar (between 0 and 0.1) is chosen. This is not rocket science, I know, but it gave me interesting results across the thousands of corpora analyzed, and at least a way to automate the analysis and a basis for comparing corpora. What do you think?
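The procedure described above can be sketched as follows (a rough approximation of the thesis method, not the original code; the 0.1 dissimilarity cutoff and the "most documents" threshold are the only parameters):

```python
import numpy as np

def choose_k(A, k_max=20, cutoff=0.1, majority=0.5):
    """For each k, project documents onto the top-k LSA dimensions and
    measure the fraction of document pairs whose cosine similarity falls
    below `cutoff`. Return the lowest k for which that fraction exceeds
    `majority`; fall back to k_max if none does."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)                 # unique document pairs
    for k in range(1, min(k_max, len(s)) + 1):
        X = U[:, :k] * s[:k]                     # doc coordinates in k-dim LSA space
        norms = np.linalg.norm(X, axis=1)
        norms[norms == 0] = 1.0                  # guard against zero vectors
        sims = (X @ X.T) / np.outer(norms, norms)
        frac_dissimilar = np.mean(sims[iu] < cutoff)
        if frac_dissimilar > majority:
            return k
    return k_max

rng = np.random.default_rng(1)
A = rng.random((30, 40))                         # toy document-term matrix
k = choose_k(A)
```

Since the SVD is computed once and only sliced per k, the loop over increasing k is cheap; the cost is dominated by the pairwise similarity matrix.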
FYI @dsquareindia this is related to your recent work on selecting the number of topics through coherence.
Yes, coherence can surely be used as an automatic "chooser" for LSA as well: we could choose the number of topics with the best coherence from among, say, up to 100 topics. I'll test it out on LSA and get back here. Currently I've only tested it on LDA, but I think it'll work even better with LSA.
It would be nice to compare coherence to the approach of @cperreault.
LSA topics have no interpretation, so I don't think "coherence" (as in, semantically interpretable topics) makes much sense.
Try the automated dimensionality setting for Latent Semantic Analysis, via MDL:
http://www.springerlink.com/content/500651582r310t05/
This means: reproduce Fig. 1 from that article. See what it does on the Lee corpus. Does the curve make sense? Is MDL robust enough across several corpora?
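The exact criterion from the linked article isn't reproduced here, but as an illustration of MDL-based dimensionality selection, here is the classic Wax and Kailath (1985) MDL estimator for the number of significant components, computed from the eigenvalues of a sample covariance matrix. It is in the same spirit, though not necessarily the formula the paper uses:

```python
import numpy as np

def mdl_order(eigvals, n_samples):
    """Wax-Kailath MDL criterion: for each candidate order k, compare the
    geometric and arithmetic means of the trailing eigenvalues (a sphericity
    test on the 'noise' subspace) plus a log(n)-scaled parameter penalty.
    Returns the k (0 <= k < p) minimising MDL(k)."""
    lam = np.sort(eigvals)[::-1]                # descending eigenvalues
    p, n = len(lam), n_samples
    scores = []
    for k in range(p):
        tail = lam[k:]
        m = p - k
        geo = np.exp(np.mean(np.log(tail)))     # geometric mean of tail
        ari = np.mean(tail)                     # arithmetic mean of tail
        fit = -n * m * np.log(geo / ari)        # 0 iff the tail is flat
        penalty = 0.5 * k * (2 * p - k) * np.log(n)
        scores.append(fit + penalty)
    return int(np.argmin(scores))

# Toy check: 2 strong components buried in unit-variance noise.
rng = np.random.default_rng(0)
n, p = 500, 6
signal = rng.normal(size=(n, 2)) @ (10.0 * rng.normal(size=(2, p)))
X = signal + rng.normal(size=(n, p))
eigvals = np.linalg.eigvalsh(np.cov(X.T))
k_hat = mdl_order(eigvals, n)
```

Plotting `scores` against k gives the kind of curve Fig. 1 in the article shows: a steep drop while real structure remains, then a slow penalty-driven rise, with the minimum at the chosen dimensionality.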