Online LDA with infinite vocabulary #213

Open
mjwillson opened this issue Jun 23, 2014 · 5 comments
Labels
difficulty hard (Hard issue: required deep gensim understanding & high python/cython skills); feature (Issue described a new feature); wishlist (Feature request)

Comments

@mjwillson
Contributor

Potentially quite a biggie, this one, and I'm fully expecting a "patches welcome" response, but: when doing true online learning over document streams, it's quite nice not to have to fix the vocabulary upfront. It's also nice, if you want to model the long tail of the vocabulary, to have a model whose update steps aren't linear in the vocabulary size.

There's a recent paper, "Online Latent Dirichlet Allocation with Infinite Vocabulary", which extends the online variational inference approach from gensim's LdaModel to work in this setting, and could be a good starting point.
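
To make the "fixed vocabulary" problem concrete, here is a minimal pure-Python sketch of a vocabulary that grows as a stream is read. This is not the paper's algorithm and not gensim's API: `GrowingVocabulary` and its `doc2bow` method are hypothetical names made up for illustration. It only shows the bookkeeping side, i.e. assigning ids to unseen words on the fly; a model with a fixed-size topic-word matrix would still have no column for those new ids, which is the gap the paper addresses.

```python
from collections import Counter


class GrowingVocabulary:
    """Hypothetical helper: assigns integer ids to tokens as they first appear."""

    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens):
        """Return sorted (token_id, count) pairs, adding unseen tokens on the fly."""
        counts = Counter(tokens)
        for token in counts:
            if token not in self.token2id:
                # New word: give it the next free id instead of dropping it.
                self.token2id[token] = len(self.token2id)
        return sorted((self.token2id[t], c) for t, c in counts.items())


vocab = GrowingVocabulary()
stream = [
    ["topic", "model", "topic"],
    ["online", "model", "stream"],
]
# The vocabulary grows with each document; nothing is fixed upfront.
bows = [vocab.doc2bow(doc) for doc in stream]
```

After the second document the vocabulary holds four words, two of which were unknown when the first document was processed.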

@mjwillson
Contributor Author

There's also some Python code here: https://github.com/kzhai/InfVocLDA

@cscorley
Contributor

Nice. I am actually working on some research that definitely needs this, and was just about to do a literature search on the topic (this afternoon, too!). I would definitely be interested in working on a branch for bringing this to Gensim.

@tmylk tmylk added the difficulty hard Hard issue: required deep gensim understanding & high python/cython skills label Jan 23, 2016
@menshikh-iv menshikh-iv added the feature Issue described a new feature label Oct 3, 2017
@gauravkoradiya

How can this be done on a single machine?

@gauravkoradiya

(Quoting the original post above.)

I run out of memory when using gensim's existing LDA with a huge vocabulary. Would this resolve that issue?

Repository owner locked and limited conversation to collaborators Aug 9, 2019
@piskvorky
Owner

piskvorky commented Aug 9, 2019

@gauravkoradiya no, not related. Please stop hijacking unrelated issues. If you have some question, articulate it properly and use the mailing list.

Repository owner unlocked this conversation Aug 9, 2019
6 participants