Add mSDA model #294
Looks good, thanks a lot Philipp! CC @cscorley @gojomo @temerick @maciejkula for help with benchmarks / code review :)
Hey again! How should I proceed with this - are there maybe some common classification tasks that I could look at for benchmarks? At what point should I create a PR?
Small update: I ran a basic text classification benchmark on Reuters-21578, comparing mSDA to simple bag-of-words, LSI, and random noise features. The good news is that mSDA is significantly better than random noise; the bad news is that it is outperformed by LSI, which in turn is outperformed by plain bag-of-words features. More detailed results are below. I'm training only on the Reuters documents, so that might explain why neither LSI nor mSDA learns features that outperform bag of words. Note that this is mSDA at 200 dimensions, which is not typical - usually around 1000 dimensions or so would be used.
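A comparison along these lines could be set up roughly as in the sketch below. The `load_reuters` helper, the assumption of a single topic label per document, the 200-dimensional LSI setting, and logistic regression as the classifier are illustrative assumptions, not necessarily what was used for the numbers reported here.

```python
# Hypothetical benchmark sketch: compare feature sets on Reuters-21578
# with a simple classifier. load_reuters() is an assumed helper that
# returns raw document texts and a single topic label per document.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from gensim import corpora, models, matutils

docs, labels = load_reuters()  # assumption: texts + single-label targets

# Bag-of-words baseline
bow = CountVectorizer(max_features=20000).fit_transform(docs)

# LSI features via gensim
texts = [doc.lower().split() for doc in docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=200)
lsi_feats = matutils.corpus2dense(lsi[corpus], num_terms=200).T

# Random-noise control
rand_feats = np.random.randn(len(docs), 200)

# mSDA features would be produced analogously by the model under discussion
for name, X in [("bow", bow), ("lsi", lsi_feats), ("random", rand_feats)]:
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(name, scores.mean())
```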
Here's mSDA at 1000 dimensions, otherwise the same task:
mSDA is also quite a bit slower in evaluation. I can't say for sure that there are no errors in my implementation, but since there does seem to be some learning of useful patterns happening, I'm so far inclined to believe that the dimensionality reduction mSDA does is simply not a very good model. My question is now: if that is the case, would you want mSDA in Gensim anyway? Even if it's less useful than other models, I could still see someone wanting to use it for comparative purposes at some point.
@piskvorky Is mSDA still on the wishlist?
Hey!
I suggested this a while back, and I've made some progress, so I figured I would bring it up again. I implemented the marginalized stacked denoising autoencoder (mSDA) algorithm described in http://www.cse.wustl.edu/~mchen/papers/msdadomain.pdf - is there any interest in adding this to Gensim?
My implementation is memory-independent and trains a 1000-dimensional model on Wikipedia in around 12 hours on my machine (8 cores @ 3 GHz). I haven't tested it thoroughly yet, but some initial tests confirm that the representations it generates are able to capture topical similarity. It would be great if you could give me some ideas for a benchmark I could use to properly validate the model!
Repository is at https://github.com/phdowling/mSDA.
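For context, the core of the algorithm is a closed-form solution for each denoising layer rather than iterative training. A minimal NumPy sketch of a single marginalized denoising autoencoder (mDA) layer and a simple stacking loop, following the paper linked above, might look like this; the variable names and the small ridge term are assumptions for illustration and are not taken from the linked repository, which also handles high-dimensional input and streaming that this sketch omits.

```python
# Minimal sketch of mSDA's closed-form layer solution (Chen et al.),
# not the implementation from https://github.com/phdowling/mSDA.
import numpy as np

def mda_layer(X, noise=0.5):
    """X: (d, n) matrix, columns are documents.
    Returns the mapping W and the hidden representation tanh(W X)."""
    d, n = X.shape
    # append a constant feature so the mapping can learn a bias term
    Xb = np.vstack([X, np.ones((1, n))])
    # q[i] = probability that feature i survives corruption (bias never corrupted)
    q = np.full((d + 1, 1), 1.0 - noise)
    q[-1] = 1.0
    S = Xb @ Xb.T                       # scatter matrix
    # expected covariance of corrupted inputs
    Q = S * (q @ q.T)
    np.fill_diagonal(Q, q.ravel() * np.diag(S))
    # expected cross-covariance between clean and corrupted inputs
    P = S[:d, :] * q.T
    # solve W Q = P for W; small ridge term added for numerical stability
    W = np.linalg.solve(Q + 1e-5 * np.eye(d + 1), P.T).T
    return W, np.tanh(W @ Xb)

def msda(X, layers=3, noise=0.5):
    """Stack mDA layers; each layer is fit on the previous layer's output."""
    h = X
    reps = []
    for _ in range(layers):
        _, h = mda_layer(h, noise)
        reps.append(h)
    return reps
```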