
Add mSDA model #294

Open
phdowling opened this issue Feb 17, 2015 · 4 comments
Labels
difficulty hard (hard issue: requires deep gensim understanding & high Python/Cython skills), feature (issue describes a new feature), wishlist (feature request)

Comments

@phdowling
Contributor

Hey!

I suggested this a while back, and I've made some progress since, so I figured I'd bring it up again. I implemented the marginalized stacked denoising autoencoder (mSDA) algorithm described in http://www.cse.wustl.edu/~mchen/papers/msdadomain.pdf - is there any interest in adding this to Gensim?

My implementation is memory-independent and trains a 1000-dimensional model on Wikipedia in around 12 hours on my machine (8 cores @ 3 GHz). I haven't tested it thoroughly yet, but some initial tests confirm that the representations it generates capture topical similarity. It would be great if you could give me some ideas for a benchmark I could use to properly validate the model!

Repository is at https://github.com/phdowling/mSDA.
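
For reference, a single marginalized denoising layer has a closed-form solution described in the paper; here's a rough NumPy sketch of those equations (just an illustration with placeholder names, not the code from the repo above):

```python
import numpy as np

def mda_layer(X, p=0.5):
    """One marginalized denoising layer (closed-form solution from the paper).

    X : (d, n) array with one column per document (e.g. term frequencies).
    p : feature corruption ("dropout") probability.
    Returns the learned (d, d+1) mapping W and the layer output tanh(W Xb).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias feature
    q = np.full((d + 1, 1), 1.0 - p)       # per-feature survival probability
    q[-1] = 1.0                            # the bias feature is never corrupted
    S = Xb @ Xb.T                          # scatter matrix
    Q = S * (q @ q.T)                      # E[corrupted x corrupted^T]
    np.fill_diagonal(Q, np.diag(S) * q.ravel())
    Q += 1e-5 * np.eye(d + 1)              # tiny ridge for numerical stability
    P = S[:d, :] * q.T                     # E[clean x corrupted^T]
    W = np.linalg.solve(Q.T, P.T).T        # the W minimizing expected reconstruction error
    return W, np.tanh(W @ Xb)
```

The full mSDA then stacks several of these layers, feeding each layer's tanh output into the next; for large vocabularies the mapping is learned over blocks of features so the matrices stay manageable.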

@piskvorky
Owner

Looks good, thanks a lot Philipp!

CC @cscorley @gojomo @temerick @maciejkula for help with benchmarks / code review :)

@phdowling
Contributor Author

Hey again! How should I proceed with this? Are there some common classification tasks I could use as benchmarks? And at what point should I create a PR?

@phdowling
Contributor Author

Small update: I ran a basic text classification benchmark on Reuters-21578, comparing mSDA to plain bag-of-words, LSI, and random noise features. The good news is that mSDA is significantly better than random noise; the bad news is that it is outperformed by LSI, which in turn is outperformed by bag-of-words features.

More detailed results are below. I'm training only on the Reuters documents, which might explain why neither LSI nor mSDA learns features that outperform bag-of-words. Note that this is mSDA at 200 dimensions, which is not typical - usually around 1000 dimensions or so would be used.

### EVALUATION RESULTS ###

| Features         | TP  | FP | FN  | TN   | Test samples | Accuracy | Precision | Recall | F1     |
|------------------|-----|----|-----|------|--------------|----------|-----------|--------|--------|
| noise            | 1   | 0  | 214 | 1785 | 2000         | 0.8930   | 1.0000    | 0.0047 | 0.0093 |
| msda (200 dims)  | 17  | 10 | 197 | 1775 | 1999         | 0.8964   | 0.6296    | 0.0794 | 0.1411 |
| bow              | 119 | 15 | 95  | 1770 | 1999         | 0.9450   | 0.8881    | 0.5561 | 0.6839 |
| lsi              | 83  | 26 | 131 | 1759 | 1999         | 0.9215   | 0.7615    | 0.3879 | 0.5139 |

Here's mSDA at 1000 dimensions, otherwise the same task:

| Features         | TP | FP | FN  | TN   | Test samples | Accuracy | Precision | Recall | F1     |
|------------------|----|----|-----|------|--------------|----------|-----------|--------|--------|
| msda (1000 dims) | 33 | 47 | 181 | 1738 | 1999         | 0.8859   | 0.4125    | 0.1542 | 0.2245 |
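
For clarity on how the numbers are produced: they are standard precision / recall / F1 values computed from the confusion counts of a binary topic classifier trained on each feature set. A simplified scikit-learn-style sketch of that kind of comparison (placeholder names; not my actual benchmark script) looks like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def compare_feature_sets(feature_sets, y_train, y_test):
    """Train one binary classifier per feature set and report confusion metrics.

    feature_sets : dict mapping a name ("bow", "lsi", "msda", "noise") to a
                   (train_matrix, test_matrix) pair of document vectors.
    y_train, y_test : binary topic labels.
    All names here are placeholders for illustration only.
    """
    for name, (X_train, X_test) in sorted(feature_sets.items()):
        clf = LogisticRegression().fit(X_train, y_train)
        pred = clf.predict(X_test)
        tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
        p, r, f1, _ = precision_recall_fscore_support(y_test, pred, average="binary")
        print("%s: TP=%d FP=%d FN=%d TN=%d  P=%.4f R=%.4f F1=%.4f"
              % (name, tp, fp, fn, tn, p, r, f1))
```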

mSDA is also quite a bit slower in evaluation, since it needs to do around (size_of_dictionary / output_dimensionality) + num_layers dot products to generate each representation (or chunk thereof).
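
Concretely, the inference path looks roughly like the sketch below: the first layer needs one dot product per vocabulary block, and every further layer adds one more. This is a simplification with made-up names (the real code is in the repo linked above); in particular, how the per-block outputs are combined is glossed over here.

```python
import numpy as np

def msda_transform(x, block_mappings, layer_mappings):
    """Rough sketch of mSDA inference for a single document vector.

    x              : (vocab_size,) bag-of-words vector.
    block_mappings : one (output_dim, output_dim + 1) matrix per block of
                     `output_dim` vocabulary features, i.e. roughly
                     vocab_size / output_dim of them (assumes vocab_size is
                     a multiple of output_dim, for simplicity).
    layer_mappings : one (output_dim, output_dim + 1) matrix per additional
                     stacked layer.
    """
    output_dim = block_mappings[0].shape[0]
    # first layer: one dot product per vocabulary block; the block outputs
    # are simply summed here (a simplification of mine)
    h = np.zeros(output_dim)
    for i, W in enumerate(block_mappings):
        block = x[i * output_dim:(i + 1) * output_dim]
        h += W @ np.append(block, 1.0)      # bias feature appended
    h = np.tanh(h)
    # each further stacked layer adds one more dot product
    for W in layer_mappings:
        h = np.tanh(W @ np.append(h, 1.0))
    # total cost: about len(block_mappings) + len(layer_mappings)
    # matrix-vector products per document
    return h
```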

I can't say for sure that there are no errors in my implementation, but since there does seem to be some learning of useful patterns happening, I'm so far inclined to believe that the dimensionality reduction mSDA performs is simply not a very good model.

My question now is: if that is the case, would you want mSDA in Gensim anyway? Even if it's less useful than other models, I could still see someone wanting to use it for comparison purposes at some point.

@tmylk added the wishlist label Jan 23, 2016
@tmylk
Contributor

tmylk commented Jan 23, 2016

@piskvorky Is mSDA still on the wishlist?

@menshikh-iv added the feature and difficulty hard labels Oct 3, 2017