
Add mSDA model #294

Open
phdowling opened this issue Feb 17, 2015 · 4 comments
Labels
difficulty hard (hard issue: requires deep gensim understanding & high Python/Cython skills), feature (issue describes a new feature), wishlist (feature request)

Comments

@phdowling
Contributor

Hey!

I suggested this a while back, and I've made some progress since, so I figured I'd bring it up again. I implemented the marginalized stacked denoising autoencoder (mSDA) algorithm described in http://www.cse.wustl.edu/~mchen/papers/msdadomain.pdf - is there any interest in adding this to Gensim?

My implementation is memory-independent and trains a 1000-dimensional model on Wikipedia in around 12 hours on my machine (8 cores @ 3 GHz). I haven't tested it thoroughly yet, but some initial tests confirm that the representations it generates capture topical similarity. It would be great if you could give me some ideas for a benchmark I could use to properly validate the model!

Repository is at https://github.com/phdowling/mSDA.
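
For reference, a single marginalized denoising layer has a closed-form solution described in the paper; here's a rough NumPy sketch of those equations (just an illustration with placeholder names, not the code from the repo above):

```python
import numpy as np

def mda_layer(X, p=0.5):
    """One marginalized denoising layer (closed-form solution from the paper).

    X : (d, n) array with one column per document (e.g. term frequencies).
    p : feature corruption ("dropout") probability.
    Returns the learned (d, d+1) mapping W and the layer output tanh(W Xb).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias feature
    q = np.full((d + 1, 1), 1.0 - p)       # per-feature survival probability
    q[-1] = 1.0                            # the bias feature is never corrupted
    S = Xb @ Xb.T                          # scatter matrix
    Q = S * (q @ q.T)                      # E[corrupted x corrupted^T]
    np.fill_diagonal(Q, np.diag(S) * q.ravel())
    Q += 1e-5 * np.eye(d + 1)              # tiny ridge for numerical stability
    P = S[:d, :] * q.T                     # E[clean x corrupted^T]
    W = np.linalg.solve(Q.T, P.T).T        # the W minimizing expected reconstruction error
    return W, np.tanh(W @ Xb)
```

The full mSDA then stacks several of these layers, feeding each layer's tanh output into the next; for large vocabularies the mapping is learned over blocks of features so the matrices stay manageable.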

@piskvorky
Owner

Looks good, thanks a lot Philipp!

CC @cscorley @gojomo @temerick @maciejkula for help with benchmarks / code review :)

@phdowling
Contributor Author

Hey again! How should I proceed with this? Are there some common classification tasks I could use as benchmarks? And at what point should I create a PR?

@phdowling
Contributor Author

Small update: I ran a basic text classification benchmark on Reuters-21578, comparing mSDA to plain bag-of-words, LSI, and random noise features. The good news is that mSDA is significantly better than random noise; the bad news is that it is outperformed by LSI, which in turn is outperformed by bag-of-words features.

More detailed results are below. I'm training only on the Reuters documents, which might explain why neither LSI nor mSDA learns features that outperform bag-of-words. Note that this is mSDA at 200 dimensions, which is not typical - usually around 1000 dimensions or so would be used.

### EVALUATION RESULTS ###

| Features         | TP  | FP | FN  | TN   | Test samples | Accuracy | Precision | Recall | F1     |
|------------------|-----|----|-----|------|--------------|----------|-----------|--------|--------|
| noise            | 1   | 0  | 214 | 1785 | 2000         | 0.8930   | 1.0000    | 0.0047 | 0.0093 |
| msda (200 dims)  | 17  | 10 | 197 | 1775 | 1999         | 0.8964   | 0.6296    | 0.0794 | 0.1411 |
| bow              | 119 | 15 | 95  | 1770 | 1999         | 0.9450   | 0.8881    | 0.5561 | 0.6839 |
| lsi              | 83  | 26 | 131 | 1759 | 1999         | 0.9215   | 0.7615    | 0.3879 | 0.5139 |

Here's mSDA at 1000 dimensions, otherwise the same task:

| Features         | TP | FP | FN  | TN   | Test samples | Accuracy | Precision | Recall | F1     |
|------------------|----|----|-----|------|--------------|----------|-----------|--------|--------|
| msda (1000 dims) | 33 | 47 | 181 | 1738 | 1999         | 0.8859   | 0.4125    | 0.1542 | 0.2245 |
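
For clarity on how the numbers are produced: they are standard precision / recall / F1 values computed from the confusion counts of a binary topic classifier trained on each feature set. A simplified scikit-learn-style sketch of that kind of comparison (placeholder names; not my actual benchmark script) looks like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def compare_feature_sets(feature_sets, y_train, y_test):
    """Train one binary classifier per feature set and report confusion metrics.

    feature_sets : dict mapping a name ("bow", "lsi", "msda", "noise") to a
                   (train_matrix, test_matrix) pair of document vectors.
    y_train, y_test : binary topic labels.
    All names here are placeholders for illustration only.
    """
    for name, (X_train, X_test) in sorted(feature_sets.items()):
        clf = LogisticRegression().fit(X_train, y_train)
        pred = clf.predict(X_test)
        tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
        p, r, f1, _ = precision_recall_fscore_support(y_test, pred, average="binary")
        print("%s: TP=%d FP=%d FN=%d TN=%d  P=%.4f R=%.4f F1=%.4f"
              % (name, tp, fp, fn, tn, p, r, f1))
```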

mSDA is also quite a bit slower in evaluation, since it needs to do around (size_of_dictionary / output_dimensionality) + num_layers dot products to generate each representation (or chunk thereof).
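
Concretely, the inference path looks roughly like the sketch below: the first layer needs one dot product per vocabulary block, and every further layer adds one more. This is a simplification with made-up names (the real code is in the repo linked above); in particular, how the per-block outputs are combined is glossed over here.

```python
import numpy as np

def msda_transform(x, block_mappings, layer_mappings):
    """Rough sketch of mSDA inference for a single document vector.

    x              : (vocab_size,) bag-of-words vector.
    block_mappings : one (output_dim, output_dim + 1) matrix per block of
                     `output_dim` vocabulary features, i.e. roughly
                     vocab_size / output_dim of them (assumes vocab_size is
                     a multiple of output_dim, for simplicity).
    layer_mappings : one (output_dim, output_dim + 1) matrix per additional
                     stacked layer.
    """
    output_dim = block_mappings[0].shape[0]
    # first layer: one dot product per vocabulary block; the block outputs
    # are simply summed here (a simplification of mine)
    h = np.zeros(output_dim)
    for i, W in enumerate(block_mappings):
        block = x[i * output_dim:(i + 1) * output_dim]
        h += W @ np.append(block, 1.0)      # bias feature appended
    h = np.tanh(h)
    # each further stacked layer adds one more dot product
    for W in layer_mappings:
        h = np.tanh(W @ np.append(h, 1.0))
    # total cost: about len(block_mappings) + len(layer_mappings)
    # matrix-vector products per document
    return h
```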

I can't say for sure that there are no errors in my implementation, but since there does seem to be some learning of useful patterns happening, I'm so far inclined to believe that the dimensionality reduction mSDA performs is simply not a very good model.

My question now is: if that is the case, would you want mSDA in Gensim anyway? Even if it's less useful than other models, I could still see someone wanting to use it for comparison purposes at some point.

@tmylk added the wishlist label Jan 23, 2016
@tmylk
Contributor

tmylk commented Jan 23, 2016

@piskvorky Is mSDA still on the wishlist?

@menshikh-iv added the feature and difficulty hard labels Oct 3, 2017