
LDA model: Incorrect treatment of damping factor when update() has multiple passes #298

Open
wbogorad opened this issue Feb 27, 2015 · 11 comments
Labels: bug, difficulty medium, feature

Comments

@wbogorad

Radim,

I think that rhot gets changed more frequently in ldamodel.update() than it should in online mode.
rhot is only supposed to damp each new chunk of documents over time, according to Hoffman's Algorithm 2.

However, in the case of multiple passes it also damps the updates progressively with each pass.
I am not sure that this is correct.
This is because do_mstep() is called during each pass, and it increases the self.num_updates variable that drives rhot.
As another consequence, new chunks get suppressed more than they should.
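
For illustration, a minimal sketch (not gensim's code) of the Algorithm 2 learning rate, rho_t = (tau_0 + t)**(-kappa), and of how incrementing t on every chunk of every pass makes it shrink faster than the schedule intends:

    # Hypothetical numbers; gensim's corresponding defaults are offset=1.0 and decay=0.5.
    tau_0, kappa = 1.0, 0.5
    num_chunks, passes = 10, 3

    t = 0
    for pass_ in range(passes):
        for chunk in range(num_chunks):
            rhot = (tau_0 + t) ** (-kappa)
            # ... e-step on the chunk, then m-step blending with weight rhot ...
            t += 1  # advanced on every chunk of every pass -> extra damping on later passes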

By the way, I appreciate that you introduced multiple passes; Hoffman does not have them, and in my experiments they do help achieve more robust convergence.

What do you think?

@cscorley
Contributor

Let me see if I follow you correctly:

Passes effectively repeats the corpus given to update(), increasing t (= chunks/minibatches seen), because each chunk is seen again with each pass. For the first pass, everything is as it should be (as in Algorithm 2). After the first pass, t is not reset; rather, the model continues as if it has seen more of the corpus (even though the chunks are the same).

This seems desirable to me; each pass through the corpus is damped more and more.

However, you are right about using update() on new corpora/chunks. E.g., if we use a single pass on corpus C_a (= 10 chunks), then t = 10. Giving update() another corpus C_b (= 7 chunks) later brings t to 17. When multiple passes are introduced, say 5 passes, then after updating with C_a we have t = 50. This is bad news for C_b, it seems, as it will not start with t = 10, but t = 50. Yikes!

What would be a better way to approach this? Ensure that after a corpus has been through all of its passes, t is set back to what it was after the first pass? E.g., following the above example, C_a begins with t = 0, increases it to 50, but leaves it at 10. That way, C_b begins with t = 10, increases it to 45, but leaves it at 17. Does it make sense to do that?
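
A rough bookkeeping sketch of the scenario above (the numbers and the helper are illustrative, not gensim code):

    def run_update(t, num_chunks, passes, reset_to_first_pass=False):
        """Return t after update(); optionally leave t as if only one pass had run."""
        t_after_first_pass = t + num_chunks
        t_final = t + num_chunks * passes
        return t_after_first_pass if reset_to_first_pass else t_final

    # current behaviour: C_a (10 chunks, 5 passes) pushes t to 50, so C_b starts over-damped
    t = run_update(0, 10, 5)                            # -> 50
    # proposed behaviour: after all passes, t is set back to its value after the first pass
    t = run_update(0, 10, 5, reset_to_first_pass=True)  # -> 10
    t = run_update(t, 7, 5, reset_to_first_pass=True)   # -> 17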

@wbogorad
Author

wbogorad commented Mar 1, 2015

Chris,
Yes, that is what has been worrying me.

As for a better way, I think that rho(t) should be left for its intended purpose, and another decaying variable rho(p), where p stands for passes, should be introduced.

Then for the next "corpus" rho(t) will take the value prescribed by Algorithm 2, while rho(p) is reset to its initial value. That seems like a nice compromise and an improvement.
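
A hedged sketch of that idea (the names rho_t/rho_p and the way the two factors are combined are illustrative, not a gensim API):

    tau_0, kappa = 1.0, 0.5
    passes = 5
    corpora = [range(10), range(7)]    # two hypothetical corpora of 10 and 7 chunks

    def rho(count):
        return (tau_0 + count) ** (-kappa)

    t = 0                              # counts genuinely new chunks across update() calls
    for corpus in corpora:
        for pass_ in range(passes):    # rho(pass_) restarts at 1.0 for every new corpus
            for chunk in corpus:
                rhot = rho(t) * rho(pass_)  # one possible way to combine the two factors
                # ... e-step / m-step blending with weight rhot ...
                if pass_ == 0:
                    t += 1             # only the first pass advances Hoffman's t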

Do you agree?

@cscorley
Contributor

Ping @wbogorad , @piskvorky

Let me know what you think about this implementation. I haven't tested it on Wikipedia (yet), but rho(t) resets as expected when passes are involved. Below are the rho values at the beginning of the passes loop, before the update.

3 updates with a single pass (the expected values):

PASS 1: 1.0
PASS 1: 0.316227766017
PASS 1: 0.229415733871

3 updates with five passes (the test values):

update 1:

PASS 1: 1.0
PASS 2: 0.229415733871
PASS 3: 0.188982236505
PASS 4: 0.164398987305
PASS 5: 0.147441956155

update 2:

PASS 1: 0.316227766017
PASS 2: 0.164398987305
PASS 3: 0.134839972493
PASS 4: 0.117041147196
PASS 5: 0.104828483672

update 3:

PASS 1: 0.229415733871
PASS 2: 0.134839972493
PASS 3: 0.110431526075
PASS 4: 0.0957826285221
PASS 5: 0.0857492925713

Hence, for each update, the rho value resets, while increasing passes dampens the value for each pass.
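
As a sanity check, the single-pass "expected values" above are consistent with Hoffman's schedule rho_t = (tau_0 + t)**(-kappa) under gensim's defaults offset (tau_0) = 1.0 and decay (kappa) = 0.5, assuming each update covers 9 chunks (that chunk count is inferred from the numbers, not stated explicitly here):

    tau_0, kappa = 1.0, 0.5
    for t in (0, 9, 18):
        print((tau_0 + t) ** (-kappa))  # 1.0, 0.316227766..., 0.229415733...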

cscorley added a commit that referenced this issue Apr 27, 2015
- Fixes update_alpha problem when calling rho(); pass_ wasn't in scope
- Change rho:
    - To not rely on num_updates/chunksize: when updating from a corpus of size < chunksize, rho remains 1.0 for most updates.
    - *add* the pass count to the number of updates
    - Change num_updates to actually be the number of updates, rather than the number of documents seen. Matches the paper now; this relates to the chunksize problem.
cscorley added a commit that referenced this issue Apr 27, 2015
- Fixes update_alpha problem when calling rho(); pass_ wasn't in scope
- Change rho to *add* the pass count to the number of updates
- On update, set chunksize to the corpus length if len(corpus) < chunksize and no chunksize was specified
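
A minimal sketch of the schedule these commit messages describe (the exact expression in the merged patch may differ; this only shows the offset, pass count, and update count being added before the decay exponent is applied):

    def rho(offset, decay, pass_, update_count):
        # update_count plays the role of Hoffman's updatect (chunk updates seen so far)
        return pow(offset + pass_ + update_count, -decay)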
@cscorley
Contributor

Squashed the previous commits on the branch into more appropriate ones. The one specific to this issue is commit edc3ce5.

@piskvorky
Owner

@wbogorad let us know if this works better for you. I have no preference here. Cheers!

@wbogorad
Author

Sorry guys for my slow replies.
Will review by the end of the week.


@wbogorad
Author

wbogorad commented May 2, 2015

Hi Chris and Radim,

I have reviewed the code and I agree with Chris's changes.

To summarize, num_updates gets increased only during pass_=0, so rho dampens more gently and more in line with Hoffman's implementation, considering that num_updates/chunksize corresponds to Hoffman's updatect variable.

Just for convenience, here are the relevant lines from his code:

    def update_lambda(self, docs):
        # do estep(docs)
        ...
        rhot = pow(self._tau0 + self._updatect, -self._kappa)
        ...
        self._updatect += 1

where docs corresponds to mini-batches, roughly similar to chunks.
So I support the change and recommend merging into the main branch.

Now I would like to add a few words about the differences between gensim and Hoffman's implementation.
Hoffman's implementation is a true online algorithm, meaning that it goes through the data once; he relies on the property described in his paper that if the sum of the learning rates rho(t) diverges while the sum of rho(t)**2 converges, then the algorithm converges to a local optimum.
IMHO Radim's implementation is neither online nor mini-batch, but rather a variation of the batch algorithm, because it goes through the data several times:

    for each pass:
        for each chunk in corpus:
            perform the e-step
            perform the m-step (again on every pass)

I think that if one is OK with going through the data several times, a better algorithm would be the following:

    for each chunk:
        for each pass:
            perform the e-step on the chunk, updating gamma
            perform the m-step on the chunk, updating lambda

Please note that we keep the same gamma matrix in the inner loop.
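
A hedged sketch of this chunk-outer / pass-inner variant (illustrative only, with stubbed e/m-steps; not gensim's implementation):

    tau_0, kappa, inner_passes = 1.0, 0.5, 5
    corpus = [["doc1", "doc2"], ["doc3", "doc4"]]  # hypothetical corpus of two chunks

    def e_step(chunk, gamma):
        # stub: refine the per-document variational parameters gamma for this chunk
        return gamma

    def m_step(chunk, gamma, rhot):
        # stub: blend the chunk's sufficient statistics into lambda with weight rhot
        pass

    t = 0
    for chunk in corpus:
        gamma = None                        # stands in for freshly initialised gamma
        for _ in range(inner_passes):
            gamma = e_step(chunk, gamma)    # the same gamma is reused across inner passes
            rhot = (tau_0 + t) ** (-kappa)
            m_step(chunk, gamma, rhot)
        t += 1                              # whether t should also advance on inner passes
                                            # is a design choice the thread leaves open

Once the inner loop for a chunk finishes, the chunk can be discarded, exactly as described above.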

This algorithm is true mini-batch gradient descent, and the learning schedule can be exactly as in Hoffman's.
Also note that once a chunk has been processed it can be safely discarded, because it is not going to be used again.
This algorithm has known theoretical guarantees, and by varying the number of passes or using a relative bound() as a stopping criterion it can fit the data as tightly or as loosely as desired, depending on how tolerant to overfitting the problem is.


@cscorley
Contributor

cscorley commented May 7, 2015

Great, I've merged in the changes thus far. Leaving this issue open to revisit the chunk/pass switch when time allows :)

@piskvorky
Owner

@wbogorad the multi-pass mode is a strict extension of the original single-pass algorithm. If you want Hoffman's original version, set passes=1 (the default).
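
For reference, a minimal usage sketch (the tiny corpus and parameter values are illustrative only):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    texts = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # passes=1 keeps the single-pass schedule of Hoffman's original online algorithm
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=1)
    lda.update(corpus)  # further online updates with new chunks of documents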

cscorley added a commit that referenced this issue May 24, 2015
@tmylk
Contributor

tmylk commented Jan 23, 2016

@cscorley Do you plan to continue working on the chunk/pass switch?

@cscorley
Contributor

I had worked on an implementation, but never got around to finishing it, for a couple of reasons:

Right now, I know scikit-learn also takes the same approach. However, it's only part of their batch mode; I'm not sure it was a move beyond what gensim does.

I think swapping them (especially in the online case) makes sense, as @wbogorad explains. Batch mode remains the same either way. But I'm also not sure whether that sort of switch is going to be a big deal to end users or not. I'm guessing that it would be.

@menshikh-iv added the bug, difficulty medium, and feature labels on Oct 3, 2017