
LDA model: Incorrect treatment of damping factor when update() has multiple passes #298

Open
wbogorad opened this issue Feb 27, 2015 · 11 comments
Labels: bug, difficulty medium, feature

Comments

@wbogorad

Radim,

I think that rhot gets changed more frequently in ldamodel.update() than it should in online mode.
rhot is only supposed to damp each new chunk of documents over time, according to Hoffman's Algorithm 2.

However, in the case of multiple passes it also damps the updates progressively with each pass.
I am not sure that this is correct.
This is because do_mstep() is called during each pass, and it increases the self.num_updates variable that drives rhot.
As another consequence, new chunks get suppressed more than they should.
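
For illustration, a minimal sketch (not gensim's code) of the Algorithm 2 learning rate, rho_t = (tau_0 + t)**(-kappa), and of how incrementing t on every chunk of every pass makes it shrink faster than the schedule intends:

    # Hypothetical numbers; gensim's corresponding defaults are offset=1.0 and decay=0.5.
    tau_0, kappa = 1.0, 0.5
    num_chunks, passes = 10, 3

    t = 0
    for pass_ in range(passes):
        for chunk in range(num_chunks):
            rhot = (tau_0 + t) ** (-kappa)
            # ... e-step on the chunk, then m-step blending with weight rhot ...
            t += 1  # advanced on every chunk of every pass -> extra damping on later passes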

By the way, I appreciate that you introduced multiple passes; Hoffman does not have them, and in my experiments they do help achieve more robust convergence.

What do you think?

@cscorley
Contributor

Let me see if I follow you correctly:

Passes effectively repeats the corpus given to update(), increasing t (= chunks/minibatches seen), because each chunk is seen again with each pass. For the first pass, everything is as it should be (as in Algorithm 2). After the first pass, t is not reset; rather, the model continues as if it has seen more of the corpus (even though the chunks are the same).

This seems desirable to me; each pass through the corpus is damped more and more.

However, you are right about using update() on new corpora/chunks. E.g., if we use a single pass on corpus C_a (= 10 chunks), then t = 10. Giving update() another corpus C_b (= 7 chunks) later brings t to 17. When multiple passes are introduced, say 5 passes, then after updating with C_a we have t = 50. This is bad news for C_b, it seems, as it will not start with t = 10, but t = 50. Yikes!

What would be a better way to approach this? Ensure that after a corpus has been through all of its passes, t is set back to what it was after the first pass? E.g., following the above example, C_a begins with t = 0, increases it to 50, but leaves it at 10. That way, C_b begins with t = 10, increases it to 45, but leaves it at 17. Does it make sense to do that?
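
A rough bookkeeping sketch of the scenario above (the numbers and the helper are illustrative, not gensim code):

    def run_update(t, num_chunks, passes, reset_to_first_pass=False):
        """Return t after update(); optionally leave t as if only one pass had run."""
        t_after_first_pass = t + num_chunks
        t_final = t + num_chunks * passes
        return t_after_first_pass if reset_to_first_pass else t_final

    # current behaviour: C_a (10 chunks, 5 passes) pushes t to 50, so C_b starts over-damped
    t = run_update(0, 10, 5)                            # -> 50
    # proposed behaviour: after all passes, t is set back to its value after the first pass
    t = run_update(0, 10, 5, reset_to_first_pass=True)  # -> 10
    t = run_update(t, 7, 5, reset_to_first_pass=True)   # -> 17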

@wbogorad
Author

wbogorad commented Mar 1, 2015

Chris,
Yes, that is what has been worrying me.

As for a better way, I think that rho(t) should be left for its intended purpose, and another decaying variable rho(p), where p stands for passes, should be introduced.

Then for the next "corpus" rho(t) will take the value prescribed by Algorithm 2, while rho(p) is reset to its initial value. That seems like a nice compromise and an improvement.
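
A hedged sketch of that idea (the names rho_t/rho_p and the way the two factors are combined are illustrative, not a gensim API):

    tau_0, kappa = 1.0, 0.5
    passes = 5
    corpora = [range(10), range(7)]    # two hypothetical corpora of 10 and 7 chunks

    def rho(count):
        return (tau_0 + count) ** (-kappa)

    t = 0                              # counts genuinely new chunks across update() calls
    for corpus in corpora:
        for pass_ in range(passes):    # rho(pass_) restarts at 1.0 for every new corpus
            for chunk in corpus:
                rhot = rho(t) * rho(pass_)  # one possible way to combine the two factors
                # ... e-step / m-step blending with weight rhot ...
                if pass_ == 0:
                    t += 1             # only the first pass advances Hoffman's t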

Do you agree?

@cscorley
Contributor

Ping @wbogorad , @piskvorky

Let me know what you think about this implementation. I haven't tested it on Wikipedia (yet), but rho(t) resets as expected when passes are involved. Below are the rho values at the beginning of the passes loop, before the update.

3 updates with a single pass (the expected values):

PASS 1: 1.0
PASS 1: 0.316227766017
PASS 1: 0.229415733871

3 updates with five passes (the test values):

update 1:

PASS 1: 1.0
PASS 2: 0.229415733871
PASS 3: 0.188982236505
PASS 4: 0.164398987305
PASS 5: 0.147441956155

update 2:

PASS 1: 0.316227766017
PASS 2: 0.164398987305
PASS 3: 0.134839972493
PASS 4: 0.117041147196
PASS 5: 0.104828483672

update 3:

PASS 1: 0.229415733871
PASS 2: 0.134839972493
PASS 3: 0.110431526075
PASS 4: 0.0957826285221
PASS 5: 0.0857492925713

Hence, for each update, the rho value resets, while increasing passes dampens the value for each pass.
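
As a sanity check, the single-pass "expected values" above are consistent with Hoffman's schedule rho_t = (tau_0 + t)**(-kappa) under gensim's defaults offset (tau_0) = 1.0 and decay (kappa) = 0.5, assuming each update covers 9 chunks (that chunk count is inferred from the numbers, not stated explicitly here):

    tau_0, kappa = 1.0, 0.5
    for t in (0, 9, 18):
        print((tau_0 + t) ** (-kappa))  # 1.0, 0.316227766..., 0.229415733...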

cscorley added a commit that referenced this issue Apr 27, 2015
- Fixes update_alpha problem when calling rho(); pass_ wasn't in scope
- Change rho:
    - To not rely on num_updates/chunksize: when updating from a corpus of size < chunksize, rho remains 1.0 for most updates.
    - *add* the pass count to the number of updates
    - Change num_updates to actually be the number of updates, rather than the number of documents seen. Matches the paper now; this relates to the chunksize problem.
cscorley added a commit that referenced this issue Apr 27, 2015
- Fixes update_alpha problem when calling rho(); pass_ wasn't in scope
- Change rho to *add* the pass count to the number of updates
- On update, set chunksize to the corpus length if len(corpus) < chunksize and no chunksize was specified
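
A minimal sketch of the schedule these commit messages describe (the exact expression in the merged patch may differ; this only shows the offset, pass count, and update count being added before the decay exponent is applied):

    def rho(offset, decay, pass_, update_count):
        # update_count plays the role of Hoffman's updatect (chunk updates seen so far)
        return pow(offset + pass_ + update_count, -decay)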
@cscorley
Contributor

Squashed the previous commits on the branch into more appropriate ones. The one specific to this issue is commit edc3ce5.

@piskvorky
Owner

@wbogorad let us know if this works better for you. I have no preference here. Cheers!

@wbogorad
Author

Sorry guys for my slow replies.
Will review by the end of the week.


@wbogorad
Author

wbogorad commented May 2, 2015

Hi Chris and Radim,

I have reviewed the code and I agree with Chris's changes.

To summarize, num_updates gets increased only during pass_=0, so rho dampens more gently and more in line with Hoffman's implementation, considering that num_updates/chunksize corresponds to Hoffman's updatect variable.

Just for convenience, here are the relevant lines from his code:

    def update_lambda(self, docs):
        # do estep(docs)
        ...
        rhot = pow(self._tau0 + self._updatect, -self._kappa)
        ...
        self._updatect += 1

where docs corresponds to mini-batches, roughly similar to chunks.
So I support the change and recommend merging into the main branch.

Now I would like to add a few words about the differences between gensim and Hoffman's implementation.
Hoffman's implementation is a true online algorithm, meaning that it goes through the data once; he relies on the property described in his paper that if the sum of the learning rates rho(t) diverges while the sum of rho(t)**2 converges, then the algorithm converges to a local optimum.
IMHO Radim's implementation is neither online nor mini-batch, but rather a variation of the batch algorithm, because it goes through the data several times:

    for each pass:
        for each chunk in corpus:
            perform the e-step
            perform the m-step (again on every pass)

I think that if one is OK with going through the data several times, a better algorithm would be the following:

    for each chunk:
        for each pass:
            perform the e-step on the chunk, updating gamma
            perform the m-step on the chunk, updating lambda

Please note that we keep the same gamma matrix in the inner loop.
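
A hedged sketch of this chunk-outer / pass-inner variant (illustrative only, with stubbed e/m-steps; not gensim's implementation):

    tau_0, kappa, inner_passes = 1.0, 0.5, 5
    corpus = [["doc1", "doc2"], ["doc3", "doc4"]]  # hypothetical corpus of two chunks

    def e_step(chunk, gamma):
        # stub: refine the per-document variational parameters gamma for this chunk
        return gamma

    def m_step(chunk, gamma, rhot):
        # stub: blend the chunk's sufficient statistics into lambda with weight rhot
        pass

    t = 0
    for chunk in corpus:
        gamma = None                        # stands in for freshly initialised gamma
        for _ in range(inner_passes):
            gamma = e_step(chunk, gamma)    # the same gamma is reused across inner passes
            rhot = (tau_0 + t) ** (-kappa)
            m_step(chunk, gamma, rhot)
        t += 1                              # whether t should also advance on inner passes
                                            # is a design choice the thread leaves open

Once the inner loop for a chunk finishes, the chunk can be discarded, exactly as described above.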

This algorithm is true mini-batch gradient descent, and the learning schedule can be exactly as in Hoffman's.
Also note that once a chunk has been processed it can be safely discarded, because it is not going to be used again.
This algorithm has known theoretical guarantees, and by varying the number of passes or using a relative bound() as a stopping criterion it can fit the data as tightly or as loosely as desired, depending on how tolerant to overfitting the problem is.


@cscorley
Contributor

cscorley commented May 7, 2015

Great, I've merged in the changes thus far. Leaving this issue open to revisit the chunk/pass switch when time allows :)

@piskvorky
Owner

@wbogorad the multi-pass mode is a strict extension of the original single-pass algorithm. If you want Hoffman's original version, set passes=1 (the default).
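
For reference, a minimal usage sketch (the tiny corpus and parameter values are illustrative only):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    texts = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # passes=1 keeps the single-pass schedule of Hoffman's original online algorithm
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=1)
    lda.update(corpus)  # further online updates with new chunks of documents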

cscorley added a commit that referenced this issue May 24, 2015
@tmylk
Contributor

tmylk commented Jan 23, 2016

@cscorley Do you plan to continue working on the chunk/pass switch?

@cscorley
Contributor

I had worked on an implementation, but never got around to finishing it, for a couple of reasons:

Right now, I know scikit-learn also takes the same approach. However, it's only part of their batch mode; I'm not sure it was a move beyond what gensim does.

I think swapping them (especially in the online case) makes sense, as @wbogorad explains. Batch mode remains the same either way. But I'm also not sure whether that sort of switch is going to be a big deal to end users or not. I'm guessing that it would be.

@menshikh-iv added the bug, difficulty medium, and feature labels on Oct 3, 2017