spikes #1

Open
notconvergingwtf opened this issue Feb 27, 2019 · 8 comments

Comments

@notconvergingwtf

Hi, do you have any suggestions for the following problem?
While training sdu (Nadam, lr=0.00025), this is the loss on the validation set:
[screenshot: validation loss curve with spikes]
A different model trained on the same data was fine.
Also, during training, loss value = nan starts to appear.

@deepinx
Owner

deepinx commented Mar 6, 2019

I just set network.sdu.net_coherent = True and revised line 579 of sym_heatmap.py to coherent_weight = 0.001; that seems to solve the nan problem.
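
For reference, the two edits look roughly like this. This is a minimal sketch: only the attribute name and the weight value come from this comment, while the surrounding file layout is an assumption.

# config.py -- enable the coherent branch of the SDU network
network.sdu.net_coherent = True

# sym_heatmap.py, around line 579 -- use a small weight for the coherent
# loss term so it cannot dominate the objective and push the loss to nan
coherent_weight = 0.001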

@notconvergingwtf
Author

Okay, thanks.
Sorry, but how did you manage to figure this out? It seems that network.sdu.net_coherent = True means keeping only those image transformations that don't affect the heatmap? How does this affect accuracy?

@deepinx
Owner

deepinx commented Mar 6, 2019

I did this following the guidance of the original paper, which says: "Therefore, we employ the CE loss for Lp-g and the MSE loss for Lp-p, respectively. λ is empirically set as 0.001 to guarantee convergence."
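
In code terms, the combined objective from that quote looks roughly like the sketch below. It is a minimal NumPy illustration of "CE for the prediction-vs-ground-truth term plus a small MSE coherent term"; all function and variable names here are mine, not the repo's.

import numpy as np

def combined_loss(pred_heatmap, gt_heatmap, pred_transformed, coherent_weight=0.001):
    """Illustrative only: binary CE between prediction and ground truth plus
    a small MSE 'coherent' term between two predictions of the same sample.
    Shapes and names are assumptions, not the actual repo code."""
    eps = 1e-12
    # CE term (Lp-g): predicted heatmap vs. ground-truth heatmap
    ce = -np.mean(gt_heatmap * np.log(pred_heatmap + eps)
                  + (1.0 - gt_heatmap) * np.log(1.0 - pred_heatmap + eps))
    # coherent term (Lp-p): MSE between predictions of the same image
    # under different augmentations
    mse = np.mean((pred_heatmap - pred_transformed) ** 2)
    # coherent_weight = 0.001 keeps the second term from destabilizing training
    return ce + coherent_weight * mse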

@notconvergingwtf
Author

Big thanks

@notconvergingwtf
Author

Hi, it's me again. After some training time, here is what I have:
[screenshot: loss curves with recurring spikes]
It doesn't look like overfitting on the training set; maybe there are some problems with convergence. Have you run into the same problem?

@deepinx
Owner

deepinx commented Mar 11, 2019

What batch size and lr do you use? You can try a different batch size or lr; perhaps that will solve your problem.

@notconvergingwtf
Author

notconvergingwtf commented Mar 11, 2019

Batch size is 16. The lr's are 1e-10 and 2e-6 (in the screenshot). Well, as you can see, decreasing the lr only delays the time until the spikes appear.

@deepinx
Owner

deepinx commented Mar 11, 2019

I used batch size 16 and lr 0.00002 for the first several epochs, and the spikes did not appear. You can try the following commands:

NETWORK='sdu'
MODELDIR='./model_2d'
mkdir -p "$MODELDIR"
PREFIX="$MODELDIR/$NETWORK"
LOGFILE="$MODELDIR/log_$NETWORK"

CUDA_VISIBLE_DEVICES='0' python -u train.py --network $NETWORK --prefix "$PREFIX" --per-batch-size 16 --lr 0.00002 --lr-step '16000,24000,30000' > "$LOGFILE" 2>&1 &

If this problem still appears, you may check the network parameters in config.py.
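
As a starting point, here is a hedged sketch of the config.py settings worth double-checking. Only network.sdu.net_coherent is confirmed earlier in this thread; the other attribute names simply mirror the command-line flags above, so treat them as assumptions.

# config.py -- values to verify if the spikes persist.
# Only network.sdu.net_coherent is confirmed in this thread; the remaining
# names mirror the command-line flags above and are assumptions.
network.sdu.net_coherent = True       # keep the coherent branch enabled
config.per_batch_size = 16            # matches --per-batch-size 16
config.lr = 0.00002                   # matches --lr 0.00002
config.lr_step = '16000,24000,30000'  # matches --lr-step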
