spikes #1

Open
notconvergingwtf opened this issue Feb 27, 2019 · 8 comments

Comments

@notconvergingwtf

Hi, do you have any suggestions for the following problem?
While training sdu (Nadam, lr=0.00025), this is the loss on the validation set:
[screenshot: validation loss curve with spikes]
A different model trained on the same data was fine.
Also, during training, loss value = nan starts to appear.

@deepinx
Owner

deepinx commented Mar 6, 2019

I just set network.sdu.net_coherent = True and revised line 579 of sym_heatmap.py to coherent_weight = 0.001; that seems to solve the nan problem.
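
For reference, the two edits look roughly like this. This is a minimal sketch: only the attribute name and the weight value come from this comment, while the surrounding file layout is an assumption.

# config.py -- enable the coherent branch of the SDU network
network.sdu.net_coherent = True

# sym_heatmap.py, around line 579 -- use a small weight for the coherent
# loss term so it cannot dominate the objective and push the loss to nan
coherent_weight = 0.001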

@notconvergingwtf
Author

Okay, thanks.
Sorry, but how did you manage to figure this out? It seems that network.sdu.net_coherent = True means keeping only those image transformations that don't affect the heatmap? How does this affect accuracy?

@deepinx
Owner

deepinx commented Mar 6, 2019

I did this following the guidance of the original paper, which says: "Therefore, we employ the CE loss for Lp-g and the MSE loss for Lp-p, respectively. λ is empirically set as 0.001 to guarantee convergence."
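
In code terms, the combined objective from that quote looks roughly like the sketch below. It is a minimal NumPy illustration of "CE for the prediction-vs-ground-truth term plus a small MSE coherent term"; all function and variable names here are mine, not the repo's.

import numpy as np

def combined_loss(pred_heatmap, gt_heatmap, pred_transformed, coherent_weight=0.001):
    """Illustrative only: binary CE between prediction and ground truth plus
    a small MSE 'coherent' term between two predictions of the same sample.
    Shapes and names are assumptions, not the actual repo code."""
    eps = 1e-12
    # CE term (Lp-g): predicted heatmap vs. ground-truth heatmap
    ce = -np.mean(gt_heatmap * np.log(pred_heatmap + eps)
                  + (1.0 - gt_heatmap) * np.log(1.0 - pred_heatmap + eps))
    # coherent term (Lp-p): MSE between predictions of the same image
    # under different augmentations
    mse = np.mean((pred_heatmap - pred_transformed) ** 2)
    # coherent_weight = 0.001 keeps the second term from destabilizing training
    return ce + coherent_weight * mse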

@notconvergingwtf
Author

Big thanks

@notconvergingwtf
Author

Hi, it's me again. After some training time, here is what I have:
[screenshot: loss curves with recurring spikes]
It doesn't look like overfitting on the training set; maybe there are some problems with convergence. Have you run into the same problem?

@deepinx
Owner

deepinx commented Mar 11, 2019

What batch size and lr do you use? You can try a different batch size or lr; perhaps that will solve your problem.

@notconvergingwtf
Author

notconvergingwtf commented Mar 11, 2019

Batch size is 16. The lr's are 1e-10 and 2e-6 (in the screenshot). Well, as you can see, decreasing the lr only delays the time until the spikes appear.

@deepinx
Owner

deepinx commented Mar 11, 2019

I used batch size 16 and lr 0.00002 for the first several epochs, and the spikes did not appear. You can try the following commands:

NETWORK='sdu'
MODELDIR='./model_2d'
mkdir -p "$MODELDIR"
PREFIX="$MODELDIR/$NETWORK"
LOGFILE="$MODELDIR/log_$NETWORK"

CUDA_VISIBLE_DEVICES='0' python -u train.py --network $NETWORK --prefix "$PREFIX" --per-batch-size 16 --lr 0.00002 --lr-step '16000,24000,30000' > "$LOGFILE" 2>&1 &

If this problem still appears, you may check the network parameters in config.py.
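
As a starting point, here is a hedged sketch of the config.py settings worth double-checking. Only network.sdu.net_coherent is confirmed earlier in this thread; the other attribute names simply mirror the command-line flags above, so treat them as assumptions.

# config.py -- values to verify if the spikes persist.
# Only network.sdu.net_coherent is confirmed in this thread; the remaining
# names mirror the command-line flags above and are assumptions.
network.sdu.net_coherent = True       # keep the coherent branch enabled
config.per_batch_size = 16            # matches --per-batch-size 16
config.lr = 0.00002                   # matches --lr 0.00002
config.lr_step = '16000,24000,30000'  # matches --lr-step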
