
Share PPL results #21

Open
saareliad opened this issue Aug 14, 2019 · 1 comment


saareliad commented Aug 14, 2019

Hi,
It was not clear to me from the article what your final PPL results are for each model.
Can you share them as well?

At first glance I actually thought you achieved the same or comparable PPL results, but now I'm not sure. Can you clarify?

Do you have a baseline model with comparable PPL to the original base model?

Can someone use what you did as a baseline for smaller-scale research (4-8 "commodity" GPUs, for example)?

Extra detail on total training time:
I noticed that you count in tokens instead of steps, where tokens_per_global_batch = global_batch_size * seq_len.
Using the parameters in the script, a simple calculation yields the following step counts (see the sketch after the table):

| config | num GPUs | max tokens | seq len | base batch | global batch size | tokens per batch | required steps | PPL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| single machine | 1 | 1.8B | 128 | 32 | 32 | 4096 | 439453.125 | ? |
| single machine | 2 | 1.8B | 128 | 32 | 64 | 8192 | 219726.5625 | ? |
| single machine | 4 | 1.8B | 128 | 32 | 128 | 16384 | 109863.2813 | ? |
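For reference, a minimal Python sketch of the arithmetic behind the table (my own back-of-the-envelope calculation, not code from this repo):

```python
# Hypothetical helper, just to reproduce the numbers in the table above:
# required_steps = max_tokens / (global_batch_size * seq_len)
def required_steps(max_tokens, global_batch_size, seq_len):
    tokens_per_batch = global_batch_size * seq_len
    return max_tokens / tokens_per_batch

for num_gpus, global_bs in [(1, 32), (2, 64), (4, 128)]:
    print(num_gpus, "GPU(s):", required_steps(1.8e9, global_bs, seq_len=128), "steps")
# 1 GPU(s): 439453.125 steps
# 2 GPU(s): 219726.5625 steps
# 4 GPU(s): 109863.28125 steps
```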

Comparing with the base_wiki103 config from the original repo
(they used only data parallel), we get:

| config | num GPUs | tokens | seq len | base batch | global batch size | tokens per batch | steps | PPL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| original-base-wt103 | don't care | 1.92B | 150 | don't care | 64 | 9600 | 200000 | 24 |
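The same arithmetic in reverse for the original config (again just my own sanity check, not from the repo):

```python
# Original base_wt103 config: total tokens = steps * global_batch_size * seq_len
total_tokens = 200_000 * 64 * 150
print(total_tokens / 1e9, "B tokens")  # 1.92 B tokens, vs. 1.8B in the script here
```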

=> They trained on many more tokens.
If your results are really comparable, the model you present here is worth using as a baseline for future Transformer-XL experiments because it's faster. Right?

@yaroslavvb

I don't have easily accessible perplexity results right now; I'll update once I get logging infrastructure in place. But generally, switching from DP to DDP gave about a 30% increase in throughput, so even with the same hyperparameters you get a speed-up compared to the original version.
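For anyone reading along, the DP-to-DDP switch is essentially the standard PyTorch pattern below (a minimal sketch, not the exact code in this repo; it assumes a launcher such as torchrun that sets LOCAL_RANK and starts one process per GPU, and uses a trivial stand-in model):

```python
# Minimal sketch of switching from DataParallel to DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in for the real Transformer-XL model.
model = torch.nn.Linear(512, 512).cuda(local_rank)

# Old single-process style:
#   model = torch.nn.DataParallel(model)
# DDP style, one process per GPU:
model = DDP(model, device_ids=[local_rank])
```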
