Hi,
It was not clear to me from the article what your final PPL results are for each model.
Can you share them too?
At first glance I thought you achieved the same or comparable PPL results, but I am not sure about that now. Can you clarify?
Do you have a baseline model with PPL comparable to the original base model?
Can someone use what you did as a baseline for smaller-scale research (on 4-8 "commodity" GPUs, for example)?
Extra detail on total training time:
I noticed that you count in tokens instead of steps,
where `tokens_per_global_batch = global_batch_size * seq_len`.
Using the parameters in the script, a simple calculation yields, in steps (see the sketch after the table):
| config | num gpus | max tokens | seq len | base batch | global batch size | tokens per batch | required steps | PPL |
|---|---|---|---|---|---|---|---|---|
| single machine | 1 | 1.8B | 128 | 32 | 32 | 4096 | 439453.125 | ? |
| single machine | 2 | 1.8B | 128 | 32 | 64 | 8192 | 219726.5625 | ? |
| single machine | 4 | 1.8B | 128 | 32 | 128 | 16384 | 109863.2813 | ? |
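For completeness, here is a minimal sketch of that tokens-to-steps arithmetic (the function and variable names are my own, not taken from the repo's scripts):

```python
# Sketch of the tokens-to-steps conversion used in the table above.
# Names are illustrative only, not taken from the training scripts.
def required_steps(max_tokens: float, global_batch_size: int, seq_len: int) -> float:
    tokens_per_global_batch = global_batch_size * seq_len
    return max_tokens / tokens_per_global_batch

for num_gpus in (1, 2, 4):
    global_batch_size = 32 * num_gpus  # base batch of 32 per GPU
    print(num_gpus, required_steps(1.8e9, global_batch_size, 128))
# -> 439453.125, 219726.5625, 109863.28125 steps respectively
```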
Comparing with the base_wiki103 config from the original repo
(they used only data parallel), we get:
| config | num gpus | tokens | seq len | base batch | global batch size | tokens per batch | steps | PPL |
|---|---|---|---|---|---|---|---|---|
| original-base-wt103 | don't care | 1.92B | 150 | don't care | 64 | 9600 | 200000 | 24 |
=> They trained on many more tokens.
If your results are really comparable, the model you present here is worth using as a baseline for future Transformer-XL experiments because it is faster. Right?
I don't have easily accessible perplexity results right now; I will update once I get logging infrastructure in place. But generally, switching from DP to DDP gave about a 30% increase in throughput, so even with the same hyper-parameters you get a speed-up compared to the original version.
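For context, here is a minimal sketch of what that DP-to-DDP switch looks like in PyTorch. This is not the repo's actual training script, just an illustration: the model builder is a placeholder, and the launch command assumes one process per GPU via `torchrun`.

```python
# Minimal DP -> DDP sketch (illustrative only; not this repo's training code).
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_model() -> torch.nn.Module:
    # Placeholder for the real Transformer-XL model construction.
    return torch.nn.Linear(512, 512)

def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    dist.init_process_group(backend="nccl")      # one process per GPU
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)
    # Old single-process approach, with per-step scatter/gather overhead:
    #   model = torch.nn.DataParallel(model)
    # DDP instead overlaps the gradient all-reduce with the backward pass:
    model = DDP(model, device_ids=[local_rank])

    # ... optimizer, DistributedSampler-based data loading, training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```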