I am using the NanoGPT-FP8 repository to enable FP8 training for the NanoGPT project. While testing on an 8x H800 setup with a 774M model, I noticed that FP8 training is roughly 2x slower than BF16 training under the following conditions:
Batch size: 16
Gradient accumulation steps: 4
Model configuration:
Layers: 36
Heads: 20
Embedding size: 1280
For context, here’s how FP8 is implemented in the project:
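Roughly, the FP8 path swaps the model's Linear layers for Transformer Engine modules and runs the forward pass under fp8_autocast with a delayed-scaling recipe. The snippet below is a minimal sketch of that standard Transformer Engine pattern, not the repository's exact code; the module sizes and recipe settings are illustrative.

```python
# Minimal sketch (not the repository's exact code): FP8 GEMMs via
# Transformer Engine. te.Linear replaces nn.Linear and the forward pass
# runs inside fp8_autocast with a delayed-scaling recipe.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,      # E4M3 for forward tensors, E5M2 for gradients
    amax_history_len=16,
    amax_compute_algo="max",
)

# Hypothetical MLP standing in for one transformer block's feed-forward.
mlp = torch.nn.Sequential(
    te.Linear(1280, 4 * 1280, params_dtype=torch.bfloat16),
    torch.nn.GELU(),
    te.Linear(4 * 1280, 1280, params_dtype=torch.bfloat16),
).cuda()

x = torch.randn(16, 1024, 1280, device="cuda", dtype=torch.bfloat16)

# GEMMs inside this context run in FP8; everything else stays in BF16.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = mlp(x)
```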
On a single GPU, FP8 is faster than BF16, but in the multi-GPU setup FP8 shows a significant slowdown. This makes me suspect that the performance difference is communication-related.
Is this performance difference expected? Are there any optimizations or strategies that could help mitigate this?
Looking forward to your insights.
Considering that in your setup both precisions get worse when run with multiple GPUs, the most probable reason is that the problem is just too small: the overheads (casting, communication, the CPU cost of launching more kernels, etc.) are larger than the time the GPUs spend on the actual work.
Depending on which of those overheads is the culprit, you could try different strategies (the profiling sketch after this list is one way to check which it is):
if you can increase the workload, that would be the easiest fix (e.g. by increasing the batch size per GPU)
if the CPU overhead of launching kernels is the culprit, CUDA graphs can help reduce it
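One way to see which overhead actually dominates is to capture a few training steps with torch.profiler and compare the time spent in NCCL collectives (communication), GEMM kernels (compute), and CPU-side gaps between kernels (launch overhead). Below is a minimal sketch; model, optimizer, get_batch, and train_step are placeholders standing in for your own training loop.

```python
# Sketch: profile a few steps to see whether NCCL communication, GEMM compute,
# or CPU launch overhead dominates. `model`, `optimizer`, and `get_batch` are
# placeholders for your own training objects.
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def train_step(model, optimizer, batch):
    loss = model(batch).mean()              # stand-in for your real loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

def profile_steps(model, optimizer, get_batch, num_steps=8):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=num_steps - 3),
    ) as prof:
        for _ in range(num_steps):
            train_step(model, optimizer, get_batch())
            prof.step()
    # Communication shows up as nccl* kernels, compute as GEMM kernels;
    # large gaps between CUDA kernels point at CPU launch overhead.
    print(prof.key_averages().table(sort_by="self_cuda_time_total",
                                    row_limit=25))
```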
Just to clarify: although I said the batch size is 16, each batch actually contains 16 * 1024 tokens. If I increase the batch size to 32, the GPUs run out of memory.
I'll explore CUDA graphs in the coming days (I’m not very familiar with them yet).
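In case it helps: the idea behind CUDA graphs is to capture a fixed sequence of kernel launches once and then replay it each step, so the per-kernel CPU launch cost largely disappears. Below is a minimal sketch using PyTorch's torch.cuda.make_graphed_callables, where the block is a hypothetical stand-in for one model layer and the input shapes must stay static between steps. I believe Transformer Engine also ships its own FP8-aware make_graphed_callables wrapper, which may be the more natural entry point for this repo.

```python
# Sketch: reduce CPU launch overhead by capturing a block's forward/backward
# into CUDA graphs with torch.cuda.make_graphed_callables. Shapes and dtypes
# must be static; `block` is a hypothetical stand-in for one model layer.
import torch

block = torch.nn.Sequential(
    torch.nn.Linear(1280, 4 * 1280),
    torch.nn.GELU(),
    torch.nn.Linear(4 * 1280, 1280),
).cuda().to(torch.bfloat16)

# Static sample input with the same shape/dtype as the real training batches.
sample = torch.randn(16, 1024, 1280, device="cuda", dtype=torch.bfloat16,
                     requires_grad=True)

# Warms up on a side stream, then captures forward and backward graphs.
graphed_block = torch.cuda.make_graphed_callables(block, (sample,))

x = torch.randn(16, 1024, 1280, device="cuda", dtype=torch.bfloat16,
                requires_grad=True)
y = graphed_block(x)    # replays the captured forward graph
y.sum().backward()      # replays the captured backward graph
```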