
Performance Discrepancy in FP8 vs. BF16 Training with NanoGPT #1416

Open · wzzll123 opened this issue Jan 21, 2025 · 2 comments

wzzll123 commented Jan 21, 2025
Hello,

I am using the NanoGPT-FP8 repository to enable FP8 training for the NanoGPT project. While testing a 774M-parameter model on an 8x H800 setup, I noticed that FP8 training is roughly 2x slower than BF16 training under the following conditions:

  • Batch size: 16
  • Gradient accumulation steps: 4
  • Model configuration:
    • Layers: 36
    • Heads: 20
    • Embedding size: 1280
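
That corresponds roughly to the following nanoGPT model arguments (a sketch; n_layer, n_head, and n_embd come from the list above, while block_size and vocab_size are assumed nanoGPT defaults):

model_args = dict(
    n_layer=36,
    n_head=20,
    n_embd=1280,       # ~774M parameters, i.e. GPT-2 large scale
    block_size=1024,   # assumed: nanoGPT's default context length
    vocab_size=50304,  # assumed: nanoGPT's padded GPT-2 vocabulary
)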

For context, here’s how FP8 is implemented in the project:

import contextlib
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID: E4M3 for forward tensors, E5M2 for gradients
fp8_format = Format.HYBRID
fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo="max")

@contextlib.contextmanager
def ctx():
    # FP8 for Transformer Engine modules, BF16 autocast for everything else
    # (device_type is defined elsewhere in train.py, e.g. "cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        with torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16):
            yield
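
For reference, the context manager wraps the forward pass in the training loop roughly as follows (a minimal sketch; the names model, X, Y, optimizer, and gradient_accumulation_steps follow nanoGPT's train.py and are assumed here):

# Sketch only: forward pass under the FP8/BF16 autocast, backward pass outside it
for micro_step in range(gradient_accumulation_steps):
    with ctx():
        logits, loss = model(X, Y)
        loss = loss / gradient_accumulation_steps  # average over accumulation steps
    loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)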

Observations:

  1. Multi-GPU setup (8x H800):
    • BF16: ~1s per step
    • FP8: ~2.4s per step
  2. Single-GPU setup (H800):
    • BF16: ~0.8s per step
    • FP8: ~0.6s per step

The single-GPU performance suggests that FP8 is faster than BF16 on one GPU, but the multi-GPU setup exhibits a significant slowdown for FP8. This makes me suspect that the performance difference is communication-related.

Is this performance difference expected? Are there any optimizations or strategies that could help mitigate this?

Looking forward to your insights.

ptrendx (Member) commented Jan 24, 2025

Considering that in your setup both precisions are worse when run with multiple GPUs, the most probable reason is that the problem is just too small: the overheads from casting, communication, and the CPU cost of launching more kernels are larger than the time the GPUs spend on the actual work.
Depending on which of those overheads is the culprit, you could try different strategies, e.g. increasing the per-GPU workload or capturing the iteration with CUDA graphs to reduce launch overhead.
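
One way to see which overhead dominates is to profile a few iterations with torch.profiler and compare CUDA kernel time, NCCL communication time, and CPU launch time. A minimal sketch, assuming a hypothetical train_step() function that runs one forward/backward/optimizer step:

import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Skip the first step, warm up for two, then record three steps.
prof_schedule = schedule(wait=1, warmup=2, active=3, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
) as prof:
    for _ in range(8):
        train_step()   # hypothetical: one forward/backward/optimizer step under ctx()
        prof.step()    # advance the profiler schedule

# Large all_reduce or kernel-launch times point at communication or CPU
# launch overhead rather than GEMM time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))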

wzzll123 (Author) commented

Hi @ptrendx,

Thank you for your reply!

Just to clarify: although I said the batch size is 16, that means each batch contains 16 * 1024 tokens. If I increase it to 32, my GPU runs out of memory.

I'll explore CUDA graphs in the coming days (I’m not very familiar with them yet).

Thanks again!
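
For background on the CUDA graphs suggestion, below is a minimal sketch of whole-iteration graph capture with plain PyTorch (torch.cuda.CUDAGraph). It is a generic illustration only: model and optimizer are placeholders, the optimizer must support capture (e.g. Adam(capturable=True)), input shapes must stay fixed, and capturing FP8 execution through Transformer Engine likely needs its own dedicated graph utilities rather than this plain pattern.

import torch

# Assumptions: model and optimizer already exist; static input buffers hold
# a fixed-shape batch of 16 sequences of 1024 token ids.
static_x = torch.randint(0, 50304, (16, 1024), device="cuda")
static_y = torch.randint(0, 50304, (16, 1024), device="cuda")

# Warm up on a side stream so autotuning and allocations happen outside capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        _, loss = model(static_x, static_y)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training iteration into a graph.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    _, static_loss = model(static_x, static_y)
    static_loss.backward()
    optimizer.step()

# Per training step: copy fresh data into the static buffers and replay.
# static_x.copy_(X); static_y.copy_(Y); g.replay()

Replaying the captured graph collapses the per-kernel CPU launch cost into a single call, which targets one of the overheads mentioned above.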
