
Performance Discrepancy in FP8 vs. BF16 Training with NanoGPT #1416

Open · wzzll123 opened this issue Jan 21, 2025 · 2 comments

wzzll123 commented Jan 21, 2025
Hello,

I am using the NanoGPT-FP8 repository to enable FP8 training for the NanoGPT project. While testing a 774M-parameter model on an 8x H800 setup, I noticed that FP8 training is roughly 2x slower than BF16 training under the following conditions:

  • Batch size: 16
  • Gradient accumulation steps: 4
  • Model configuration:
    • Layers: 36
    • Heads: 20
    • Embedding size: 1280
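
That corresponds roughly to the following nanoGPT model arguments (a sketch; n_layer, n_head, and n_embd come from the list above, while block_size and vocab_size are assumed nanoGPT defaults):

model_args = dict(
    n_layer=36,
    n_head=20,
    n_embd=1280,       # ~774M parameters, i.e. GPT-2 large scale
    block_size=1024,   # assumed: nanoGPT's default context length
    vocab_size=50304,  # assumed: nanoGPT's padded GPT-2 vocabulary
)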

For context, here’s how FP8 is implemented in the project:

import contextlib
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID: E4M3 for forward tensors, E5M2 for gradients
fp8_format = Format.HYBRID
fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo="max")

@contextlib.contextmanager
def ctx():
    # FP8 for Transformer Engine modules, BF16 autocast for everything else
    # (device_type is defined elsewhere in train.py, e.g. "cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        with torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16):
            yield
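
For reference, the context manager wraps the forward pass in the training loop roughly as follows (a minimal sketch; the names model, X, Y, optimizer, and gradient_accumulation_steps follow nanoGPT's train.py and are assumed here):

# Sketch only: forward pass under the FP8/BF16 autocast, backward pass outside it
for micro_step in range(gradient_accumulation_steps):
    with ctx():
        logits, loss = model(X, Y)
        loss = loss / gradient_accumulation_steps  # average over accumulation steps
    loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)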

Observations:

  1. Multi-GPU setup (8x H800):
    • BF16: ~1s per step
    • FP8: ~2.4s per step
  2. Single-GPU setup (H800):
    • BF16: ~0.8s per step
    • FP8: ~0.6s per step

The single-GPU performance suggests that FP8 is faster than BF16 on one GPU, but the multi-GPU setup exhibits a significant slowdown for FP8. This makes me suspect that the performance difference is communication-related.

Is this performance difference expected? Are there any optimizations or strategies that could help mitigate this?

Looking forward to your insights.

ptrendx (Member) commented Jan 24, 2025

Considering that in your setup both precisions are worse when run with multiple GPUs, the most probable reason is that the problem is just too small: the overheads from casting, communication, and the CPU cost of launching more kernels are larger than the time the GPUs spend on the actual work.
Depending on which of those overheads is the culprit, you could try different strategies, e.g. increasing the per-GPU workload or capturing the iteration with CUDA graphs to reduce launch overhead.
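
One way to see which overhead dominates is to profile a few iterations with torch.profiler and compare CUDA kernel time, NCCL communication time, and CPU launch time. A minimal sketch, assuming a hypothetical train_step() function that runs one forward/backward/optimizer step:

import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Skip the first step, warm up for two, then record three steps.
prof_schedule = schedule(wait=1, warmup=2, active=3, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
) as prof:
    for _ in range(8):
        train_step()   # hypothetical: one forward/backward/optimizer step under ctx()
        prof.step()    # advance the profiler schedule

# Large all_reduce or kernel-launch times point at communication or CPU
# launch overhead rather than GEMM time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))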

wzzll123 (Author) commented

Hi @ptrendx,

Thank you for your reply!

Just to clarify: although I said the batch size is 16, that means each batch contains 16 * 1024 tokens. If I increase it to 32, my GPU runs out of memory.

I'll explore CUDA graphs in the coming days (I’m not very familiar with them yet).

Thanks again!
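
For background on the CUDA graphs suggestion, below is a minimal sketch of whole-iteration graph capture with plain PyTorch (torch.cuda.CUDAGraph). It is a generic illustration only: model and optimizer are placeholders, the optimizer must support capture (e.g. Adam(capturable=True)), input shapes must stay fixed, and capturing FP8 execution through Transformer Engine likely needs its own dedicated graph utilities rather than this plain pattern.

import torch

# Assumptions: model and optimizer already exist; static input buffers hold
# a fixed-shape batch of 16 sequences of 1024 token ids.
static_x = torch.randint(0, 50304, (16, 1024), device="cuda")
static_y = torch.randint(0, 50304, (16, 1024), device="cuda")

# Warm up on a side stream so autotuning and allocations happen outside capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        _, loss = model(static_x, static_y)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training iteration into a graph.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    _, static_loss = model(static_x, static_y)
    static_loss.backward()
    optimizer.step()

# Per training step: copy fresh data into the static buffers and replay.
# static_x.copy_(X); static_y.copy_(Y); g.replay()

Replaying the captured graph collapses the per-kernel CPU launch cost into a single call, which targets one of the overheads mentioned above.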
