A100 vs MI250X conv performance #3310
Hi @etiennemlb, naive kernels are the last resort when none of the other kernels are applicable. Could you provide more information about the tensor sizes? It would also be useful to have a minimal reproducer.
Hello @averinevg. Thanks a lot for your reply! I'm @etiennemlb's colleague. We have prepared a minimal code example to reproduce the speed differences we observed, along with profiling results for various configurations. These details are available via the link below. I’ve also included the code here for your convenience.
We found that the MI250X is around 7 times slower than the A100 when using 6 CNN layers on our input tensor shapes. Many thanks in advance for your help!
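(For readers without access to the linked reproducer, the sketch below shows the general shape of such a benchmark: a stack of six convolution layers timed in half precision. The layer widths and input shape are illustrative placeholders, not the values from the original report.)

```python
# Minimal sketch of a conv-stack timing benchmark (hypothetical shapes).
import time
import torch
import torch.nn as nn

device = torch.device("cuda")  # ROCm builds of PyTorch also expose the "cuda" device

# Six stacked convolution layers, run in half precision with NCHW layout.
model = nn.Sequential(
    *[nn.Conv2d(64 if i else 3, 64, kernel_size=3, padding=1) for i in range(6)]
).to(device).half()

x = torch.randn(8, 3, 224, 224, device=device, dtype=torch.half)

# Warm-up iterations so kernel selection and compilation are excluded from timing.
for _ in range(5):
    model(x)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(50):
    model(x)
torch.cuda.synchronize()
print(f"mean forward time: {(time.perf_counter() - start) / 50 * 1e3:.2f} ms")
```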
@averinevg Hey, have you had time to take a look? The slowdown in the example above makes using MI250X cards impractical.
Hi @formiel and @etiennemlb, can you provide more information on your hardware and software (OS version, ROCm version), along with the rocminfo output?
We run RHEL 8 with ROCm 5.7.1 or 6.0.0, on HPE Cray bardpeak nodes equipped with 4 MI250X (8 GCDs). If you are familiar with Frontier's nodes, that's exactly it. Using containers is a no-go for us on this machine.
Hi @etiennemlb, have you tried upgrading ROCm to the latest release (https://rocm.docs.amd.com/en/latest/about/release-notes.html) and enabling logging (https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/debug-log.html)? Thanks.
@huanrwan-amd We are experiencing this on ROCm 6.2.4 as well.
Hi @etiennemlb and @formiel, any updates after upgrading ROCm? Please also make sure you are using the PyTorch ROCm stack: https://github.com/ROCm/pytorch. In general, MIOpen benchmarks various kernels to select the most efficient one for a given operation. If a naive kernel is being selected, it might indicate that the other kernels are not applicable or not performing well for your specific configuration. Naive convolution kernels are typically used as a fallback when more optimized kernels are not applicable; they are generally less efficient and can lead to slower performance. You can find more information at: https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/finddb.html
Hello @etiennemlb, one thing you can do is run your model once with the following set in your environment before you launch PyTorch: export MIOPEN_FIND_ENFORCE=4. Doing this once will exhaustively test each convolution to find the best solver choice and store it, so that on subsequent runs that choice is reused. This should ensure that if there is a non-naive solver for your convolution shape, it will be used going forward. If it still uses a naive kernel, I would love to hear it so we can investigate further.
Note that you can also set export MIOPEN_ENABLE_LOGGING_CMD=1 to have MIOpen output MIOpenDriver commands; if you then send us those commands, we can more easily reproduce and verify the problem without needing the extra PyTorch layer.
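(A sketch of how both suggestions above can be combined from Python: the variables are set before torch is imported so MIOpen sees them, and the first run performs the exhaustive search. The model and input below are placeholders for the actual workload.)

```python
import os

# Set before importing torch so MIOpen picks these up.
os.environ["MIOPEN_FIND_ENFORCE"] = "4"        # exhaustively search and store the best solver per convolution
os.environ["MIOPEN_ENABLE_LOGGING_CMD"] = "1"  # log equivalent MIOpenDriver commands

import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Conv2d(3, 64, kernel_size=3, padding=1).to(device).half()  # placeholder model
x = torch.randn(8, 3, 224, 224, device=device, dtype=torch.half)      # placeholder input

model(x)                    # first run triggers the exhaustive search (can take a while)
torch.cuda.synchronize()
# Subsequent runs, without MIOPEN_FIND_ENFORCE, reuse the stored solver choices.
```

The same effect can be had with plain shell exports as described above; the key point is that the variables must be in the environment before the first convolution executes.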
I would like to inquire about the performance of two kernels:
naive_conv_nonpacked_bwd_nchw_half_double_half
naive_conv_nonpacked_fwd_nchw_half_double_half
When are these used when we call miopen_convolution_forward? I have a PyTorch model that is 6.4x slower on MI250X compared to the A100.
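(For illustration, the configuration implied by those kernel names, a half-precision NCHW convolution and its backward pass, can be reproduced with something like the sketch below; the shapes here are hypothetical. Whether MIOpen dispatches such a call to a naive or an optimized solver depends on the exact shape and the installed ROCm/MIOpen version.)

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# Half-precision NCHW convolution; on ROCm builds of PyTorch this goes through
# the MIOpen convolution path mentioned above.
conv = nn.Conv2d(80, 256, kernel_size=3, stride=2, padding=1).to(device).half()
x = torch.randn(16, 80, 500, 80, device=device, dtype=torch.half)  # illustrative shape

out = conv(x)          # forward convolution
out.sum().backward()   # exercises the corresponding backward (gradient) kernels
```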