-
Notifications
You must be signed in to change notification settings - Fork 10.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vulkan: improve im2col and AMD RX 5700XT performance #11826
base: master
Are you sure you want to change the base?
vulkan: improve im2col and AMD RX 5700XT performance #11826
Conversation
If cswave32 helps you, maybe look into if setting the |
That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col. |
Real world performance (like sd.cpp benchmarks) are more important than GB/s in test-backend-ops perf. The dimensions of the real use are probably different, and there's other factors at play (for example graph execution instead of a single op getting repeated). |
I've been trying to use the VK_EXT_subgroup_size_control extension in GLSL to set the What I'm trying to do is to set the subgroupsize on this specific shader or at least apply the change to all the shaders through ggml-vulkan.cpp without having to rely on the RADV_PERFTEST env but for now I can't find a way to do that. |
Oh sorry, I thought you knew that the extension is already implemented. You can set values when loading pipelines: llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp Lines 1550 to 1552 in 8a8c4ce
|
Perf on rtx 4070:
|
Thanks a lot for the information, I've been trying to implement it directly in the shader not knowing it was already implemented in the main ggml-vulkan.cpp code. I did some tests by manually setting the subgroup size for 32 and I can recreate the results I had using the RADV env. Do you think it could be a good idea creating two pipelines (1 with subgroup 64 and 1 with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs. |
Setting some pipelines to subgroup 64 and some to subgroup 32 I can get some good performance gains on stable diffusion xl 1024x1024 20 steps with tiled vae decode: stock 108.73 s, PR 107.59 s, PR + wave 32 105.66 s, PR + mixed pipelines 100.99 s. |
If there is a specific pipeline where it gives a significant advantage on RDNA, you could do that. Otherwise, just globally set the pipeline to 32 or 64 for RDNA, depending on what is better, and leave it alone for other vendors or older AMD. |
With the latest changes RX 5700 uses subgroup 32 when it's detected and forces subgroup 64 on IM2COL (other operations may be faster on subgroup 64 like softmax but since they don't make any difference in real world usage I didn't include them). With this approach + mesa-git the stable diffusion xl 1024x1024 20 steps went down to 98 s from the original 108 s. I'm still not sure if the ggml_vk_create_pipeline_64 used in im2col is the best approach (I really don't like how I had to duplicate that part of the code) but I couldn't figure out a better way to do it. |
9205de6
to
a4ef6dd
Compare
a4ef6dd
to
14ea4fa
Compare
I've pushed the last changes in which I remove the awful code duplication and instead introduce a helper function that checks if there's an override for the subgroup size from a map and sets it accordingly. I'm not sure if this could be useful or not on other GPUs (I wonder if maybe other AMD gpus or Intel ones may get some benefits from overriding certain subgroups). @0cc4m if you think this looks good I think that the PR is ready. |
d4ba722
to
04100e8
Compare
This PR supersedes #11778.
Here's the performance numbers on my Radeon RX 5700XT (RADV).
Vulkan
:HIP
:I'm also including the benchmark results when using
RADV_PERFTEST=cswave32
. It's interesting to note that despite this variable hurting this specific operation's performance it actually improves the speed in stable-diffusion.cpp (sd 1.5 512x512: without PR 1.38 it/s, with PR 1.45 it/s, with PR + cswave32 1.55 it/s).Vulkan cswave32
: