vulkan: improve im2col and AMD RX 5700XT performance #11826

daniandtheweb · 2025-02-12T13:52:03Z

This PR supersedes #11778.
Here are the performance numbers on my Radeon RX 5700XT (RADV).

Vulkan:

Master:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    95.82 us/run -    10244 kB/run -  101.96 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   639.72 us/run -    40964 kB/run -   61.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 79006.56 us/run -   655364 kB/run -    7.93 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1312 runs -   900.50 us/run -   102445 kB/run -  108.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 20310.68 us/run -   409645 kB/run -   19.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   236.25 us/run -    23536 kB/run -   95.01 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  3518.08 us/run -   100208 kB/run -   27.17 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 313087.20 us/run -  1678448 kB/run -    5.12 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      572 runs -  2263.98 us/run -   235365 kB/run -   99.18 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 71241.88 us/run -  1002085 kB/run -   13.43 GB/s

PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              19656 runs -    58.61 us/run -    10244 kB/run -  166.70 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   664.67 us/run -    40964 kB/run -   58.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 79993.31 us/run -   655364 kB/run -    7.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1968 runs -   602.13 us/run -   102445 kB/run -  162.31 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 17203.68 us/run -   409645 kB/run -   22.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               5704 runs -   227.95 us/run -    23536 kB/run -   98.47 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                670 runs -  2895.43 us/run -   100208 kB/run -   33.01 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 315784.75 us/run -  1678448 kB/run -    5.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      572 runs -  2218.08 us/run -   235365 kB/run -  101.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 72422.56 us/run -  1002085 kB/run -   13.21 GB/s

HIP:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               3276 runs -   923.04 us/run -    10244 kB/run -   10.58 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  3359.14 us/run -    40964 kB/run -   11.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 58759.37 us/run -   655364 kB/run -   10.66 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      328 runs -  9271.95 us/run -   102445 kB/run -   10.54 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 79109.07 us/run -   409645 kB/run -    4.94 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1426 runs -  1077.72 us/run -    23536 kB/run -   20.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  9038.40 us/run -   100208 kB/run -   10.57 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 160059.45 us/run -  1678448 kB/run -   10.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      143 runs - 12785.92 us/run -   235365 kB/run -   17.56 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 120635.09 us/run -  1002085 kB/run -    7.93 GB/s

I'm also including benchmark results with RADV_PERFTEST=cswave32. Interestingly, although this variable hurts im2col performance, it improves speed in stable-diffusion.cpp (SD 1.5, 512x512: 1.38 it/s without the PR, 1.45 it/s with the PR, 1.55 it/s with the PR + cswave32).

Vulkan cswave32:

Master:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    94.26 us/run -    10244 kB/run -  103.65 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  1426.19 us/run -    40964 kB/run -   27.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 122464.00 us/run -   655364 kB/run -    5.11 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1240.23 us/run -   102445 kB/run -   78.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 67429.80 us/run -   409645 kB/run -    5.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   325.81 us/run -    23536 kB/run -   68.89 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  7415.25 us/run -   100208 kB/run -   12.89 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 254549.75 us/run -  1678448 kB/run -    6.30 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4823.12 us/run -   235365 kB/run -   46.55 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 358562.76 us/run -  1002085 kB/run -    2.67 GB/s

PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    81.44 us/run -    10244 kB/run -  119.98 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  1439.59 us/run -    40964 kB/run -   27.14 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 135311.06 us/run -   655364 kB/run -    4.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1191.29 us/run -   102445 kB/run -   82.04 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 62476.68 us/run -   409645 kB/run -    6.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   315.84 us/run -    23536 kB/run -   71.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  7497.88 us/run -   100208 kB/run -   12.75 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 254943.60 us/run -  1678448 kB/run -    6.29 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4767.25 us/run -   235365 kB/run -   47.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 358791.47 us/run -  1002085 kB/run -    2.67 GB/s

0cc4m · 2025-02-13T10:10:31Z

If cswave32 helps you, maybe check whether setting requiredSubgroupSize for the shader does the same thing.

daniandtheweb · 2025-02-13T11:20:17Z

That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col.

0cc4m · 2025-02-13T13:04:16Z

> That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col.

Real-world performance (like sd.cpp benchmarks) is more important than GB/s in test-backend-ops perf. The dimensions in real use are probably different, and there are other factors at play (for example, graph execution instead of a single op being repeated).

daniandtheweb · 2025-02-13T17:29:56Z

I've been trying to use the VK_EXT_subgroup_size_control extension in GLSL to set the requiredSubgroupSize, but I can't get it to work. Is the right approach to enable the "GL_EXT_subgroup_size_control" extension and then append requiredSubgroupSize to the layout in which local_size_x_id is set, or am I using the extension wrongly?

What I'm trying to do is set the subgroup size on this specific shader, or at least apply the change to all the shaders through ggml-vulkan.cpp, without having to rely on the RADV_PERFTEST environment variable, but so far I can't find a way to do that.

0cc4m · 2025-02-13T18:23:25Z

Oh sorry, I thought you knew that the extension is already implemented. You can set values when loading pipelines:

llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp

Lines 1550 to 1552 in 8a8c4ce

    auto const &ggml_vk_create_pipeline = [&](vk_device& device, vk_pipeline& pipeline, const std::string &name, size_t spv_size, const void* spv_data, const std::string &entrypoint,
                                              uint32_t parameter_count, uint32_t push_constant_size, std::array<uint32_t, 3> wg_denoms, const std::vector<uint32_t>& specialization_constants,
                                              uint32_t align, bool disable_robustness = false, bool require_full_subgroups = false, uint32_t required_subgroup_size = 0) {
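
As a rough illustration of what a value passed as `required_subgroup_size` has to satisfy, the sketch below checks a requested size against the limits a device reports through VK_EXT_subgroup_size_control (`minSubgroupSize` and `maxSubgroupSize`; the required size must also be a power of two). The `SubgroupLimits` struct and `subgroup_size_ok` function are hypothetical and are not part of ggml-vulkan.cpp. Treating 0 as "no requirement" is an assumption based on the default argument in the signature above, and the 32 to 64 range mentioned below for RDNA is likewise an assumption.

```cpp
#include <cstdint>

// Hypothetical stand-in for the min/max subgroup sizes reported by the
// device via VK_EXT_subgroup_size_control.
struct SubgroupLimits {
    uint32_t min_size;
    uint32_t max_size;
};

// Returns true if `required` could be requested for a pipeline.
// Assumption: 0 means "no requirement", so the driver picks the size.
bool subgroup_size_ok(uint32_t required, const SubgroupLimits& limits) {
    if (required == 0) {
        return true;
    }
    const bool power_of_two = (required & (required - 1)) == 0;
    return power_of_two && required >= limits.min_size && required <= limits.max_size;
}
```

On a device that reports a 32 to 64 range (assumed here for RDNA), both 32 and 64 would pass, while 16 or 128 would be rejected. A real implementation would also need to confirm that the compute stage is listed in the device's `requiredSubgroupSizeStages`.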

jeffbolznv · 2025-02-13T18:42:05Z

Perf on rtx 4070:

before
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              36036 runs -    30.08 us/run -    10244 kB/run -  324.76 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               6560 runs -   153.85 us/run -    40964 kB/run -  253.95 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8152.89 us/run -   655364 kB/run -   76.81 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3608 runs -   287.17 us/run -   102445 kB/run -  340.32 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      574 runs -  1961.57 us/run -   409645 kB/run -  199.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   278.76 us/run -    23536 kB/run -   80.52 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1312.75 us/run -   100208 kB/run -   72.81 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 37220.75 us/run -  1678448 kB/run -   43.09 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2852.16 us/run -   235365 kB/run -   78.72 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 21740.88 us/run -  1002085 kB/run -   44.01 GB/s
  
after
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              36036 runs -    28.20 us/run -    10244 kB/run -  346.41 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               7380 runs -   146.53 us/run -    40964 kB/run -  266.65 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8175.81 us/run -   655364 kB/run -   76.59 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3936 runs -   263.05 us/run -   102445 kB/run -  371.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      574 runs -  1958.18 us/run -   409645 kB/run -  199.75 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   280.32 us/run -    23536 kB/run -   80.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1320.69 us/run -   100208 kB/run -   72.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 37208.22 us/run -  1678448 kB/run -   43.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2866.03 us/run -   235365 kB/run -   78.34 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 21979.57 us/run -  1002085 kB/run -   43.53 GB/s

daniandtheweb · 2025-02-13T22:00:15Z

Thanks a lot for the information, I've been trying to implement it directly in the shader not knowing it was already implemented in the main ggml-vulkan.cpp code.

I did some tests by manually setting the subgroup size to 32, and I can reproduce the results I had using the RADV environment variable.
I also tested creating a second pipeline that uses a subgroup size of 64 and used it only for im2col, which gave another speedup in stable-diffusion (1.55 it/s with PR + subgroup 32 vs 1.59 it/s with PR + mixed subgroups). Besides im2col, other operations I'm currently testing also seem to run faster with subgroup 64 than with 32. Results so far: IM2COL 64; MUL_MAT mixed, some cases faster on 32 and some on 64; SOFT_MAX 64; ADD 32; CPY 32.

Do you think it could be a good idea creating two pipelines (1 with subgroup 64 and 1 with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs.

daniandtheweb · 2025-02-13T23:08:19Z

By setting some pipelines to subgroup 64 and others to subgroup 32, I get good performance gains on Stable Diffusion XL (1024x1024, 20 steps, tiled VAE decode): stock 108.73 s, PR 107.59 s, PR + wave 32 105.66 s, PR + mixed pipelines 100.99 s.

0cc4m · 2025-02-14T07:36:47Z

> Do you think it could be a good idea creating two pipelines (1 with subgroup 64 and 1 with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs.

If there is a specific pipeline where it gives a significant advantage on RDNA, you could do that. Otherwise, just globally set the pipeline to 32 or 64 for RDNA, depending on what is better, and leave it alone for other vendors or older AMD.

daniandtheweb · 2025-02-14T16:34:37Z

With the latest changes, subgroup 32 is used when an RX 5700 is detected, and subgroup 64 is forced for IM2COL. Other operations, such as softmax, may be faster with subgroup 64, but since they make no difference in real-world usage I didn't include them.

With this approach + mesa-git, the Stable Diffusion XL run (1024x1024, 20 steps) went down from the original 108 s to 98 s.

I'm still not sure if the ggml_vk_create_pipeline_64 used in im2col is the best approach (I really don't like how I had to duplicate that part of the code) but I couldn't figure out a better way to do it.

daniandtheweb · 2025-02-14T21:31:59Z

I've pushed the latest changes, which remove the awful code duplication and instead introduce a helper function that looks up a subgroup size override in a map and applies it if one exists. I'm not sure whether this would be useful on other GPUs (I wonder whether other AMD GPUs or Intel ones might benefit from overriding certain subgroup sizes).

@0cc4m if you think this looks good I think that the PR is ready.
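
The map-based helper described in this comment could look roughly like the sketch below. This is not the code from the PR: the type alias, the function name, and the pipeline names used as keys are all hypothetical, and the only behavior taken from the thread is "use the override if the map has one, otherwise fall back to the device-wide default".

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-pipeline overrides, keyed by pipeline name.
using SubgroupOverrides = std::unordered_map<std::string, uint32_t>;

// Returns the override for `pipeline_name` if one exists, otherwise
// `default_size` (for example 32 on the RX 5700, or 0 for "driver decides").
uint32_t pick_subgroup_size(const SubgroupOverrides& overrides,
                            const std::string& pipeline_name,
                            uint32_t default_size) {
    const auto it = overrides.find(pipeline_name);
    return it != overrides.end() ? it->second : default_size;
}
```

With a default of 32 and a single entry mapping a hypothetical `"im2col_f32"` key to 64, the helper returns 64 for that pipeline and 32 for every other name. Keeping the overrides in one table means a new per-GPU exception is a data change rather than another duplicated pipeline-creation path.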

vulkan: improve im2col performance

62733f2

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 12, 2025

0cc4m self-requested a review February 13, 2025 10:07

Force subgroup 32 on RX 5700 and subgroup 64 for im2col

35f6369

Fixed uint

293edef

daniandtheweb changed the title ~~vulkan: improve im2col performance~~ vulkan: improve im2col and AMD RX 5700XT performance Feb 14, 2025

daniandtheweb force-pushed the vk-shader-optimizations-1 branch 2 times, most recently from 9205de6 to a4ef6dd Compare February 14, 2025 21:22

Helper function to set subgroup size

14ea4fa

daniandtheweb force-pushed the vk-shader-optimizations-1 branch from a4ef6dd to 14ea4fa Compare February 14, 2025 21:22

Optimize soft_max

04100e8

daniandtheweb force-pushed the vk-shader-optimizations-1 branch from d4ba722 to 04100e8 Compare February 14, 2025 21:47
