
CUDA: use arch list for compatibility check #11775

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #10318 (comment).

The problem is that by default the code is compiled for compute capabilities 5.2, 6.1, 7.0, and 7.5. A GP100 has compute capability 6.0, the minimum for FP16 intrinsics. The host code reports that it can do MMV with those intrinsics, but without GGML_CUDA_F16 there is no actual device code available. This PR is more of a band-aid fix that simply makes GPUs with compute capability 6.0 use FP32 arithmetic if the code was not compiled with GGML_CUDA_F16. Medium-term I intend to revise the handling of these intrinsics and will do a proper fix at that time.
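
For illustration, a minimal sketch of the band-aid described above (hypothetical helper name mmv_use_fp16 and code shape assumed; not the actual diff):

```cpp
// Hypothetical sketch, not the actual diff: only take the FP16 MMV path when
// device code for it was actually built, i.e. when GGML_CUDA_F16 was defined.
static bool mmv_use_fp16(const int cc) { // cc = device compute capability, e.g. 600 for GP100
#ifdef GGML_CUDA_F16
    return cc >= 600; // compute capability 6.0 is the minimum for FP16 intrinsics
#else
    (void) cc;        // without GGML_CUDA_F16 there is no FP16 device code,
    return false;     // so fall back to FP32 arithmetic
#endif
}
```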

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 9, 2025
@slaren
Collaborator

slaren commented Feb 9, 2025

Wouldn't it be possible to check whether the architecture is in __CUDA_ARCH_LIST__? Maybe we should do that more often instead of assuming that certain kernels are available.

Review comment on ggml/src/ggml-cuda/mmv.cu (outdated, resolved)
@JohannesGaessler changed the title from "CUDA: fix mul_mat_vec for CC 6.0" to "CUDA: use arch list for compatibility check" on Feb 10, 2025
@JohannesGaessler
Collaborator Author

I pushed a higher-effort fix. I think the correct way to do it is to change functions like fast_fp16_available to consider both the compute capability of the device and the compute capabilities that the code was compiled for. This can be done with a function ggml_cuda_highest_compiled_arch that returns the highest compute capability for which the code was compiled and that is still <= the device compute capability. The implementation can be done with a switch statement that uses ggml_cuda_has_arch. The compiler should be able to resolve ggml_cuda_has_arch at compile time, so the switch statement is the only overhead. Whenever NVIDIA releases a new GPU the switch statement will need to be expanded, but I think that is negligible in terms of maintenance effort.

I'm using Manjaro with CUDA 12.6 on my systems. For whatever reason the CUDA cross-compile is broken (it fails when trying to run the code) and so far I have not bothered to debug why because I don't need it. So the code I pushed is not properly tested. Either someone else needs to verify that it works correctly by compiling only for compute capability 5.2, or reviewing will need to wait until I've gotten around to fixing my setup.
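
For reference, a minimal sketch of the approach described above (the helper names follow the comment and __CUDA_ARCH_LIST__ is the nvcc-provided macro slaren mentioned; the concrete arch values and the descending loop standing in for the switch statement are assumptions, not the merged code):

```cpp
#include <initializer_list>

// __CUDA_ARCH_LIST__ is defined by nvcc (CUDA 11.5+) as a comma-separated list
// of the virtual archs the code was compiled for, e.g. 520,610,700,750.
static constexpr bool ggml_cuda_has_arch_impl(int) {
    return false;
}

template <class ... Archs>
static constexpr bool ggml_cuda_has_arch_impl(const int arch, const int first, Archs ... rest) {
    return arch == first || ggml_cuda_has_arch_impl(arch, rest...);
}

static constexpr bool ggml_cuda_has_arch(const int arch) {
    return ggml_cuda_has_arch_impl(arch, __CUDA_ARCH_LIST__);
}

// Highest arch the code was compiled for that is still <= the device arch.
// The comment above proposes a switch statement; an equivalent descending scan
// over known archs is shown here for brevity. New arch values would be added
// to this list as NVIDIA releases new GPUs.
static int ggml_cuda_highest_compiled_arch(const int arch) {
    for (const int candidate : {900, 890, 860, 800, 750, 700, 610, 600, 520}) {
        if (candidate <= arch && ggml_cuda_has_arch(candidate)) {
            return candidate;
        }
    }
    return 0; // no compiled arch is usable on this device
}
```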

@slaren
Collaborator

slaren left a comment

Not sure if that's what you need, but I can compile with -DCMAKE_CUDA_ARCHITECTURES=52 and it runs.

@JohannesGaessler
Collaborator Author

If an FP16 model or test-backend-ops -o MUL_MAT works, that would be equivalent to the originally reported issue being fixed.

@slaren
Collaborator

slaren commented Feb 10, 2025

test-backend-ops crashes when built for arch 52 only:

  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): ========= Invalid __shared__ write of size 4 bytes
=========     at void mul_mat_q<(ggml_type)2, (int)16, (int)8, (bool)1>(const char *, const char *, float *, float *, int, int, int, int, int, int, int)+0xb50
=========     by thread (0,3,0) in block (42,0,0)
=========     Address 0xfffaa9bc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x22f5d7]
=========                in /usr/lib/wsl/drivers/nv_dispig.inf_amd64_4b72fdeebb3aa5bd/libcuda.so.1.1
=========     Host Frame: [0x15aa7]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x759f0]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaError cudaLaunchKernel<char>(char*, dim3, dim3, void**, unsigned long, CUstream_st*) in /usr/local/cuda/targets/x86_64-linux/include/cuda_runtime.h:216 [0x2cc788]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:__device_stub__Z9mul_mat_qIL9ggml_type2ELi16ELi8ELb1EEvPKcS2_PfS3_iiiiiii(char const*, char const*, float*, float*, int, int, int, int, int, int, int) in /tmp/tmpxft_0006b639_00000000-6_mmq-instance-q4_0.cudafe1.stub.c:106 [0x2c1516]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void __wrapper__device_stub_mul_mat_q<(ggml_type)2, 16, 8, true>(char const* restrict&, char const* restrict&, float* restrict&, float* restrict&, int const&, int const&, int const&, int const&, int const&, int const&, int const&) in /tmp/tmpxft_0006b639_00000000-6_mmq-instance-q4_0.cudafe1.stub.c:107 [0x2c15bd]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void mul_mat_q<(ggml_type)2, 16, 8, true>(char const*, char const*, float*, float*, int, int, int, int, int, int, int) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2588 [0x2cc971]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void launch_mul_mat_q<(ggml_type)2, 16>(ggml_backend_cuda_context&, mmq_args const&, CUstream_st*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2813 [0x2ceed4]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void mul_mat_q_case<(ggml_type)2>(ggml_backend_cuda_context&, mmq_args const&, CUstream_st*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2858 [0x2d621c]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_op_mul_mat_q(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/mmq.cu:35 [0x13e164]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1617 [0x124e06]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1901 [0x12688e]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2222 [0x127d51]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, std::vector<void*, std::allocator<void*> >&, bool&, bool&, bool&) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2655 [0x12994c]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2767 [0x129ffa]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_backend_graph_compute_async in /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:332 [0x66018]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:ggml_backend_graph_compute in /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:326 [0x65fd8]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:ggml_backend_compare_graph_backend in /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1835 [0x6be5f]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:test_case::eval(ggml_backend*, ggml_backend*, char const*) in /home/diego/code/llama.cpp/tests/test-backend-ops.cpp:562 [0x30e28]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops
=========     Host Frame:test_backend(ggml_backend*, test_mode, char const*) in /home/diego/code/llama.cpp/tests/test-backend-ops.cpp:4374 [0x27478]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops
=========     Host Frame:main in /home/diego/code/llama.cpp/tests/test-backend-ops.cpp:4490 [0x27c98]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x2a1c9]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:360 [0x2a28a]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x12b54]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops

@JohannesGaessler
Collaborator Author

Noted, but this is 100% already an issue on master (and so far no one noticed). Does test-backend-ops -o MUL_MAT work if you run it only for FP16?

@JohannesGaessler
Collaborator Author

Actually, if you edit ggml_cuda_should_use_mmq to use the new code, you can condition the use of MMQ on the compiled CUDA arches, so the issue should be fixed now.
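
A hypothetical, heavily simplified sketch of that condition, using the ggml_cuda_highest_compiled_arch helper sketched earlier (the real ggml_cuda_should_use_mmq has a richer signature and more checks, and the threshold below is illustrative only):

```cpp
// Gate MMQ on the arch that device code was actually compiled for,
// not just on what the hardware reports.
static bool should_use_mmq_sketch(const int cc) {
    const int compiled_cc = ggml_cuda_highest_compiled_arch(cc);
    return compiled_cc >= 610; // illustrative threshold, not the real condition
}
```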

@slaren
Collaborator

slaren commented Feb 10, 2025

The F16 test cases and models work. With the latest commit it fails in a different case:

 MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): /home/diego/code/llama.cpp/ggml/src/ggml-cuda/convert.cu:66: ERROR: CUDA kernel dequantize_block_q8_0_f16 has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520

@JohannesGaessler
Collaborator Author

I went through the uses of ggml_cuda_info.cc and encapsulated any host-code checks against NVIDIA compute capabilities in XY_available functions. There are only two types of exceptions. The first is in MMQ, where the compute capability is also used to determine the tile sizes and whether stream-k decomposition should be used. The second is checks that determine whether some cuBLAS functionality should be used; we only link against cuBLAS, so our compilation configuration is not relevant for cuBLAS performance/feature availability.

@JohannesGaessler
Collaborator Author

Just after I pressed enter on the previous post I realized that the logic for ggml_cuda_mul_mat was wrong because it made decisions about cuBLAS based on llama.cpp feature availability. I added new functions fast_fp16_hardware_available and fp16_mma_hardware_available for use with external libraries.

Also I reverted an incorrect change to the logic regarding whether FlashAttention kernels are supported.
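
A sketch of the resulting split (function names taken from the comments above; the compute-capability threshold comes from the PR description and the bodies are assumptions, not the merged code; ggml_cuda_highest_compiled_arch is the helper sketched earlier):

```cpp
// Hardware capability only: used when dispatching to external libraries such
// as cuBLAS, whose feature availability does not depend on how ggml was built.
static bool fast_fp16_hardware_available(const int cc) {
    return cc >= 600; // FP16 intrinsics arrived with compute capability 6.0 (GP100)
}

// Our own kernels additionally require that device code was compiled for an
// arch that actually supports the intrinsics.
static bool fast_fp16_available(const int cc) {
    return fast_fp16_hardware_available(ggml_cuda_highest_compiled_arch(cc));
}
```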

@JohannesGaessler merged commit b9ab0a4 into ggerganov:master on Feb 10, 2025
46 checks passed