
CUDA: use arch list for compatibility check #11775

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #10318 (comment).

The problem is that by default the code is compiled for compute capabilities 5.2, 6.1, 7.0, and 7.5. A GP100 has compute capability 6.0, the minimum for FP16 intrinsics. The host code reports that it can do MMV with those intrinsics, but without GGML_CUDA_F16 there is no actual device code available. This PR is more of a band-aid fix that simply makes GPUs with compute capability 6.0 use FP32 arithmetic if the code was not compiled with GGML_CUDA_F16. Medium-term I intend to revise the handling of these intrinsics and will do a proper fix at that time.
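
For illustration, a minimal sketch of the band-aid described above (hypothetical helper name mmv_use_fp16 and code shape assumed; not the actual diff):

```cpp
// Hypothetical sketch, not the actual diff: only take the FP16 MMV path when
// device code for it was actually built, i.e. when GGML_CUDA_F16 was defined.
static bool mmv_use_fp16(const int cc) { // cc = device compute capability, e.g. 600 for GP100
#ifdef GGML_CUDA_F16
    return cc >= 600; // compute capability 6.0 is the minimum for FP16 intrinsics
#else
    (void) cc;        // without GGML_CUDA_F16 there is no FP16 device code,
    return false;     // so fall back to FP32 arithmetic
#endif
}
```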

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 9, 2025
@slaren
Collaborator

slaren commented Feb 9, 2025

Wouldn't it be possible to check whether the architecture is in __CUDA_ARCH_LIST__? Maybe we should do that more often instead of assuming that certain kernels are available.

Review comment on ggml/src/ggml-cuda/mmv.cu (outdated, resolved)
@JohannesGaessler changed the title from "CUDA: fix mul_mat_vec for CC 6.0" to "CUDA: use arch list for compatibility check" on Feb 10, 2025
@JohannesGaessler
Collaborator Author

I pushed a higher-effort fix. I think the correct way to do it is to change functions like fast_fp16_available to consider both the compute capability of the device and the compute capabilities that the code was compiled for. This can be done with a function ggml_cuda_highest_compiled_arch that returns the highest compute capability for which the code was compiled and that is still <= the device compute capability. The implementation can be done with a switch statement that uses ggml_cuda_has_arch. The compiler should be able to resolve ggml_cuda_has_arch at compile time, so the switch statement is the only overhead. Whenever NVIDIA releases a new GPU the switch statement will need to be expanded, but I think that is negligible in terms of maintenance effort.

I'm using Manjaro with CUDA 12.6 on my systems. For whatever reason the CUDA cross-compile is broken (it fails when trying to run the code) and so far I have not bothered to debug why because I don't need it. So the code I pushed is not properly tested. Either someone else needs to verify that it works correctly by compiling only for compute capability 5.2, or reviewing will need to wait until I've gotten around to fixing my setup.
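
For reference, a minimal sketch of the approach described above (the helper names follow the comment and __CUDA_ARCH_LIST__ is the nvcc-provided macro slaren mentioned; the concrete arch values and the descending loop standing in for the switch statement are assumptions, not the merged code):

```cpp
#include <initializer_list>

// __CUDA_ARCH_LIST__ is defined by nvcc (CUDA 11.5+) as a comma-separated list
// of the virtual archs the code was compiled for, e.g. 520,610,700,750.
static constexpr bool ggml_cuda_has_arch_impl(int) {
    return false;
}

template <class ... Archs>
static constexpr bool ggml_cuda_has_arch_impl(const int arch, const int first, Archs ... rest) {
    return arch == first || ggml_cuda_has_arch_impl(arch, rest...);
}

static constexpr bool ggml_cuda_has_arch(const int arch) {
    return ggml_cuda_has_arch_impl(arch, __CUDA_ARCH_LIST__);
}

// Highest arch the code was compiled for that is still <= the device arch.
// The comment above proposes a switch statement; an equivalent descending scan
// over known archs is shown here for brevity. New arch values would be added
// to this list as NVIDIA releases new GPUs.
static int ggml_cuda_highest_compiled_arch(const int arch) {
    for (const int candidate : {900, 890, 860, 800, 750, 700, 610, 600, 520}) {
        if (candidate <= arch && ggml_cuda_has_arch(candidate)) {
            return candidate;
        }
    }
    return 0; // no compiled arch is usable on this device
}
```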

@slaren
Collaborator

slaren left a comment

Not sure if that's what you need, but I can compile with -DCMAKE_CUDA_ARCHITECTURES=52 and it runs.

@JohannesGaessler
Collaborator Author

If an FP16 model or test-backend-ops -o MUL_MAT works, that would be equivalent to the originally reported issue being fixed.

@slaren
Collaborator

slaren commented Feb 10, 2025

test-backend-ops crashes when built for arch 52 only:

  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): ========= Invalid __shared__ write of size 4 bytes
=========     at void mul_mat_q<(ggml_type)2, (int)16, (int)8, (bool)1>(const char *, const char *, float *, float *, int, int, int, int, int, int, int)+0xb50
=========     by thread (0,3,0) in block (42,0,0)
=========     Address 0xfffaa9bc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x22f5d7]
=========                in /usr/lib/wsl/drivers/nv_dispig.inf_amd64_4b72fdeebb3aa5bd/libcuda.so.1.1
=========     Host Frame: [0x15aa7]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x759f0]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaError cudaLaunchKernel<char>(char*, dim3, dim3, void**, unsigned long, CUstream_st*) in /usr/local/cuda/targets/x86_64-linux/include/cuda_runtime.h:216 [0x2cc788]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:__device_stub__Z9mul_mat_qIL9ggml_type2ELi16ELi8ELb1EEvPKcS2_PfS3_iiiiiii(char const*, char const*, float*, float*, int, int, int, int, int, int, int) in /tmp/tmpxft_0006b639_00000000-6_mmq-instance-q4_0.cudafe1.stub.c:106 [0x2c1516]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void __wrapper__device_stub_mul_mat_q<(ggml_type)2, 16, 8, true>(char const* restrict&, char const* restrict&, float* restrict&, float* restrict&, int const&, int const&, int const&, int const&, int const&, int const&, int const&) in /tmp/tmpxft_0006b639_00000000-6_mmq-instance-q4_0.cudafe1.stub.c:107 [0x2c15bd]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void mul_mat_q<(ggml_type)2, 16, 8, true>(char const*, char const*, float*, float*, int, int, int, int, int, int, int) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2588 [0x2cc971]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void launch_mul_mat_q<(ggml_type)2, 16>(ggml_backend_cuda_context&, mmq_args const&, CUstream_st*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2813 [0x2ceed4]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:void mul_mat_q_case<(ggml_type)2>(ggml_backend_cuda_context&, mmq_args const&, CUstream_st*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2858 [0x2d621c]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_op_mul_mat_q(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/mmq.cu:35 [0x13e164]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1617 [0x124e06]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1901 [0x12688e]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2222 [0x127d51]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, std::vector<void*, std::allocator<void*> >&, bool&, bool&, bool&) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2655 [0x12994c]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) in /home/diego/code/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2767 [0x129ffa]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-cuda.so
=========     Host Frame:ggml_backend_graph_compute_async in /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:332 [0x66018]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:ggml_backend_graph_compute in /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:326 [0x65fd8]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:ggml_backend_compare_graph_backend in /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1835 [0x6be5f]
=========                in /home/diego/code/llama.cpp/build/bin/libggml-base.so
=========     Host Frame:test_case::eval(ggml_backend*, ggml_backend*, char const*) in /home/diego/code/llama.cpp/tests/test-backend-ops.cpp:562 [0x30e28]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops
=========     Host Frame:test_backend(ggml_backend*, test_mode, char const*) in /home/diego/code/llama.cpp/tests/test-backend-ops.cpp:4374 [0x27478]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops
=========     Host Frame:main in /home/diego/code/llama.cpp/tests/test-backend-ops.cpp:4490 [0x27c98]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x2a1c9]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:360 [0x2a28a]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x12b54]
=========                in /home/diego/code/llama.cpp/build/bin/test-backend-ops

@JohannesGaessler
Collaborator Author

Noted, but this is 100% already an issue on master (and so far no one noticed). Does test-backend-ops -o MUL_MAT work if you run it only for FP16?

@JohannesGaessler
Collaborator Author

Actually, if you edit ggml_cuda_should_use_mmq to use the new code, you can condition the use of MMQ on the compiled CUDA arches, so the issue should be fixed now.
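
A hypothetical, heavily simplified sketch of that condition, using the ggml_cuda_highest_compiled_arch helper sketched earlier (the real ggml_cuda_should_use_mmq has a richer signature and more checks, and the threshold below is illustrative only):

```cpp
// Gate MMQ on the arch that device code was actually compiled for,
// not just on what the hardware reports.
static bool should_use_mmq_sketch(const int cc) {
    const int compiled_cc = ggml_cuda_highest_compiled_arch(cc);
    return compiled_cc >= 610; // illustrative threshold, not the real condition
}
```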

@slaren
Collaborator

slaren commented Feb 10, 2025

The F16 test cases and models work. With the latest commit it fails in a different case:

 MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): /home/diego/code/llama.cpp/ggml/src/ggml-cuda/convert.cu:66: ERROR: CUDA kernel dequantize_block_q8_0_f16 has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520

@JohannesGaessler
Collaborator Author

I went through the uses of ggml_cuda_info.cc and encapsulated any host-code checks against NVIDIA compute capabilities in XY_available functions. There are only two types of exceptions. The first is in MMQ, where the compute capability is also used to determine the tile sizes and whether stream-k decomposition should be used. The second is checks that determine whether some cuBLAS functionality should be used; we only link against cuBLAS, so our compilation configuration is not relevant for cuBLAS performance/feature availability.

@JohannesGaessler
Collaborator Author

Just after I pressed enter on the previous post I realized that the logic for ggml_cuda_mul_mat was wrong because it made decisions about cuBLAS based on llama.cpp feature availability. I added new functions fast_fp16_hardware_available and fp16_mma_hardware_available for use with external libraries.

Also I reverted an incorrect change to the logic regarding whether FlashAttention kernels are supported.
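
A sketch of the resulting split (function names taken from the comments above; the compute-capability threshold comes from the PR description and the bodies are assumptions, not the merged code; ggml_cuda_highest_compiled_arch is the helper sketched earlier):

```cpp
// Hardware capability only: used when dispatching to external libraries such
// as cuBLAS, whose feature availability does not depend on how ggml was built.
static bool fast_fp16_hardware_available(const int cc) {
    return cc >= 600; // FP16 intrinsics arrived with compute capability 6.0 (GP100)
}

// Our own kernels additionally require that device code was compiled for an
// arch that actually supports the intrinsics.
static bool fast_fp16_available(const int cc) {
    return fast_fp16_hardware_available(ggml_cuda_highest_compiled_arch(cc));
}
```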

@JohannesGaessler merged commit b9ab0a4 into ggerganov:master on Feb 10, 2025
46 checks passed