ggml_cuda_cpy: support for 4d tensors and float16->float32 upcasting #686
Conversation
Alright, I added tests for both the 4d copy behavior and the float16->float32 upcast. My new tests seem to pass, though someone else should probably review them. Please let me know if any more polish is needed.
Uh, so one note is that the 4d copy doesn't seem to work with copies from type
(aside from the quant types, this is ready for review)
I tried running
All the quantization tests pass if you replace line 4976 in
with

But then it seems that requires that the first dimension of the

I also tested this PR+fix on a bert.cpp revamp I'm working on, and it lets you do all the batched attention stuff on CUDA without reshaping tricks (and the numbers come out right).

Edit: Spoke too soon. That fixes the CPY stuff but breaks MOE on CUDA. I guess it doesn't satisfy the divisibility criterion?
I think that's a fair assumption. The previous implementation also divided the calculation of
How does it break it?
Huh, I just recompiled and ran it again, and all the tests pass, including MOE. So I guess we're all set!
Ah yes, the MOE test is expected to fail occasionally due to small variations that can lead to selecting different experts between the CPU and the GPU.
@balisujohn Let's apply the proposed patch and we can merge |
This PR adds support for 4d tensors to ggml_cuda_cpy. It seems to work in tortoise.cpp, but I only tested that it compiles in this codebase, so some tests may be necessary to ensure the behavior is correct. As with my other PR, I wasn't able to find any existing tests for the CUDA backend, so let me know how to add tests for this.

This PR also adds support for upcasting float16 to float32 in the ggml_cuda_cpy operation with the CUDA backend.
Fixes this issue: #672