flashinfer shrink vs cutlass #25
Thanks for taking a close look! We'll deprecate the cutlass implementation in the future. See the discussion here: #2
Makes sense, thanks! So it seems the recommendation is to use the hand-written version for shrink (https://github.com/punica-ai/punica/blob/master/csrc/sgmv_flashinfer/sgmv_flashinfer.cuh) and, in the meantime, the cutlass-based version for expand.
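For anyone else following along, the shrink/expand split comes from LoRA: shrink projects the hidden dimension d down to the adapter rank r, and expand projects r back up to d, with each request's segment of rows routed to its own adapter. A minimal NumPy sketch of the SGMV semantics (a hypothetical reference, not Punica's actual CUDA API; `seg_starts` and the list-of-matrices layout are assumptions for illustration):

```python
import numpy as np

# Hypothetical reference semantics of SGMV (segmented gather matrix-vector):
# each contiguous segment of rows in x is multiplied by its own adapter matrix.

def sgmv_shrink(x, A, seg_starts):
    # x: (n, d); A: list of per-adapter (d, r) matrices;
    # seg_starts: len(A) + 1 segment boundaries into the rows of x.
    r = A[0].shape[1]
    y = np.zeros((x.shape[0], r), dtype=x.dtype)
    for i in range(len(A)):
        s, e = seg_starts[i], seg_starts[i + 1]
        y[s:e] = x[s:e] @ A[i]  # project d -> r for this segment
    return y

def sgmv_expand(v, B, seg_starts):
    # v: (n, r); B: list of per-adapter (r, d) matrices.
    d = B[0].shape[1]
    y = np.zeros((v.shape[0], d), dtype=v.dtype)
    for i in range(len(B)):
        s, e = seg_starts[i], seg_starts[i + 1]
        y[s:e] = v[s:e] @ B[i]  # project r -> d for this segment
    return y
```

The full LoRA delta for a batch is then `sgmv_expand(sgmv_shrink(x, A, seg), B, seg)`, added onto the base model's output. The CUDA kernels discussed in this thread fuse and parallelize exactly this per-segment loop.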
Correct. Once we get time to push out the custom expand kernel, we'll deprecate cutlass. Related: #11
Sounds great, thanks!
@abcdabcd987 Can't wait for the customized version. So far we use the current version in production, and performance seems good for multi-LoRA deployment.
@jcao-ai Glad that Punica got deployed and serves your usage :) |
Hi, I really enjoyed learning about SGMV.
I was grokking through the code and wanted to check my understanding. It seems there are two implementations of SGMV: one based on cutlass grouped GEMM and a hand-written one (using some utilities from flashinfer). Just wondering, how do the two compare in performance?
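One way to answer the performance question yourself is a small micro-benchmark that times both implementations on the same segmented workload. A hedged CPU-side sketch of the pattern (NumPy stand-ins for illustration only; the real comparison would time the two CUDA kernels with proper device synchronization):

```python
import time
import numpy as np

def run_segmented(x, mats, seg_starts):
    # Loop-based segmented matmul: one GEMM per adapter segment,
    # mimicking the workload shape an SGMV kernel would see.
    out = np.empty((x.shape[0], mats[0].shape[1]), dtype=x.dtype)
    for i, m in enumerate(mats):
        s, e = seg_starts[i], seg_starts[i + 1]
        out[s:e] = x[s:e] @ m
    return out

def bench(fn, *args, iters=10):
    fn(*args)  # warm-up run before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters  # mean seconds per call

# Assumed problem sizes, for illustration: 4 adapters of rank 16,
# hidden dim 128, 256 total rows split into equal segments.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 128)).astype(np.float32)
mats = [rng.standard_normal((128, 16)).astype(np.float32) for _ in range(4)]
seg = [0, 64, 128, 192, 256]
print(f"segmented matmul: {bench(run_segmented, x, mats, seg) * 1e6:.1f} us/iter")
```

The same harness can time a second implementation of the identical operation and report both numbers, which is roughly what a kernel benchmark for the cutlass vs. flashinfer-based paths would do (with CUDA events instead of `perf_counter`).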