Finetuning 2-bit Quantized Models #115
Comments
I think we'd better wait for someone to succeed in quantizing models to 2-bit without much performance loss (like QLoRA, but I'm not sure about its performance).
So, actually, we know how to do it in two bits! We're a team of researchers at Cornell and we have working prototypes of two-bit compression that achieve good perplexity at inference time. We would now like to explore finetuning, and your codebase is very helpful to us. There are actually two ways of doing it: one is a new algorithm that we will soon put on arXiv, but even vanilla GPTQ performs somewhat acceptably in 2 bits on the largest LLMs (see Table 7 in their paper). Would you be interested in talking more about these experiments and exchanging code or ideas?
Thanks for showing interest in my code!
Hello, I have been working on similar topics (2 bits and lower). However, I have noticed that the PPL calculated using the current main branch is consistently higher than with the original GPTQ-Triton. I'm interested in understanding the reasons behind this difference. Could you please provide some insight into whether this could be due to version alignment issues or other factors? I would appreciate any ideas or suggestions for investigating this further. Thank you, @johnsmith0031
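For context, the comparison presumably follows the usual GPTQ-style perplexity evaluation: split the tokenized test set into fixed-length chunks and average the causal-LM cross-entropy. A minimal sketch of that loop, assuming a HuggingFace causal LM (seqlen, dataset split, and fp16 casting all need to line up for numbers from different branches to be comparable):

```python
import torch

@torch.no_grad()
def eval_ppl(model, input_ids, seqlen=2048):
    # input_ids: a (1, n_tokens) tensor holding the tokenized test set.
    # Accumulate the causal-LM loss over fixed-length chunks, mirroring
    # the evaluation loop used in the GPTQ repos.
    nlls = []
    n_chunks = input_ids.numel() // seqlen
    for i in range(n_chunks):
        chunk = input_ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
```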
@johnsmith0031 Thank you! The two-bit extension makes a lot of sense. I'm working on modifying it with custom groupsizes and can submit the code as a PR if you're interested. The part that has been giving me the most trouble is the 3-bit one. I'm not sure I understand the implementation well enough to figure out how to unpack and return the weight matrix in CUDA. If you happen to have played with that and have code you could share, that would be helpful, but no worries if not.
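For anyone following along, here is a rough idea of what the 2-bit unpacking looks like in plain PyTorch rather than CUDA. This is only a sketch: `qweight`, `scales`, and `zeros` follow a GPTQ-style packed layout, but the exact storage format in this repo may differ, and real kernels typically keep the zeros packed as well.

```python
import torch

def unpack_2bit(qweight, scales, zeros, groupsize=128):
    # Sketch: qweight is int32 of shape (in_features // 16, out_features),
    # holding 16 two-bit values per int32. scales / zeros are float tensors
    # of shape (in_features // groupsize, out_features).
    bits = 2
    shifts = torch.arange(0, 32, bits, device=qweight.device).view(1, -1, 1)
    # (rows, 1, out) >> (1, 16, 1) -> (rows, 16, out), then mask to 2 bits.
    unpacked = (qweight.unsqueeze(1) >> shifts) & 0x3
    unpacked = unpacked.reshape(-1, qweight.shape[1])        # (in_features, out_features)
    # Map each row of the full matrix to its quantization group and dequantize.
    group_idx = torch.arange(unpacked.shape[0], device=qweight.device) // groupsize
    w = (unpacked.float() - zeros[group_idx]) * scales[group_idx]
    return w.half()
```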
Not sure about the reason, but maybe it is because of the difference in matrix multiplication. I first reconstruct the matrix to float16 and use torch.matmul for the matrix multiplication (but only in the case where batch_size * seq_len > 8), which is different from both the original CUDA kernel and the Triton kernel.
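In pseudocode, the dispatch described above looks roughly like this (a sketch only; `dequantize` and `fused_kernel` are caller-supplied placeholders for the repo's actual unpacking code and quantized matmul kernel):

```python
import torch

def quant_linear_forward(x, qweight, scales, zeros, dequantize, fused_kernel, bias=None):
    # Sketch of the dispatch described above.
    batch_size, seq_len, _ = x.shape
    if batch_size * seq_len > 8:
        # Larger inputs: reconstruct the full float16 weight once and
        # let torch.matmul (cuBLAS) do the work.
        w = dequantize(qweight, scales, zeros)   # (in_features, out_features), fp16
        out = torch.matmul(x.to(w.dtype), w)
    else:
        # Small inputs (e.g. single-token decoding): call the fused quantized kernel.
        out = fused_kernel(x, qweight, scales, zeros)
    if bias is not None:
        out = out + bias
    return out
```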
Yes, 3-bit seems to be more complicated than 4-bit and 2-bit...
Hey @johnsmith0031, thank you for this great repo! I was wondering whether you have tried implementing the backward pass for 2-bit or 3-bit quantized models?
I would really like to try it as an experiment. If you have any existing work on 2-bit or 3-bit autograd, I would love to contribute to it and submit a PR to this repo. Or, as a first step, I could run it and share experimental results.
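If it helps as a starting point, one way to prototype the backward pass is a custom autograd function that dequantizes the frozen packed weight in both directions and only returns a gradient with respect to the input, leaving the trainable parameters (e.g. LoRA adapters) outside the function. A rough sketch, with `dequantize` passed in as a placeholder for the actual 2-bit/3-bit unpacking:

```python
import torch

class QuantLinearFunction(torch.autograd.Function):
    # Sketch of a backward pass for a frozen, quantized linear layer.
    # `dequantize` is a caller-supplied function standing in for the
    # actual 2-bit / 3-bit unpacking code.

    @staticmethod
    def forward(ctx, x, qweight, scales, zeros, dequantize):
        w = dequantize(qweight, scales, zeros)   # (in_features, out_features), fp16
        ctx.save_for_backward(qweight, scales, zeros)
        ctx.dequantize = dequantize
        return x @ w

    @staticmethod
    def backward(ctx, grad_output):
        qweight, scales, zeros = ctx.saved_tensors
        w = ctx.dequantize(qweight, scales, zeros)
        grad_input = grad_output @ w.t()         # gradient w.r.t. the input only
        # The packed weights stay frozen, so no gradient is returned for them;
        # trainable parameters (e.g. LoRA adapters) live outside this function.
        return grad_input, None, None, None, None
```

Usage would be roughly `QuantLinearFunction.apply(x, qweight, scales, zeros, unpack_fn)`, where `unpack_fn` is whatever dequantization routine the layer already uses for its forward pass.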