
Finetuning 2-bit Quantized Models #115

Open
kuleshov opened this issue May 29, 2023 · 7 comments

@kuleshov
Hey @johnsmith0031, thank you for this great repo! I was wondering if you tried implementing the backward pass for 2-bit or 3-bit quantized models?

I would really like to try it as an experiment. If you have any existing work on 2-bit or 3-bit autograd, I would love to contribute to it and submit a PR to this repo. Or as a first step, I could run it and share experimental results.

@johnsmith0031
Owner

I think we'd better wait until someone succeeds in quantizing models to 2-bit without much performance loss (something like QLoRA, though I'm not sure about its performance).

@kuleshov
Author

So, actually, we know how to do it in two bits! We're a team of researchers at Cornell, and we have working prototypes of two-bit compression that achieve good perplexity at inference time. We would now like to explore finetuning, and your codebase is very helpful to us.

There are actually two ways of doing it: one is a new algorithm that we will soon put on arXiv, but even the vanilla GPTQ model performs somewhat acceptably in 2 bits on the largest LLMs (see Table 7 in their paper).

Would you be interested in talking more about these experiments and exchanging code or ideas?

@johnsmith0031
Owner

Thanks for showing interest in my code!
I added 2-bit reconstruction functions to the CUDA kernel in another branch. I think you can adjust the code if needed.
https://github.com/johnsmith0031/alpaca_lora_4bit/tree/2bit_support
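
For reference, here is a minimal PyTorch-level sketch of the reconstruction idea. This is not the CUDA kernel from the 2bit_support branch; it assumes GPTQ-style packing (16 two-bit values per int32 along the input dimension) and per-group scales/zeros that are already unpacked to full tensors (in the repo, zeros are themselves packed), and the function name is made up for illustration.

```python
import torch

def unpack_2bit(qweight: torch.Tensor, scales: torch.Tensor,
                zeros: torch.Tensor, groupsize: int) -> torch.Tensor:
    """Reconstruct a float16 weight matrix from 2-bit packed weights.

    Assumes qweight has shape (in_features // 16, out_features), with 16
    two-bit values packed LSB-first into each int32, and scales/zeros have
    shape (in_features // groupsize, out_features).
    """
    shifts = torch.arange(0, 32, 2, device=qweight.device, dtype=torch.int32)
    # Expand each int32 into its 16 packed 2-bit fields.
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0x3   # (rows, 16, out)
    w = w.reshape(-1, qweight.shape[1])                          # (in_features, out)
    # Broadcast the per-group scale and zero over the rows of each group.
    g = torch.arange(w.shape[0], device=w.device) // groupsize
    return (w.to(torch.float16) - zeros[g].to(torch.float16)) * scales[g]
```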

@NicoNico6

Hello, I have been working on similar topics (2-bit and lower). However, I have noticed that the PPL calculated with the current main branch is consistently higher than with the original GPTQ-Triton. I'm interested in understanding the reason for this difference. Could it be due to version-alignment issues or other factors? I would appreciate any ideas or suggestions for investigating this further. Thank you, @johnsmith0031

@kuleshov
Author

@johnsmith0031 Thank you! The two-bit extension makes a lot of sense. I'm working on modifying it to support custom groupsizes, and I can return the code as a PR if you're interested.

The part that has been giving me more trouble is the 3-bit one. I'm not sure I understand the implementation well enough to figure out how to unpack and return the weight matrix in CUDA. If you happen to have played with that and have code you could share, that would be helpful, but no worries if not.

@johnsmith0031
Owner

> I have noticed that the PPL calculated with the current main branch is consistently higher than with the original GPTQ-Triton. Could this be due to version-alignment issues or other factors?

Not sure about the reason, but maybe it's because of the difference in the matrix multiplication. I first reconstruct the matrix to float16 and use torch.matmul for the matrix multiplication (but only when batch_size * seq_len > 8), which is different from both the original CUDA and Triton kernels.
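
Roughly, that dispatch looks like the sketch below (reusing the hypothetical unpack_2bit from the earlier comment; the small-batch branch is only a placeholder, since the actual path calls the custom CUDA matmul on the packed weights rather than dequantizing first).

```python
import torch

def quant_linear_forward(x, qweight, scales, zeros, groupsize, bias=None):
    """Choose between a fused quant kernel and dequantize-then-matmul."""
    tokens = x.shape[0] * x.shape[1] if x.dim() == 3 else x.shape[0]
    if tokens > 8:
        # Large activations: reconstruct the weight to float16 and use torch.matmul.
        w = unpack_2bit(qweight, scales, zeros, groupsize)  # (in, out), float16
        out = torch.matmul(x.to(torch.float16), w)
    else:
        # Small activations: the repo runs a fused CUDA matmul over the packed
        # qweight here; this sketch just falls back to the same dequantized path.
        w = unpack_2bit(qweight, scales, zeros, groupsize)
        out = torch.matmul(x.to(torch.float16), w)
    return out if bias is None else out + bias
```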

@johnsmith0031
Owner

> The part that has been giving me more trouble is the 3-bit one. I'm not sure I understand the implementation well enough to figure out how to unpack and return the weight matrix in CUDA.

Yes, 3-bit seems to be more complicated than 4-bit and 2-bit...
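
A toy example of why: 3 does not divide 32, so some 3-bit fields straddle int32 word boundaries (in a 3-word block, values 10 and 21 sit at bits 30–32 and 63–65). The snippet below assumes a simple LSB-first packing of 32 values across three int32 words, which is not the exact bit layout GPTQ uses; it is only meant to illustrate the boundary-crossing issue.

```python
import torch

def unpack_3bit_block(words: torch.Tensor) -> torch.Tensor:
    """Unpack 32 three-bit values from 3 consecutive int32 words.

    `words` is a 1-D int32 tensor of length 3; returns the 32 values.
    """
    bits = 0
    for i in range(3):
        # Concatenate the three 32-bit words into one 96-bit integer.
        bits |= (int(words[i]) & 0xFFFFFFFF) << (32 * i)
    # Slice the 96 bits into 32 contiguous 3-bit fields.
    vals = [(bits >> (3 * j)) & 0x7 for j in range(32)]
    return torch.tensor(vals, dtype=torch.int32)
```

With 4-bit and 2-bit packing, every value sits entirely inside a single word, so the unpack is a plain shift-and-mask; the fields that cross word boundaries are what make the 3-bit CUDA unpack messier.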
