Finetuning 2-bit Quantized Models #115
Comments
I think we'd better wait for someone to succeed in quantizing models to 2-bit without much performance loss (like QLoRA, but I'm not sure about its performance).
So, actually, we know how to do it in two bits! We're a team of researchers at Cornell and we have working prototypes of two-bit compression that achieve good perplexity at inference time. We would now like to explore finetuning, and your codebase is very helpful to us. There are actually two ways of doing it: one is a new algorithm that we will soon put on arXiv, but even vanilla GPTQ performs somewhat acceptably in 2 bits on the largest LLMs (see Table 7 in their paper). Would you be interested in talking more about these experiments and exchanging code or ideas?
Thanks for showing interest in my code!
Hello, I have been working on similar topics (2 bits and lower). However, I have noticed that the PPL calculated using the current main branch is consistently higher than with the original GPTQ-Triton. I'm interested in understanding the reasons behind this difference. Could you please provide some insight into whether this could be due to version alignment issues or other factors? I would appreciate any ideas or suggestions for investigating this further. Thank you, @johnsmith0031
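For context, the comparison presumably follows the usual GPTQ-style perplexity evaluation: split the tokenized test set into fixed-length chunks and average the causal-LM cross-entropy. A minimal sketch of that loop, assuming a HuggingFace causal LM (seqlen, dataset split, and fp16 casting all need to line up for numbers from different branches to be comparable):

```python
import torch

@torch.no_grad()
def eval_ppl(model, input_ids, seqlen=2048):
    # input_ids: a (1, n_tokens) tensor holding the tokenized test set.
    # Accumulate the causal-LM loss over fixed-length chunks, mirroring
    # the evaluation loop used in the GPTQ repos.
    nlls = []
    n_chunks = input_ids.numel() // seqlen
    for i in range(n_chunks):
        chunk = input_ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
```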
@johnsmith0031 Thank you! The two-bit extension makes a lot of sense. I'm working on modifying it with custom groupsizes and can submit the code as a PR if you're interested. The part that has been giving me the most trouble is the 3-bit one. I'm not sure I understand the implementation well enough to figure out how to unpack and return the weight matrix in CUDA. If you happen to have played with that and have code you could share, that would be helpful, but no worries if not.
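For anyone following along, here is a rough idea of what the 2-bit unpacking looks like in plain PyTorch rather than CUDA. This is only a sketch: `qweight`, `scales`, and `zeros` follow a GPTQ-style packed layout, but the exact storage format in this repo may differ, and real kernels typically keep the zeros packed as well.

```python
import torch

def unpack_2bit(qweight, scales, zeros, groupsize=128):
    # Sketch: qweight is int32 of shape (in_features // 16, out_features),
    # holding 16 two-bit values per int32. scales / zeros are float tensors
    # of shape (in_features // groupsize, out_features).
    bits = 2
    shifts = torch.arange(0, 32, bits, device=qweight.device).view(1, -1, 1)
    # (rows, 1, out) >> (1, 16, 1) -> (rows, 16, out), then mask to 2 bits.
    unpacked = (qweight.unsqueeze(1) >> shifts) & 0x3
    unpacked = unpacked.reshape(-1, qweight.shape[1])        # (in_features, out_features)
    # Map each row of the full matrix to its quantization group and dequantize.
    group_idx = torch.arange(unpacked.shape[0], device=qweight.device) // groupsize
    w = (unpacked.float() - zeros[group_idx]) * scales[group_idx]
    return w.half()
```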
Not sure about the reason, but maybe it is because of the difference in matrix multiplication. I first reconstruct the matrix to float16 and use torch.matmul for the matrix multiplication (but only in the case where batch_size * seq_len > 8), which is different from both the original CUDA kernel and the Triton kernel.
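In pseudocode, the dispatch described above looks roughly like this (a sketch only; `dequantize` and `fused_kernel` are caller-supplied placeholders for the repo's actual unpacking code and quantized matmul kernel):

```python
import torch

def quant_linear_forward(x, qweight, scales, zeros, dequantize, fused_kernel, bias=None):
    # Sketch of the dispatch described above.
    batch_size, seq_len, _ = x.shape
    if batch_size * seq_len > 8:
        # Larger inputs: reconstruct the full float16 weight once and
        # let torch.matmul (cuBLAS) do the work.
        w = dequantize(qweight, scales, zeros)   # (in_features, out_features), fp16
        out = torch.matmul(x.to(w.dtype), w)
    else:
        # Small inputs (e.g. single-token decoding): call the fused quantized kernel.
        out = fused_kernel(x, qweight, scales, zeros)
    if bias is not None:
        out = out + bias
    return out
```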
Yes, 3-bit seems to be more complicated than 4-bit and 2-bit...
Hey @johnsmith0031, thank you for this great repo! I was wondering whether you have tried implementing the backward pass for 2-bit or 3-bit quantized models?
I would really like to try it as an experiment. If you have any existing work on 2-bit or 3-bit autograd, I would love to contribute to it and submit a PR to this repo. Or, as a first step, I could run it and share experimental results.
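If it helps as a starting point, one way to prototype the backward pass is a custom autograd function that dequantizes the frozen packed weight in both directions and only returns a gradient with respect to the input, leaving the trainable parameters (e.g. LoRA adapters) outside the function. A rough sketch, with `dequantize` passed in as a placeholder for the actual 2-bit/3-bit unpacking:

```python
import torch

class QuantLinearFunction(torch.autograd.Function):
    # Sketch of a backward pass for a frozen, quantized linear layer.
    # `dequantize` is a caller-supplied function standing in for the
    # actual 2-bit / 3-bit unpacking code.

    @staticmethod
    def forward(ctx, x, qweight, scales, zeros, dequantize):
        w = dequantize(qweight, scales, zeros)   # (in_features, out_features), fp16
        ctx.save_for_backward(qweight, scales, zeros)
        ctx.dequantize = dequantize
        return x @ w

    @staticmethod
    def backward(ctx, grad_output):
        qweight, scales, zeros = ctx.saved_tensors
        w = ctx.dequantize(qweight, scales, zeros)
        grad_input = grad_output @ w.t()         # gradient w.r.t. the input only
        # The packed weights stay frozen, so no gradient is returned for them;
        # trainable parameters (e.g. LoRA adapters) live outside this function.
        return grad_input, None, None, None, None
```

Usage would be roughly `QuantLinearFunction.apply(x, qweight, scales, zeros, unpack_fn)`, where `unpack_fn` is whatever dequantization routine the layer already uses for its forward pass.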