Move quantization after LoRA surgery #828
Conversation
```python
linear_q4bit.bias = torch.nn.Parameter(bias_data)

# Replace the original linear layer with the quantized one
setattr(parent_module, layer_name, linear_q4bit)
```
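The snippet above replaces a float linear layer with a quantized one in place. A minimal sketch of what the full replacement could look like, assuming bitsandbytes as the 4-bit backend; the helper name `replace_with_4bit` and the `nf4` quantization type are our assumptions, not taken from this PR:

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

def replace_with_4bit(parent_module: nn.Module, layer_name: str) -> None:
    """Swap a float nn.Linear for a bitsandbytes 4-bit linear, in place."""
    linear = getattr(parent_module, layer_name)

    linear_q4bit = bnb.nn.Linear4bit(
        linear.in_features,
        linear.out_features,
        bias=linear.bias is not None,
        compute_dtype=torch.float16,
        quant_type="nf4",  # assumption: NF4 quantization, as in QLoRA
    )
    # Carry the pretrained weights over; bitsandbytes performs the actual
    # quantization when the module is moved to a CUDA device.
    linear_q4bit.weight = bnb.nn.Params4bit(
        linear.weight.data, requires_grad=False, quant_type="nf4"
    )
    if linear.bias is not None:
        linear_q4bit.bias = torch.nn.Parameter(linear.bias.data)

    # Replace the original linear layer with the quantized one.
    setattr(parent_module, layer_name, linear_q4bit)
```

Because the quantization only happens on the device move, the replacement itself is cheap and can be applied across the whole model before transferring it to the GPU.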
Hi @caroteu,

I see. I checked it out, and I realized there were some minor details I missed as well (e.g. quantizing only the linear layers for the A and B matrices). With the latest updates, training with mixed precision no longer works for me (let me know if that's not the case for you). Regarding the inference loading scripts, we should probably discuss them on Monday; I am a bit skeptical about the current additions!
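For the detail mentioned above (quantizing only the linear layers of the A and B matrices), a sketch of how those layers could be selected, assuming the LoRA surgery registers them as `nn.Linear` children under attribute names like `lora_a`/`lora_b` (hypothetical names; the actual implementation may register them differently):

```python
import torch.nn as nn

# Hypothetical attribute names for the LoRA factors; the actual surgery
# code may register them under different names.
LORA_ATTRS = ("lora_a", "lora_b")

def iter_lora_linears(model: nn.Module):
    """Yield (parent_module, attr_name) for every LoRA A/B linear layer."""
    for module in model.modules():
        for attr, child in module.named_children():
            if isinstance(child, nn.Linear) and attr.lower() in LORA_ATTRS:
                yield module, attr
```

Each yielded pair can then be handed to a replacement helper such as the `replace_with_4bit` sketch above.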
Okay, I think I figured it out. The issue is that the modules used in the forward and backward passes have to align with each other, and the safest way to ensure this is to use the right modules in the forward pass.

And thanks to @caroteu, we have a remarkable improvement in memory efficiency now! 🎉 I'll push my commits after confirming a few epochs!

PS: We should still discuss the inference changes you made, because I think we don't need them. As mentioned, let's talk about the details on Monday!
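A sketch of what this alignment constraint means in practice, under the assumption that LoRA is implemented as a wrapper module (class and attribute names here are illustrative, not from this PR): when the forward pass routes through the registered submodules, the graph autograd records for the backward pass uses exactly the same modules.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper; names are assumptions, not from this PR.

    Both branches of the forward pass go through registered submodules,
    so forward and backward traverse the same graph: the frozen (possibly
    quantized) base layer plus the trainable low-rank path.
    """

    def __init__(self, base: nn.Module, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.base = base                                          # e.g. a 4-bit linear, frozen
        self.lora_a = nn.Linear(in_features, rank, bias=False)    # trainable
        self.lora_b = nn.Linear(rank, out_features, bias=False)   # trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))
```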
I'll drop this, since we have cleared up a few points of confusion. We will finalize QLoRA in another PR!
@anwai98 I noticed that with the way we were doing quantization, we did not quantize the LoRA matrices (the A and B matrices). Was that intended? If not, this should be fixed by applying the quantization after the LoRA surgery.

I also added code to initialize the quantized model for inference.
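Putting the pieces together, a sketch of the proposed order of operations and of the inference-side initialization, reusing the hypothetical helpers from the snippets above (`add_lora_layers` is a placeholder for whatever the actual surgery function is):

```python
from typing import Optional

import torch
import torch.nn as nn

def build_qlora_model(
    base_model: nn.Module,
    rank: int = 8,
    checkpoint_path: Optional[str] = None,
    device: str = "cuda",
) -> nn.Module:
    """Hypothetical composition: LoRA surgery first, quantization second."""
    add_lora_layers(base_model, rank=rank)        # 1. LoRA surgery (placeholder name)
    for parent, attr in iter_lora_linears(base_model):
        replace_with_4bit(parent, attr)           # 2. quantize after the surgery
    base_model.to(device)                         # bitsandbytes quantizes on the device move

    if checkpoint_path is not None:
        # For inference: build the same quantized structure *before*
        # loading the trained state, so keys and shapes match.
        state = torch.load(checkpoint_path, map_location=device)
        base_model.load_state_dict(state)
    return base_model
```

Doing the surgery before quantization means the A/B linear layers already exist when the quantization pass walks the model, which is the fix proposed here; for inference, the same construction has to run before `load_state_dict` so the quantized parameter layout matches the checkpoint.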