
LRQuant: Learnable and Robust Post-Training Quantization for Large Language Models (ACL 2024, Oral)

We propose a novel post-training quantization method for large language models with learnable smoothing parameters, a novel loss function, and a test-time adaptation scheme.

Post-training quantization (PTQ) for large language models (LLMs) significantly accelerates model inference and relieves memory constraints without requiring model training. A "smoothing paradigm" is commonly used in LLM quantization, which transfers the quantization difficulty of activations to weights through mathematically equivalent transformations. However, existing methods face two issues: 1) most smoothing parameters are hand-crafted, which leads to suboptimal results; 2) performance degrades significantly when models are tested on unseen datasets. To address these challenges, this paper introduces a robust, learnable smoothing-based PTQ framework called LRQuant. First, we adopt a learnable paradigm to find optimal smoothing parameters, which are initialized by logarithmic activation equivalent. In addition, we empirically find that relying on the MSE loss alone hardly leads to optimal quantization results, so we propose a novel loss function based on the negative logarithm of the cosine similarity (NLC loss) between the outputs of the full-precision and quantized blocks. Finally, we are the first to introduce test-time adaptation (TTA) into LLM quantization, which allows rapid model adaptation during testing to improve generalization. More surprisingly, with our TTA method we can in some cases achieve better results on test sets than by directly using the test sets for calibration, while avoiding catastrophic forgetting.
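
For intuition, the following is a minimal PyTorch sketch of an NLC-style loss as described above: the negative logarithm of the cosine similarity between the outputs of a full-precision block and its quantized counterpart. The function name, the per-token cosine followed by a mean, and the clamping are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def nlc_loss(fp_out: torch.Tensor, quant_out: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative Log of Cosine similarity (NLC) between full-precision and
    quantized block outputs, assumed to have shape (batch, seq_len, hidden)."""
    # Per-token cosine similarity over the hidden dimension.
    cos = F.cosine_similarity(fp_out, quant_out, dim=-1, eps=eps)  # (batch, seq_len)
    # Clamp so the logarithm stays well-defined, then negate and average.
    return -torch.log(cos.clamp_min(eps)).mean()

# Example usage with random tensors standing in for block outputs:
fp = torch.randn(2, 16, 4096)
q = fp + 0.01 * torch.randn_like(fp)  # a slightly perturbed "quantized" output
print(nlc_loss(fp, q))
```

When the quantized output is close in direction to the full-precision output, the cosine similarity approaches 1 and this loss approaches 0.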

Usage

We provide the full scripts to run LRQuant, using LLaMA-7B as an example:

  1. Obtain the channel-wise scales and shifts required for initialization (an illustrative sketch of how such scales act as a smoothing transformation follows the commands below):

python generate_act_scale_shift.py --model /PATH/TO/llama/llama-7b
  2. Weight-activation quantization:

# W4A4 ppl
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/llama/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let

# W4A4 zero-shot
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/llama/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande

# W4A4 tta
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/llama/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --tta
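
As background for step 1 above, the sketch below illustrates, under SmoothQuant-style assumptions, how channel-wise scales and shifts can be folded into a linear layer as a mathematically equivalent transformation: the activations are shifted and divided per channel (making them easier to quantize), and the inverse is absorbed into the weights and bias. This is only an illustration; in LRQuant these smoothing parameters are learnable, and their exact initialization and use are defined in the paper and code. The function `smooth_linear` and the statistics used for `scale` and `shift` are assumptions for the example.

```python
import torch

def smooth_linear(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor,
                  scale: torch.Tensor, shift: torch.Tensor) -> torch.Tensor:
    """Illustrative equivalent transformation of y = x @ W^T + b.

    x:      (batch, seq, in_features) activations
    weight: (out_features, in_features) linear weight
    scale:  (in_features,) per-channel smoothing scale (learnable in LRQuant)
    shift:  (in_features,) per-channel shift
    """
    x_smooth = (x - shift) / scale        # easier-to-quantize activations
    w_smooth = weight * scale             # scale folded into the weights
    b_smooth = bias + shift @ weight.t()  # shift folded into the bias
    return x_smooth @ w_smooth.t() + b_smooth

# Sanity check that the transformation leaves the full-precision output unchanged:
x = torch.randn(1, 8, 16)
w = torch.randn(32, 16)
b = torch.randn(32)
scale = x.abs().amax(dim=(0, 1)).clamp_min(1e-5)  # a simple per-channel statistic
shift = x.mean(dim=(0, 1))
ref = x @ w.t() + b
out = smooth_linear(x, w, b, scale, shift)
print(torch.allclose(ref, out, atol=1e-4))
```

The final `allclose` check should print `True`, confirming that the transformation preserves the full-precision output while reshaping the activation distribution that the quantizer sees.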

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
