We propose LRQuant, a novel post-training quantization method for large language models with learnable smoothing parameters, a novel loss function, and a test-time adaptation scheme.
Post-training quantization (PTQ) for large language models (LLMs) significantly accelerates model inference and relieves memory constraints without requiring model retraining. A "smoothing paradigm" is commonly used in LLM quantization, which transfers the quantization difficulty from activations to weights via mathematically equivalent transformations. However, existing methods face two issues: 1) most smoothing parameters are hand-crafted, which leads to suboptimal results; 2) performance degrades significantly on unseen datasets. To address these challenges, this paper introduces a robust learnable smoothing-based PTQ framework, called LRQuant. First, we adopt a learnable paradigm to find optimal smoothing parameters, which are initialized by the logarithmic activation equivalent. In addition, we empirically find that relying on the MSE loss alone hardly leads to optimal quantization results, so we propose a novel loss based on the negative logarithm of cosine similarity (NLC loss) between the outputs of the full-precision and quantized blocks. Finally, we are the first to introduce test-time adaptation (TTA) into LLM quantization, which allows rapid model adaptation during testing to improve generalization. More surprisingly, with our TTA method we can, in some cases, achieve better results on test sets than directly using the test sets for calibration, while avoiding catastrophic forgetting.
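For concreteness, here is a minimal PyTorch sketch of the NLC loss described above, computed between the outputs of a full-precision block and its quantized counterpart. The helper names and the way the NLC term is combined with the MSE term are illustrative assumptions, not the exact implementation in this repo:

```python
import torch
import torch.nn.functional as F

def nlc_loss(fp_out: torch.Tensor, quant_out: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative logarithm of cosine similarity (NLC) between the outputs of the
    full-precision block and the quantized block, averaged over tokens."""
    # Flatten everything except the hidden dimension so each row is one token.
    fp = fp_out.reshape(-1, fp_out.shape[-1])
    q = quant_out.reshape(-1, quant_out.shape[-1])
    cos = F.cosine_similarity(fp, q, dim=-1)   # per-token similarity in [-1, 1]
    cos = cos.clamp(min=eps)                   # avoid log of non-positive values
    return -torch.log(cos).mean()

# Illustrative block-wise objective: NLC combined with the usual MSE reconstruction
# term (the weighting here is an assumption for this sketch).
def block_loss(fp_out, quant_out, mse_weight: float = 1.0):
    return nlc_loss(fp_out, quant_out) + mse_weight * F.mse_loss(quant_out, fp_out)
```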
We provide full scripts to run LRQuant. We use LLaMA-7B as an example here:
- Obtain the channel-wise scales and shifts required for initialization:
python generate_act_scale_shift.py --model /PATH/TO/llama/llama-7b
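Conceptually, this step runs the full-precision model on a small calibration set and records per-channel activation statistics (scales and shifts). The sketch below is a hedged stand-in for what generate_act_scale_shift.py does, using PyTorch forward hooks; the exact statistics computed and the storage format in the actual script may differ:

```python
import torch

@torch.no_grad()
def collect_act_stats(model, dataloader, device="cuda"):
    """Record per-channel max |activation| (scale) and mean (shift) for every
    nn.Linear input; an illustrative stand-in for generate_act_scale_shift.py."""
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().float().reshape(-1, inputs[0].shape[-1])
            amax = x.abs().max(dim=0).values
            mean = x.mean(dim=0)
            if name not in stats:
                stats[name] = {"scale": amax, "shift": mean}
            else:
                stats[name]["scale"] = torch.maximum(stats[name]["scale"], amax)
                # Simple running average across batches (illustrative only).
                stats[name]["shift"] = 0.5 * (stats[name]["shift"] + mean)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval().to(device)
    for input_ids in dataloader:        # assumes batches of token-id tensors
        model(input_ids.to(device))

    for h in hooks:
        h.remove()
    return stats
```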
- Weight-activation quantization
# W4A4 ppl
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/llama/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let
# W4A4 zero-shot
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/llama/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
# W4A4 tta
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/llama/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --tta
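As a rough illustration of what the --tta flag enables, the sketch below adapts only the learnable quantization parameters on an incoming test batch, treating fp_model and quant_model as block-level callables that return hidden-state tensors. The parameter selection, optimizer, and update schedule are assumptions for illustration, not the exact LRQuant procedure:

```python
import torch
import torch.nn.functional as F

def tta_adapt(quant_model, fp_model, batch, lr=1e-4, steps=1):
    """Illustrative test-time adaptation: briefly update only the learnable
    quantization parameters so quantized outputs track full-precision ones."""
    # Placeholder filter: which parameters count as "learnable quantization
    # parameters" (e.g. smoothing scales, clipping factors) is an assumption.
    params = [p for n, p in quant_model.named_parameters()
              if "smooth" in n or "clip" in n]
    optimizer = torch.optim.AdamW(params, lr=lr)

    for _ in range(steps):
        with torch.no_grad():
            fp_out = fp_model(batch)            # frozen full-precision reference
        q_out = quant_model(batch)
        cos = F.cosine_similarity(
            fp_out.reshape(-1, fp_out.shape[-1]),
            q_out.reshape(-1, q_out.shape[-1]), dim=-1).clamp(min=1e-8)
        loss = -torch.log(cos).mean()           # NLC loss, as sketched earlier
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return quant_model
```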
Related works:
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models