
Commit

[Inference] Move quantization code from run_finetune.py to run_quantization.py (#9450)

* 1. Move quantization code from run_finetune.py to run_quantization.py
  2. Remove experimental quantization code in llm/experimental; this code will be merged into PaddleSlim.

* fix experimental/ceval

* update readme and remove useless code

* add test for run_quantization

* fix an incorrect comment

* update qwen2 fp8 quantization config
lixcli authored Nov 22, 2024
1 parent 7bfe5bc commit 9494e9a
Showing 16 changed files with 614 additions and 1,042 deletions.
8 changes: 4 additions & 4 deletions llm/README.md
@@ -228,16 +228,16 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/dpo

 ```shell
 # Reference command for launching PTQ quantization
-python run_finetune.py ./config/llama/ptq_argument.json
+python run_quantization.py ./config/llama/ptq_argument.json

 # Reference command for launching GPTQ quantization
-python run_finetune.py ./config/llama/ptq_argument.json
+python run_quantization.py ./config/llama/gptq_argument.json

 # Reference command for launching W8A8C8 (INT) quantization
-python run_finetune.py ./config/llama/ptq_c8_argument.json
+python run_quantization.py ./config/llama/ptq_c8_argument.json

 # Reference command for launching W8A8 (FP8) quantization
-python run_finetune.py ./config/llama/fp8_ptq_argument.json
+python run_quantization.py ./config/llama/fp8_ptq_argument.json
 ```

 See the [quantization docs](./docs/quantization.md) for more technical details and model quantization usage
3 changes: 1 addition & 2 deletions llm/config/qwen/AdvertiseGen/wfp8afp8_ptq_argument.json
@@ -17,6 +17,5 @@
   "unified_checkpoint": false,
   "smooth": false,
   "weight_quant_method": "abs_max",
-  "act_quant_method": "abs_max",
-  "skip_list_names": ["down_proj"]
+  "act_quant_method": "abs_max"
 }
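Configs written before this change may still carry the removed `skip_list_names` key. A small hypothetical migration helper (not part of the PR; the function name is illustrative) to strip it from old files:

```python
import json

def drop_removed_keys(path: str) -> dict:
    """Strip keys removed from the wfp8afp8 PTQ config (here: skip_list_names)
    and rewrite the file in place."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.pop("skip_list_names", None)  # key removed by this commit
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2, ensure_ascii=False)
    return cfg
```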
10 changes: 5 additions & 5 deletions llm/docs/quantization.md
@@ -67,31 +67,31 @@ python prepare_data_for_ptq.py

 ### 2.3 PTQ quantization

 ```shell
-python run_finetune.py ./config/llama/ptq_argument.json
+python run_quantization.py ./config/llama/ptq_argument.json
 ```

 ### 2.4 GPTQ quantization

 ```shell
-python run_finetune.py ./config/llama/gptq_argument.json
+python run_quantization.py ./config/llama/gptq_argument.json
 ```

 ### 2.5 AWQ quantization

 ```shell
-python run_finetune.py ./config/llama/awq_argument.json
+python run_quantization.py ./config/llama/awq_argument.json
 ```

 ### 2.6 W8A8C8 (INT8) quantization

 ```shell
-python run_finetune.py ./config/llama/ptq_c8_argument.json
+python run_quantization.py ./config/llama/ptq_c8_argument.json
 ```

 ### 2.7 W8A8 (FP8) quantization

 ```shell
-python run_finetune.py ./config/llama/fp8_ptq_argument.json
+python run_quantization.py ./config/llama/fp8_ptq_argument.json
 ```

 ### 2.8 Quantization parameters
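Each of these recipes passes a single JSON argument file to the shared entry point. As an illustration only — the real run_quantization.py relies on PaddleNLP's argument-parsing utilities and accepts many more fields, and the names below are a hypothetical subset — a minimal sketch of consuming such a file:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class QuantArguments:
    """Hypothetical subset of quantization arguments; field names mirror
    the JSON configs shown above, not the full PaddleNLP argument set."""
    weight_quant_method: str = "abs_max"
    act_quant_method: str = "abs_max"
    smooth: bool = False
    unified_checkpoint: bool = False

def load_quant_args(path: str) -> QuantArguments:
    """Read a *_argument.json file, keeping only the fields modeled here."""
    with open(path) as f:
        raw = json.load(f)
    known = {f.name for f in fields(QuantArguments)}
    return QuantArguments(**{k: v for k, v in raw.items() if k in known})
```

Unknown keys (model paths, dataset settings, and so on) are silently ignored in this sketch; the real parser validates them.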
182 changes: 0 additions & 182 deletions llm/experimental/layers/cache_kv.py

This file was deleted.

91 changes: 0 additions & 91 deletions llm/experimental/layers/custom_attention.py

This file was deleted.

