inference_with_transformers_en
We provide two scripts to use the native Transformers for inference: a command-line interface and a web graphical interface.
The following takes loading the Chinese-LLaMA-2-7B/Chinese-Alpaca-2-7B model as an example:
python scripts/inference/inference_hf.py \
--base_model path_to_original_llama_2_hf_dir \
--lora_model path_to_chinese_llama2_or_alpaca2_lora \
--with_prompt \
--interactive
If you have already merged the models with merge_llama2_with_chinese_lora_low_mem.py, you don't need to specify --lora_model:
python scripts/inference/inference_hf.py \
--base_model path_to_merged_llama2_or_alpaca2_hf_dir \
--with_prompt \
--interactive
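If you have not merged the weights yet, a merge invocation might look like the sketch below. This is only an illustrative sketch: the script location and the --output_type / --output_dir flag names are assumptions here, so check the documentation of merge_llama2_with_chinese_lora_low_mem.py before running it.
python scripts/merge_llama2_with_chinese_lora_low_mem.py \
--base_model path_to_original_llama_2_hf_dir \
--lora_model path_to_chinese_llama2_or_alpaca2_lora \
--output_type huggingface \
--output_dir path_to_merged_llama2_or_alpaca2_hf_dir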
Parameter description:
- --base_model {base_model}: Directory containing the LLaMA-2 model weights and configuration files in HF format.
- --lora_model {lora_model}: Directory of the decompressed Chinese LLaMA-2/Alpaca-2 LoRA files, or the 🤗Model Hub model name. If this parameter is not provided, only the model specified by --base_model will be loaded.
- --tokenizer_path {tokenizer_path}: Directory containing the corresponding tokenizer. If this parameter is not provided, its default value is the same as --lora_model; if --lora_model is not provided either, its default value is the same as --base_model.
- --with_prompt: Whether to wrap the input with the prompt template. If you are loading an Alpaca-2 model, be sure to enable this option!
- --interactive: Launch interactively for multiple single-turn question-answer sessions (this is not the contextual dialogue of llama.cpp).
- --data_file {file_name}: In non-interactive mode, read the content of file_name line by line for prediction (see the example after this list).
- --predictions_file {file_name}: In non-interactive mode, write the predicted results in JSON format to file_name.
- --use_cpu: Use only the CPU for inference.
- --gpus {gpu_ids}: The GPU id(s) to use, default 0. You can specify multiple GPUs, for instance 0,1,2.
- --alpha {alpha}: The coefficient of the NTK scaling method, which can effectively increase the maximum context size. The default value is 1. If you do not know how to set this parameter, leave it at the default, or set it to "auto".
- --load_in_8bit: Load the model in 8-bit mode.
- --system_prompt {system_prompt}: Set the system prompt. The default value is the string in alpaca-2.txt.
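For example, a non-interactive batch run that reads questions line by line and writes JSON predictions might look like this (a minimal sketch; samples.txt and predictions.json are placeholder file names, and the other flags are described above):
python scripts/inference/inference_hf.py \
--base_model path_to_merged_llama2_or_alpaca2_hf_dir \
--with_prompt \
--gpus 0,1 \
--load_in_8bit \
--alpha auto \
--data_file samples.txt \
--predictions_file predictions.json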
The web graphical interface, launched with gradio_demo.py, starts a web frontend page for interaction and supports multi-turn conversations. In addition to Transformers, you also need to install Gradio and mdtex2html:
pip install gradio
pip install mdtex2html
The launch command:
python scripts/inference/gradio_demo.py \
--base_model path_to_original_llama_2_hf_dir \
--lora_model path_to_chinese_alpaca2_lora
If you have already merged the LoRA weights with merge_llama2_with_chinese_lora_low_mem.py, you don't need to specify --lora_model:
python scripts/inference/gradio_demo.py --base_model path_to_merged_alpaca2_hf_dir
Parameter description:
- --base_model {base_model}: Directory containing the LLaMA-2 model weights and configuration files in HF format.
- --lora_model {lora_model}: Directory of the decompressed Chinese LLaMA-2/Alpaca-2 LoRA files, or the 🤗Model Hub model name. If this parameter is not provided, only the model specified by --base_model will be loaded.
- --tokenizer_path {tokenizer_path}: Directory containing the corresponding tokenizer. If this parameter is not provided, its default value is the same as --lora_model; if --lora_model is not provided either, its default value is the same as --base_model.
- --use_cpu: Use only the CPU for inference.
- --gpus {gpu_ids}: The GPU id(s) to use, default 0. You can specify multiple GPUs, for instance 0,1,2 (see the example after this list).
- --alpha {alpha}: The coefficient of the NTK scaling method, which can effectively increase the maximum context size. The default value is 1. If you do not know how to set this parameter, leave it at the default, or set it to "auto".
- --load_in_8bit: Load the model in 8-bit mode.
- --max_memory: The maximum number of history tokens to keep in the multi-turn dialogue. The default value is 1024.
- --system_prompt {system_prompt}: Set the system prompt. The default value is the string in alpaca-2.txt.
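For example, serving a merged Alpaca-2 model on two GPUs with 8-bit loading and a longer dialogue history might look like this (a minimal sketch using only the parameters described above; 2048 is just an illustrative value):
python scripts/inference/gradio_demo.py \
--base_model path_to_merged_alpaca2_hf_dir \
--gpus 0,1 \
--load_in_8bit \
--max_memory 2048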
Note:
- Due to differences in decoding implementation details between frameworks, this script cannot be guaranteed to reproduce the decoding results of llama.cpp.
- This script is intended for a convenient and quick experience only and has not been optimized for fast inference.
- When running 7B model inference on a CPU, make sure you have 32GB of RAM; when running 7B model inference on a single GPU, make sure the GPU has 16GB of VRAM.