This document introduces:
- The steps to install the TensorRT-LLM quantization toolkit.
- The Python APIs to quantize the models.
The detailed LLM quantization recipes are documented alongside the corresponding model examples.
- If the development environment is a Docker container, launch the container with the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g -it <the docker image with TensorRT-LLM installed> bash
- Install the quantization toolkit (AMMO) and the related dependencies on top of the TensorRT-LLM installation or Docker image.
# Obtain the cuda version from the system. Assuming nvcc is available in path.
cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')
# Obtain the python version from the system.
python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')
# Download and install the AMMO package from the DevZone.
tar -xzf nvidia_ammo-0.3.0.tar.gz
pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl
# Install the additional requirements
cd <this example folder>
pip install -r requirements.txt
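The wheel filename in the `pip install` command is assembled from the CUDA and Python version tags extracted by the two `awk` pipelines. The following is a minimal Python sketch of the same string handling, useful for checking which filename the shell commands will look for. The helper names (`cuda_tag`, `python_tag`, `wheel_name`) and the sample version strings are hypothetical, and the sketch does not confirm that a wheel for any particular CUDA/Python combination is actually shipped in the archive.

```python
import re


def cuda_tag(nvcc_release_line: str) -> str:
    """Return the major+minor CUDA tag (e.g. '122') from the nvcc 'release' line."""
    match = re.search(r"V(\d+)\.(\d+)", nvcc_release_line)
    if match is None:
        raise ValueError("no CUDA version found in nvcc output")
    return match.group(1) + match.group(2)


def python_tag(python_version_output: str) -> str:
    """Return the major+minor Python tag (e.g. '310') from `python3 --version`."""
    match = re.search(r"(\d+)\.(\d+)", python_version_output)
    if match is None:
        raise ValueError("no Python version found")
    return match.group(1) + match.group(2)


def wheel_name(ammo_version: str, nvcc_release_line: str, python_version_output: str) -> str:
    """Assemble the wheel filename using the pattern from the pip command above."""
    cu = cuda_tag(nvcc_release_line)
    py = python_tag(python_version_output)
    return f"nvidia_ammo-{ammo_version}+cu{cu}-cp{py}-cp{py}-linux_x86_64.whl"


# Illustrative inputs only; substitute the output of your own system.
example = wheel_name(
    "0.3.0",
    "Cuda compilation tools, release 12.2, V12.2.140",
    "Python 3.10.12",
)
print(example)
```

If the printed filename does not exist in the extracted `nvidia_ammo-0.3.0` directory, the installed CUDA or Python version likely has no matching wheel in that archive.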
The quantization step uses the quantization toolkit to calibrate the PyTorch model and generate a model config, saved as a JSON file (for the model structure) and npz files (for the model weights) that TensorRT-LLM can parse. The model config includes everything TensorRT-LLM needs to build the TensorRT inference engine, as explained below.
This quantization step may take a long time to finish and requires a large amount of GPU memory. Use a server-grade GPU if a GPU out-of-memory error occurs.
If the model was trained on multiple GPUs with tensor parallelism, the post-training quantization (PTQ) calibration process requires the same number of GPUs as were used for training.
PTQ can be achieved with simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model.
import torch
from transformers import AutoModelForCausalLM

import ammo.torch.quantization as atq

model = AutoModelForCausalLM.from_pretrained("...")

# Select the quantization config, for example, FP8
config = atq.FP8_DEFAULT_CFG

# Prepare the calibration set (calib_set, an iterable of model inputs) and define a forward loop
def forward_loop():
    for data in calib_set:
        model(data)

# PTQ with in-place replacement to quantized modules
with torch.no_grad():
    atq.quantize(model, config, forward_loop)
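For intuition about what the forward loop is for: calibration runs sample data through the model so that value ranges can be observed and turned into scaling factors. The sketch below shows the simplest form of this idea, max calibration, in plain Python. It is not AMMO's implementation, which this document does not describe. The `MaxCalibrator` class is hypothetical, and the scale formula assumes the FP8 E4M3 format, whose largest finite value is 448.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


class MaxCalibrator:
    """Tracks the largest absolute value seen across calibration batches."""

    def __init__(self) -> None:
        self.amax = 0.0

    def observe(self, values) -> None:
        # Update the running absolute maximum with one batch of values.
        batch_max = max((abs(v) for v in values), default=0.0)
        self.amax = max(self.amax, batch_max)

    def scale(self) -> float:
        # Scaling factor that maps the observed range onto the FP8 range.
        return self.amax / FP8_E4M3_MAX


# Toy calibration set: three "batches" of activation values.
calib_set = [[0.5, -2.0, 1.25], [3.5, -0.75, 0.1], [-7.0, 2.0, 0.0]]

calibrator = MaxCalibrator()
for batch in calib_set:
    calibrator.observe(batch)

scale = calibrator.scale()
print(calibrator.amax, scale)
```

Because the scale depends only on the observed range, a small but representative calibration set is usually enough, which is consistent with the 128-512 sample guideline above.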
After the model is quantized, the model config can be stored. The model config files include all the information needed by TensorRT-LLM to generate the deployable engine, including the quantized scaling factors.
The exported model config is stored as:
- A single JSON file recording the model structure and metadata and
- A group of npz files each recording the model on a single tensor parallel rank (model weights, scaling factors per GPU).
The export API is:
import torch
from ammo.torch.export import export_model_config

with torch.inference_mode():
    export_model_config(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gptj, llama or gptnext.
        dtype,  # The exported weights data type as torch.dtype.
        quantization,  # The quantization algorithm applied, e.g. fp8 or int8_sq.
        export_dir,  # The directory where the exported files will be stored.
        inference_gpus,  # The number of GPUs used at inference time for tensor parallelism.
    )
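The `inference_gpus` argument ties into the per-rank npz files described above: with tensor parallelism, each rank holds only a slice of each weight. The following plain-Python sketch illustrates one common way to shard a weight matrix, by splitting its columns evenly across ranks. It is only an illustration of the concept. The `split_columns` helper is hypothetical, and the actual sharding scheme used by the export API is not specified in this document.

```python
def split_columns(weight, num_ranks):
    """Split a row-major matrix (a list of rows) into num_ranks column shards."""
    num_cols = len(weight[0])
    if num_cols % num_ranks != 0:
        raise ValueError("column count must be divisible by the number of ranks")
    width = num_cols // num_ranks
    # Shard r receives columns [r * width, (r + 1) * width) of every row.
    return [
        [row[r * width:(r + 1) * width] for row in weight]
        for r in range(num_ranks)
    ]


# Toy 2x4 weight matrix split across two tensor-parallel ranks.
weight = [[1, 2, 3, 4], [5, 6, 7, 8]]
shards = split_columns(weight, 2)

for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

In this picture, each shard would correspond to the weights recorded in one rank's npz file, alongside that rank's scaling factors.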