
# Quantization

Quantization refers to techniques that enable lower-precision inference and training by performing computations with fixed-point integers at lower bit widths than floating-point numbers. This typically yields smaller model sizes and faster inference. Quantization is particularly useful in deep learning inference and training, where moving data quickly and reducing bandwidth bottlenecks is critical. Intel is actively working on techniques that use lower numerical precision: training with 16-bit multipliers and inference with 8-bit or 16-bit multipliers. Refer to the Intel article on lower numerical precision inference and training in deep learning.
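As a minimal sketch of the idea, an 8-bit affine quantization maps a floating-point value `x` to an integer `q = round(x / scale) + zero_point` and dequantizes with `x ≈ (q - zero_point) * scale`. The snippet below (plain PyTorch; the tensor values are illustrative, not taken from the tutorials) shows this round trip:

```python
import torch

x = torch.tensor([-1.0, -0.5, 0.0, 0.75, 1.5])  # example fp32 values

# Choose an affine mapping onto the uint8 range [0, 255].
scale = (x.max() - x.min()) / 255.0
zero_point = int(round(-x.min().item() / scale.item()))

# Quantize: q = round(x / scale) + zero_point, clamped to the uint8 range.
q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255).to(torch.uint8)

# Dequantize: x_hat approximates x up to rounding error.
x_hat = (q.to(torch.float32) - zero_point) * scale
print(q, x_hat)
```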

Quantization methods include the following three classes:

- Post-Training Quantization (PTQ)
- Quantization-Aware Training (QAT)
- Dynamic Quantization

Intel® Low Precision Optimization Tool currently supports PTQ and QAT. Using MobileNetV2 as an example, this document provides tutorials for both, along with helper functions for evaluation.
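The tool's tutorials cover its exact APIs; as a rough illustration of what post-training static quantization of MobileNetV2 looks like in plain PyTorch eager mode (the calibration batches here are random placeholders, not the tutorial's dataset):

```python
import torch
from torchvision.models.quantization import mobilenet_v2

# Load a quantization-ready MobileNetV2 and switch to inference mode.
model = mobilenet_v2(pretrained=True)
model.eval()

# PTQ: fuse modules, attach observers, calibrate, then convert to int8.
model.fuse_model()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model, inplace=True)

# Run a few batches so the observers can record activation ranges.
calibration_batches = [torch.randn(1, 3, 224, 224) for _ in range(8)]  # placeholder data
with torch.no_grad():
    for batch in calibration_batches:
        model(batch)

torch.quantization.convert(model, inplace=True)  # int8 model ready for inference
```

QAT follows the same outline but uses `torch.quantization.get_default_qat_qconfig` and `torch.quantization.prepare_qat` around a fine-tuning loop, so the model learns with fake-quantized weights and activations before conversion.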

Dynamic Quantization is currently supported only with the ONNX Runtime backend; refer to dynamic quantization for details.
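For reference, dynamic quantization through ONNX Runtime itself looks roughly like the following (the model paths are placeholders; see the dynamic quantization documentation for the supported options):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are quantized offline, while activations
# are quantized on the fly at inference time, so no calibration data is needed.
quantize_dynamic(
    model_input="mobilenetv2.onnx",        # placeholder path to the fp32 ONNX model
    model_output="mobilenetv2_int8.onnx",  # placeholder path for the quantized model
    weight_type=QuantType.QInt8,
)
```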

Note: These quantization tutorials use PyTorch examples, as permitted by PyTorch's license. Refer to PyTorch for updates.