
[Feature Request] Support more popular compression algorithms and highly optimized kernels on the CPU #264

Closed
4 tasks done
yiliu30 opened this issue Oct 10, 2023 · 6 comments

Comments

@yiliu30
Contributor

yiliu30 commented Oct 10, 2023

xTuring is known for its efficient and straightforward fine-tuning support for popular LLMs, but it is missing features related to popular compression algorithms, particularly weight-only quantization. These compression methods are widely acknowledged for their efficiency and are commonly adopted in industry. Furthermore, xTuring has limited support for CPU-side optimization.
Our team developed these quantization algorithms in Intel® Neural Compressor and Intel-Extension-for-Transformers, and we would like to integrate them into xTuring. This integration aims to:

  • Offer weight-only quantization algorithms, including RTN, AWQ, and TEQ, with negligible accuracy loss
  • Provide optimized kernels for accelerating inference, especially on Intel CPUs

Usage

from xturing.models import BaseModel

# Specify the quantization configuration
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
woq_config = WeightOnlyQuantConfig(weight_dtype='int8')
model = BaseModel.create("gpt2", quantization_config=woq_config)

# Run inference with ITREX's highly optimized kernels
output = model.generate(texts=["Why are the LLM models important?"])
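
The same entry point could presumably select a lower-bit weight type or a specific algorithm through the config. A hypothetical variation (the weight_dtype='int4' and algorithm='AWQ' arguments are assumptions about the ITREX config options, not confirmed in this issue):

from xturing.models import BaseModel
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

# Hypothetical: request 4-bit weights quantized with AWQ instead of the int8 default.
# Exact argument names may differ between intel_extension_for_transformers versions.
woq_config = WeightOnlyQuantConfig(weight_dtype='int4', algorithm='AWQ')
model = BaseModel.create("gpt2", quantization_config=woq_config)

output = model.generate(texts=["Why are the LLM models important?"])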

Supported Scope

All currently supported models.

Plan

  • Initial draft PR supporting gpt2 for a quick review of the implementation details
  • Extend support to all models
  • UT and CI
  • Doc

@StochasticRomanAgeev @tushar2407

@StochasticRomanAgeev
Contributor

Hi @yiliu30,
Thanks for the PR and your interest!
First question: what does this approach improve over our already supported int8 version of the models?

@yiliu30
Contributor Author

yiliu30 commented Oct 30, 2023

Hi @yiliu30, Thanks for the PR and your interest! First question: what does this approach improve over our already supported int8 version of the models?

Hi @StochasticRomanAgeev, thanks for your reply. There are several improvements:

@StochasticRomanAgeev
Contributor

Thanks for the PR!
I have a small change request for you:

  • We need to add intel_extension_for_transformers as an optional dependency.
  • The way we want to use it: inside the int8 model, check whether we are in CPU mode, and if so do the import and use this config. This is because we want to minimize the additional steps users need to take to use your extension.

@StochasticRomanAgeev
Contributor

I am talking about this if branch; you just need to integrate your code there.
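
For illustration, a minimal sketch of such a branch, assuming the CPU check is done via torch.cuda.is_available() (the helper name and structure below are hypothetical, not xTuring's actual int8 code path):

import torch
from xturing.models import BaseModel

def create_int8_model(model_name: str):
    # Hypothetical helper mirroring the suggested integration point.
    if not torch.cuda.is_available():
        # CPU mode: import ITREX lazily so it stays an optional dependency.
        from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
        woq_config = WeightOnlyQuantConfig(weight_dtype='int8')
        return BaseModel.create(model_name, quantization_config=woq_config)
    # Otherwise keep the existing GPU int8 path unchanged.
    return BaseModel.create(model_name)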

@yiliu30
Contributor Author

yiliu30 commented Nov 1, 2023

Thanks for the PR! I have a small change request for you:

  • We need to add intel_extension_for_transformers as an optional dependency.
  • The way we want to use it: inside the int8 model, check whether we are in CPU mode, and if so do the import and use this config. This is because we want to minimize the additional steps users need to take to use your extension.

Thanks, we agree with your suggestion and will work on a new PR for it soon :)

@yiliu30
Contributor Author

yiliu30 commented Nov 1, 2023

I am talking about this if branch; you just need to integrate your code there.

Hi @StochasticRomanAgeev, #268 is an initial implementation following your suggestion; please take your time to review it.
