
[Feature Request] Support more popular compression algorithms and highly optimized kernels on the CPU #264

Closed
4 tasks done
yiliu30 opened this issue Oct 10, 2023 · 6 comments

Comments

@yiliu30
Contributor

yiliu30 commented Oct 10, 2023

xTuring is known for its efficient and straightforward fine-tuning support for popular LLMs, but it is missing features related to popular compression algorithms, particularly weight-only quantization. These compression methods are widely acknowledged for their efficiency and are commonly adopted in industry. Furthermore, xTuring has limited support for CPU-side optimization.
Our team developed these quantization algorithms in Intel® Neural Compressor and Intel-Extension-for-Transformers, and we would like to integrate them into xTuring. This integration aims to:

  • Offer weight-only quantization algorithms, including RTN, AWQ, and TEQ, with negligible accuracy loss
  • Provide optimized kernels for accelerating inference, especially on Intel CPUs

Usage

from xturing.models import BaseModel

# Specify the quantization configuration
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
woq_config = WeightOnlyQuantConfig(weight_dtype='int8')
model = BaseModel.create("gpt2", quantization_config=woq_config)

# Run inference with ITREX's highly optimized kernels
output = model.generate(texts=["Why are the LLM models important?"])
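
The same entry point could presumably select a lower-bit weight type or a specific algorithm through the config. A hypothetical variation (the weight_dtype='int4' and algorithm='AWQ' arguments are assumptions about the ITREX config options, not confirmed in this issue):

from xturing.models import BaseModel
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

# Hypothetical: request 4-bit weights quantized with AWQ instead of the int8 default.
# Exact argument names may differ between intel_extension_for_transformers versions.
woq_config = WeightOnlyQuantConfig(weight_dtype='int4', algorithm='AWQ')
model = BaseModel.create("gpt2", quantization_config=woq_config)

output = model.generate(texts=["Why are the LLM models important?"])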

Supported Scope

All currently supported models.

Plan

  • Initial draft PR supporting gpt2 for a quick review of the implementation details
  • Extend support to all models
  • UT and CI
  • Doc

@StochasticRomanAgeev @tushar2407

@StochasticRomanAgeev
Contributor

Hi @yiliu30,
Thanks for the PR and your interest!
First question: what does this approach improve over our already supported int8 version of the models?

@yiliu30
Contributor Author

yiliu30 commented Oct 30, 2023

Hi @yiliu30, Thanks for the PR and your interest! First question: what does this approach improve over our already supported int8 version of the models?

Hi @StochasticRomanAgeev, thanks for your reply. There are several improvements:

@StochasticRomanAgeev
Contributor

Thanks for the PR!
I have a small change request for you:

  • We need to add intel_extension_for_transformers as an optional dependency.
  • The way we want to use it: inside the int8 model, check whether we are in CPU mode, and if so do the import and use this config. This is because we want to minimize the additional steps users need to take to use your extension.

@StochasticRomanAgeev
Contributor

I am talking about this if branch; you just need to integrate your code there.
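
For illustration, a minimal sketch of such a branch, assuming the CPU check is done via torch.cuda.is_available() (the helper name and structure below are hypothetical, not xTuring's actual int8 code path):

import torch
from xturing.models import BaseModel

def create_int8_model(model_name: str):
    # Hypothetical helper mirroring the suggested integration point.
    if not torch.cuda.is_available():
        # CPU mode: import ITREX lazily so it stays an optional dependency.
        from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
        woq_config = WeightOnlyQuantConfig(weight_dtype='int8')
        return BaseModel.create(model_name, quantization_config=woq_config)
    # Otherwise keep the existing GPU int8 path unchanged.
    return BaseModel.create(model_name)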

@yiliu30
Contributor Author

yiliu30 commented Nov 1, 2023

Thanks for the PR! I have a small change request for you:

  • We need to add intel_extension_for_transformers as an optional dependency.
  • The way we want to use it: inside the int8 model, check whether we are in CPU mode, and if so do the import and use this config. This is because we want to minimize the additional steps users need to take to use your extension.

Thanks, we agree with your suggestion and will work on a new PR for it soon :)

@yiliu30
Contributor Author

yiliu30 commented Nov 1, 2023

I am talking about this if branch; you just need to integrate your code there.

Hi @StochasticRomanAgeev, #268 is an initial implementation following your suggestion; please take your time to review it.
