From 26cd3c8f66436d8c039a600954f3cbe6593cf335 Mon Sep 17 00:00:00 2001
From: yiliu30
Date: Wed, 8 Nov 2023 07:52:28 +0800
Subject: [PATCH 1/4] update docs

Signed-off-by: yiliu30
---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 0056016..8cda517 100644
--- a/README.md
+++ b/README.md
@@ -79,7 +79,8 @@ model = GenericLoraKbitModel('tiiuae/falcon-7b')
 # Run the fine-tuning
 model.finetune(dataset)
 ```
-4. __CPU inference__ - Now you can use just your CPU for inference of any LLM. _CAUTION : The inference process may be sluggish because CPUs lack the required computational capacity for efficient inference_.
+4. __CPU inference__ - Now you can use just your CPU for inference of any LLM. For the CPU-only devices, we integrated [Itrex](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and accelerate the inference by leveraging its highly optimized kernel on Intel platforms.
+
 5. __Batch integration__ - By tweaking the 'batch_size' in the .generate() and .evaluate() functions, you can expedite results. Using a 'batch_size' greater than 1 typically enhances processing efficiency.
 ```python
 # Make the necessary imports

From 671b324745c409406c2a39af69e3f7f76327a9f1 Mon Sep 17 00:00:00 2001
From: yiliu30
Date: Wed, 8 Nov 2023 10:40:26 +0800
Subject: [PATCH 2/4] update docs

Signed-off-by: yiliu30
---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 8cda517..9699fd0 100644
--- a/README.md
+++ b/README.md
@@ -79,7 +79,8 @@ model = GenericLoraKbitModel('tiiuae/falcon-7b')
 # Run the fine-tuning
 model.finetune(dataset)
 ```
-4. __CPU inference__ - Now you can use just your CPU for inference of any LLM. For the CPU-only devices, we integrated [Itrex](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and accelerate the inference by leveraging its highly optimized kernel on Intel platforms.
+
+4. __CPU inference__ - The CPU, including notebook CPUs, is now fully equipped to handle LLM inference. We integrated [Itrex](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and accelerate the inference by leveraging its highly optimized kernel on Intel platforms.
 
 5. __Batch integration__ - By tweaking the 'batch_size' in the .generate() and .evaluate() functions, you can expedite results. Using a 'batch_size' greater than 1 typically enhances processing efficiency.
 ```python

From 9129ef363bcdfd4fdae2148c40808d7468559d2b Mon Sep 17 00:00:00 2001
From: yiliu30
Date: Wed, 8 Nov 2023 12:03:14 +0800
Subject: [PATCH 3/4] fix typo

Signed-off-by: yiliu30
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9699fd0..26ce59c 100644
--- a/README.md
+++ b/README.md
@@ -80,7 +80,7 @@ model = GenericLoraKbitModel('tiiuae/falcon-7b')
 model.finetune(dataset)
 ```
 
-4. __CPU inference__ - The CPU, including notebook CPUs, is now fully equipped to handle LLM inference. We integrated [Itrex](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and accelerate the inference by leveraging its highly optimized kernel on Intel platforms.
+4. __CPU inference__ - The CPU, including laptop CPUs, is now fully equipped to handle LLM inference. We integrated [Itrex](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and accelerate the inference by leveraging its highly optimized kernel on Intel platforms.
 
 5. __Batch integration__ - By tweaking the 'batch_size' in the .generate() and .evaluate() functions, you can expedite results. Using a 'batch_size' greater than 1 typically enhances processing efficiency.
 ```python

From 7b4ff6eb903efa29267f553f88b5a1419bf6ce4a Mon Sep 17 00:00:00 2001
From: yiliu30
Date: Wed, 8 Nov 2023 15:36:28 +0800
Subject: [PATCH 4/4] add sample code

Signed-off-by: yiliu30
---
 README.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 26ce59c..a1b857f 100644
--- a/README.md
+++ b/README.md
@@ -80,7 +80,20 @@ model = GenericLoraKbitModel('tiiuae/falcon-7b')
 model.finetune(dataset)
 ```
 
-4. __CPU inference__ - The CPU, including laptop CPUs, is now fully equipped to handle LLM inference. We integrated [Itrex](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and accelerate the inference by leveraging its highly optimized kernel on Intel platforms.
+4. __CPU inference__ - The CPU, including laptop CPUs, is now fully equipped to handle LLM inference. We integrated [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and accelerate the inference by leveraging its highly optimized kernel on Intel platforms.
+
+```python
+# Make the necessary imports
+from xturing.models import BaseModel
+
+# Initializes the model: quantize the model with weight-only algorithms
+# and replace the linear with Itrex's qbits_linear kernel
+model = BaseModel.create("llama2_int8")
+
+# Once the model has been quantized, do inferences directly
+output = model.generate(texts=["Why LLM models are becoming so important?"])
+print(output)
+```
 
 5. __Batch integration__ - By tweaking the 'batch_size' in the .generate() and .evaluate() functions, you can expedite results. Using a 'batch_size' greater than 1 typically enhances processing efficiency.
 ```python
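
Note: the batch-integration item that appears as unchanged context in the hunks above tunes 'batch_size' in the .generate() and .evaluate() functions. Below is a minimal sketch of how that could combine with the quantized model added in PATCH 4/4, assuming batch_size is accepted as a keyword argument of generate() alongside texts (the prompts and batch size shown here are illustrative, not taken from the patches).

```python
# Sketch only: batched generation with the weight-only quantized model
# registered as "llama2_int8" in the sample code from PATCH 4/4.
from xturing.models import BaseModel

# Quantize the model with weight-only algorithms and load it for CPU inference
model = BaseModel.create("llama2_int8")

# A batch_size greater than 1 lets .generate() process several prompts per pass,
# which typically improves throughput (the value chosen here is illustrative)
outputs = model.generate(
    texts=[
        "Why LLM models are becoming so important?",
        "Summarize weight-only quantization in one sentence.",
    ],
    batch_size=2,
)
print(outputs)
```

The same 'batch_size' knob would apply to .evaluate() when scoring a dataset, per the README text quoted in the diffs.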